This week's 7-day Smoke Quick Test (10 questions per run) reveals clear polarization: Wenxin Yiyan 4.5 surged from 32.63 to 86.05 (trend +53.4), the biggest dark horse of the week, while GPT-o3 slid from 91.81 to 84.03 (trend -7.8), the steepest decline among all mainstream models.
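The reported trend figures are consistent with a simple final-day-minus-first-day calculation (86.05 - 32.63 = +53.4; 84.03 - 91.81 = -7.8). A minimal sketch of that derivation follows; the intermediate daily scores below are illustrative placeholders, not the actual run data, and the report's "volatility" formula is not specified, so none is attempted here:

```python
# Sketch of how the weekly "trend" figures appear to be derived:
# trend = final-day score minus first-day score, rounded to one decimal.
# Only the first- and last-day values come from the report; the days
# in between are made-up placeholders.

def trend(daily_scores):
    """Final-day score minus first-day score."""
    return round(daily_scores[-1] - daily_scores[0], 1)

wenxin_45 = [32.63, 41.0, 52.0, 60.0, 71.0, 80.0, 86.05]  # endpoints from report
gpt_o3 = [91.81, 91.0, 90.5, 89.0, 87.0, 85.5, 84.03]  # endpoints from report

print(trend(wenxin_45))  # 53.4, matching the reported +53.4
print(trend(gpt_o3))     # -7.8, matching the reported -7.8
```

Under this reading, "trend" ignores the shape of the week entirely, which is why a high-volatility model like Wenxin can post the largest trend despite erratic daily scores.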
Rising Camp: Three Models Break Through Against the Trend
Claude Sonnet 4.6 and Doubao Pro tied for second place, each posting a trend of +23.9 and a final score of 86.05. Sonnet averaged 84.5 with volatility of 36.1, suggesting its reasoning chain is gradually stabilizing across consecutive runs. Doubao Pro averaged 81.6 with volatility of 31.3, showing its execution dimension in Chinese-language scenarios is catching up quickly.
The breakout of Wenxin Yiyan 4.5 is the most noteworthy. It scored only 32.63 on day one, yet finished the week on par with the top models. Its average of 57.2, however, comes with extremely high volatility of 57, indicating wildly uneven daily performance: on certain question types it likely swings between cliff-like extremes of "all correct" and "all wrong."
Declining Camp: GPT-o3 and Grok Have the Biggest Problems
GPT-o3 has an average of only 80.7, volatility of 29.2, and a trend of -7.8, the largest decline of any model. Its first-day advantage of 91.81 began to erode after day 4, suggesting a systematic loosening in its material-constraint dimension. Grok 4 averages as low as 61.5, and its volatility of 79.2 is the highest of all. Its integrity rating even showed two consecutive fail days, indicating that its response consistency has severely broken down.
Claude Opus 4.7 and Gemini 3.1 Pro remain at high levels but carry trends of -3.4 and -4.3 respectively, and the gap between the two (averages of 88.8 and 83.6) is narrowing. Qwen3 Max was comparatively steady, with the lowest volatility of 13.9, but its trend is still -1.3, so it failed to hold its initial advantage.
Integrity Rating Becomes Key Risk Signal
Over the 7 days, 6 models recorded a fail or warn status at least once. Grok 4 failed on two consecutive days, Gemini 3.1 Pro and DeepSeek V4 Pro each logged one fail, and GPT-5.5 posted three consecutive warns near the end of the week. These shifts are not accidental; they reflect a marked decline in grounding capability under sustained quick testing.
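Streaks like Grok 4's back-to-back fails or GPT-5.5's three consecutive warns are simple to flag programmatically. A minimal sketch, where the 7-day status sequences are illustrative stand-ins rather than the actual run data:

```python
from itertools import groupby

def longest_flagged_streak(statuses, flagged=("fail", "warn")):
    """Length of the longest run of consecutive flagged (fail/warn) days."""
    # Collapse the sequence into runs of flagged / not-flagged days,
    # then take the longest flagged run.
    runs = groupby(status in flagged for status in statuses)
    return max((len(list(g)) for is_flagged, g in runs if is_flagged), default=0)

# Illustrative 7-day sequences (not the actual run data).
grok_4 = ["pass", "pass", "fail", "fail", "pass", "pass", "pass"]
gpt_55 = ["pass", "pass", "pass", "pass", "warn", "warn", "warn"]

print(longest_flagged_streak(grok_4))  # 2
print(longest_flagged_streak(gpt_55))  # 3
```

Grouping on the boolean flag rather than the raw status means a warn day immediately followed by a fail day still counts as one continuous problem streak.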
Particularly concerning, some models' scores did not recover in step once their integrity ratings improved. This suggests warn/fail is not mere data noise, but a real degradation of underlying capability in specific scenarios.
Prediction for Next Week's Full Evaluation
Based on current trends, if Wenxin Yiyan can keep its volatility below 30, it should break into the top five of next week's Full Evaluation; if the high volatility persists, it will likely be overtaken by Claude Sonnet and Doubao Pro. For GPT-o3 and Grok, the key question is whether their execution dimensions stop declining; otherwise they will be squeezed further out of the first tier.
Seven consecutive days of small-sample testing is already enough to expose real stability differences between models. Models whose Smoke-test volatility exceeds 30 points are very likely to see that gap widen further in Full Evaluation scenarios involving long contexts and complex reasoning.
Short-term bursts can only bring weekly hype; steady climbing determines monthly rankings.
Data source: YZ Index | Run #119
© 2026 Winzheng.com 赢政天下 | Please credit the source and link to the original when reposting.