This week's most prominent signal from the Smoke quick test comes from GPT-5.5: over seven days it climbed from 60.58 on the first day to 90.3 on the last, a net gain of 29.7 points. Its weekly average is only 73.6, but the trajectory is clearly upward. This stands in stark contrast to the decline most other models showed over the same period.
Declining Camp: GPT-o3 and DeepSeek Lead the Drop
GPT-o3 fell from 94.51 on the first day to 58.08 on the last, a trend of -36.4 and the largest decline among all models, leaving its weekly average at just 73.8. DeepSeek V4 Pro also dropped significantly, from 93.03 on the first day to 74 on the last, a trend of -19 with an average of 81.1. More notable is its integrity rating: it drew multiple warns in the first five days, then failed outright on the sixth day and again on the seventh. This is not a random fluctuation but a clear sign that the model's performance degraded over consecutive quick tests.
Doubao Pro is also worth watching. It slipped from 97.75 on the first day to 89.85 on the last, a trend of -7.9 with an average of 85.4, and its volatility is as high as 43.7, which points to a significant loss of answer consistency. Its integrity rating was pass for six consecutive days and warn on only one, so it appears to be barely holding the line, but its core capability is slowly eroding.
Rising Camp: Four Models Buck the Trend
Besides GPT-5.5, three more models rose: ERNIE Bot 4.5 from 74 to 88.48 (+14.5), Gemini 3.1 Pro from 75 to 88.7 (+13.7), and Qwen3 Max from 77.84 to 84.2 (+6.4). All four show a positive trend despite relatively low weekly averages, which suggests their underlying capabilities are still in an active iteration window.
Among them, GPT-5.5 and Gemini 3.1 Pro stand out: both climbed steeply, and their last-day scores approach or exceed those of most established models. For users, this suggests that the current Smoke quick test is sensitive to new version iterations, and that a dramatic reshuffle of the rankings, with latecomers overtaking incumbents, may occur in the short term.
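The trend figures quoted so far are consistent with a simple last-day minus first-day difference. The sketch below recomputes them from the first-day and last-day scores given in this article; the `trend` helper and the one-decimal rounding are assumptions made for illustration, not the YZ Index's published method.

```python
# First-day and last-day Smoke scores, copied from the figures in this article.
SCORES = {
    "GPT-5.5": (60.58, 90.3),
    "GPT-o3": (94.51, 58.08),
    "DeepSeek V4 Pro": (93.03, 74.0),
    "Doubao Pro": (97.75, 89.85),
    "ERNIE Bot 4.5": (74.0, 88.48),
    "Gemini 3.1 Pro": (75.0, 88.7),
    "Qwen3 Max": (77.84, 84.2),
}


def trend(first: float, last: float) -> float:
    """Assumed definition: last-day score minus first-day score, to one decimal."""
    return round(last - first, 1)


trends = {name: trend(first, last) for name, (first, last) in SCORES.items()}

# List the models from the largest gain to the largest drop.
for name, delta in sorted(trends.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {delta:+.1f}")
```

Sorting by the difference puts the rising and declining camps at opposite ends of the same list, which makes it easy to check each quoted trend against its two endpoints.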
Volatility Reveals Stability Risks
The stability dimension, computed as max(0, 100 - stddev × 2), directly reflects how widely a model's scores are dispersed when it answers similar questions repeatedly. Gemini 2.5 Pro (volatility 61.1), ERNIE Bot 4.5 (volatility 55), and Doubao Pro (volatility 43.7) are all far more volatile than GPT-5.5 (30.9). This means that when facing similar questions, the output quality of the former three fluctuates greatly, leading to a highly unstable user experience.
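As a rough illustration of the formula max(0, 100 - stddev × 2), the sketch below applies it to two made-up score series. The use of the population standard deviation and both sample series are assumptions for demonstration only; the article does not say which standard deviation the index uses, nor how the quoted volatility figures are derived from it.

```python
from statistics import pstdev


def stability(scores: list[float]) -> float:
    """Stability = max(0, 100 - 2 * stddev), floored at zero.

    Assumption: population standard deviation (pstdev); the source
    does not specify population versus sample.
    """
    return max(0.0, 100.0 - 2.0 * pstdev(scores))


# Hypothetical repeated-run scores, not taken from the YZ Index data.
steady = [80.0, 82.0, 81.0, 79.0, 80.0]
erratic = [95.0, 40.0, 88.0, 35.0, 90.0]

print(f"steady : {stability(steady):.1f}")
print(f"erratic: {stability(erratic):.1f}")
```

Because the deviation is doubled before being subtracted, any series with a standard deviation of 50 or more bottoms out at a stability of 0, so the floor matters for very erratic models.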
High volatility often goes hand in hand with an unsteady integrity rating. Gemini 2.5 Pro failed on the third day and only recovered on the fifth; ERNIE Bot 4.5 recorded three fails and two warns in seven days, pointing to obvious shortcomings in both mainboard dimensions, material constraint and code execution.
Outlook for Next Week's Full Evaluation
Based on the current trend, GPT-5.5 and Gemini 3.1 Pro are expected to keep moving up into the mid-range positions in next week's Full evaluation, while GPT-o3 and DeepSeek V4 Pro risk further losses. In particular, DeepSeek's record of consecutive integrity fails may trigger a stricter material constraint review, which would directly affect its mainboard ranking.
For context, the second quarter of 2026 is a dense iteration window for multiple vendors. The Smoke quick test has already captured the upward momentum of GPT-5.5 and Qwen3 Max ahead of time. In the Full evaluation, the two auditable dimensions, execution and grounding, are expected to further widen the gaps between the current trends.
Seven days of Smoke data make one thing clear: not all models are improving. A model whose score drops as low as 58 is the real signal users need to watch out for.
Data source: YZ Index | Run #129
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article.