In the YZ Index Smoke tests of 11 models run from June 23 to 28, 2026, ERNIE Bot 4.5 fell from 98.74 on Day 1 to 61.52 on Day 7, a trend of -37.2 and the largest decline of any model. Its average was only 82.1 and its fluctuation was 37.2.
Most Models Saw Collective Decline on the Final Day
Ten of the 11 models finished Day 7 below their Day 1 score. The figures for the other ten models are:
Claude Sonnet 4.6: Day 1 94.87, Day 7 70.52, trend -24.4, average 87.5, fluctuation 28.4.
Claude Opus 4.7: Day 1 100, Day 7 71.47, trend -28.5, average 89.8, fluctuation 28.5.
Gemini 2.5 Pro: Day 1 96.18, Day 7 81.41, trend -14.8, average 90.9, fluctuation 22.6.
Gemini 3.1 Pro: Day 1 100, Day 7 91.21, trend -8.8, average 90.5, fluctuation 30.7.
GPT-5.5: Day 1 96.18, Day 7 84.18, trend -12, average 92.8, fluctuation 14.7.
GPT-o3: Day 1 96.81, Day 7 82.53, trend -14.3, average 91.6, fluctuation 17.
Grok 4: Day 1 100, Day 7 82.97, trend -17, average 93.3, fluctuation 18.5.
DeepSeek V4 Pro: Day 1 99.37, Day 7 87.35, trend -12, average 94.2, fluctuation 17.8.
Qwen3 Max: Day 1 74, Day 7 69.94, trend -4.1, average 81, fluctuation 28.1.
Doubao Pro: Day 1 98.07, Day 7 98.61, trend +0.5, average 95.8, fluctuation 16.6. It was the only model to rise, and only slightly.
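The trend figures above are consistent with a simple difference between the Day 7 and Day 1 scores. The following Python sketch recomputes them from the quoted scores. The definition "trend = Day 7 minus Day 1" is an assumption inferred from the numbers, not something the YZ Index states here, and the names `SCORES` and `trend` are illustrative only.

```python
# Day 1 and Day 7 scores exactly as quoted in the article.
SCORES = {
    "ERNIE Bot 4.5": (98.74, 61.52),
    "Claude Sonnet 4.6": (94.87, 70.52),
    "Claude Opus 4.7": (100.0, 71.47),
    "Gemini 2.5 Pro": (96.18, 81.41),
    "Gemini 3.1 Pro": (100.0, 91.21),
    "GPT-5.5": (96.18, 84.18),
    "GPT-o3": (96.81, 82.53),
    "Grok 4": (100.0, 82.97),
    "DeepSeek V4 Pro": (99.37, 87.35),
    "Qwen3 Max": (74.0, 69.94),
    "Doubao Pro": (98.07, 98.61),
}


def trend(day1: float, day7: float) -> float:
    """Assumed definition: final-day score minus first-day score."""
    return round(day7 - day1, 1)


trends = {model: trend(*scores) for model, scores in SCORES.items()}

# List models from the steepest decline to the largest gain.
for model, value in sorted(trends.items(), key=lambda kv: kv[1]):
    print(f"{model:<18} {value:+.1f}")
```

Under this assumption ERNIE Bot 4.5 comes out as the steepest decline and Doubao Pro as the only positive trend, matching the text. Values that fall exactly on a rounding boundary may differ from the published figure by 0.1 because of floating-point rounding.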
High-Fluctuation Models Require Close Attention
Five models had fluctuations above 28 points: ERNIE Bot 4.5 (37.2), Gemini 3.1 Pro (30.7), Claude Opus 4.7 (28.5), Claude Sonnet 4.6 (28.4), and Qwen3 Max (28.1). Score swings this large over 7 days indicate that these models answer similar questions quite differently from one run to the next. Smoke uses only 10 questions per day, so the sample is small, but across the 7 consecutive days the final-day score is below the 7-day average for 9 of the 11 models (all except Gemini 3.1 Pro and Doubao Pro), which suggests that consistency declined over the course of continuous testing.
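The two checks in this section, which models exceed the 28-point fluctuation mark and which finished Day 7 below their own 7-day average, can be reproduced from the summary figures quoted in this article. This is a minimal sketch: the `Summary` type, the `SUMMARY` table, and the variable names are illustrative, not part of any YZ Index tooling.

```python
from typing import NamedTuple


class Summary(NamedTuple):
    average: float      # 7-day average score
    day7: float         # final-day score
    fluctuation: float  # fluctuation figure as reported


# Per-model figures exactly as quoted in the article.
SUMMARY = {
    "ERNIE Bot 4.5": Summary(82.1, 61.52, 37.2),
    "Claude Sonnet 4.6": Summary(87.5, 70.52, 28.4),
    "Claude Opus 4.7": Summary(89.8, 71.47, 28.5),
    "Gemini 2.5 Pro": Summary(90.9, 81.41, 22.6),
    "Gemini 3.1 Pro": Summary(90.5, 91.21, 30.7),
    "GPT-5.5": Summary(92.8, 84.18, 14.7),
    "GPT-o3": Summary(91.6, 82.53, 17.0),
    "Grok 4": Summary(93.3, 82.97, 18.5),
    "DeepSeek V4 Pro": Summary(94.2, 87.35, 17.8),
    "Qwen3 Max": Summary(81.0, 69.94, 28.1),
    "Doubao Pro": Summary(95.8, 98.61, 16.6),
}

FLUCTUATION_THRESHOLD = 28.0  # cut-off used in the text

# Models above the threshold, largest fluctuation first.
high_fluctuation = sorted(
    (m for m, s in SUMMARY.items() if s.fluctuation > FLUCTUATION_THRESHOLD),
    key=lambda m: SUMMARY[m].fluctuation,
    reverse=True,
)

# Models whose final-day score sits below their own 7-day average.
below_average = [m for m, s in SUMMARY.items() if s.day7 < s.average]

print("High fluctuation:", ", ".join(high_fluctuation))
print(f"Final day below average: {len(below_average)} of {len(SUMMARY)}")
```

With these inputs the filter selects the five models named above, and nine of the eleven models end below their average, the exceptions being Gemini 3.1 Pro and Doubao Pro.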
Integrity Rating Changes Send Signals
ERNIE Bot 4.5's integrity rating changed from warn to pass, and Qwen3 Max's changed from fail to warn and then to pass. Both ratings improved, yet both score trends remained negative (-37.2 and -4.1 respectively). The integrity rating serves only as a threshold and does not directly add points. That scores still fell after the ratings improved suggests that these capability fluctuations are not driven by the integrity dimension.
Possible Trends in Next Week's Full Evaluation
The 7-day Smoke trend suggests that models with fluctuations above 28 points are likely to keep showing score volatility in the Full evaluation. Doubao Pro, with an average of 95.8 and a trend of +0.5, may be the only model able to hold a high level. The Claude, Gemini, and GPT series averages range from 87.5 to 92.8, but their final-day scores have already dragged down overall performance. Across the industry, model iterations are typically accompanied by short-term declines in stability, and the current data is consistent with that pattern.
Seven consecutive days of Smoke data suggest that stability is no longer a secondary indicator for AI models but a decisive constraint.
If next week's Full evaluation uses the same question distribution, the final rankings of the high-fluctuation models may shift significantly, while Doubao Pro, with its comparatively low fluctuation, is well placed to keep its lead.
Data source: YZ Index | Run #201
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article.