This week's Smoke rapid tests, run over 7 consecutive days, show DeepSeek V4 Pro falling from 97.08 on the first day to 66.88 on the last: a trend of -30.2, with an average of only 79.8 and volatility of 57.8. The decline is among the steepest of any model and contrasts sharply with its earlier high scores.
Declining Models: DeepSeek and Gemini 2.5 Pro Face Concentrated Issues
DeepSeek V4 Pro's decline is not an isolated case. Gemini 2.5 Pro also fell from 96.63 to 66.2, a trend of -30.4, with an average of 75 and volatility of 58.3. Both models hit single-day troughs on days 4 and 5, and their integrity ratings quickly shifted from pass to warn or even fail, pointing to a serious lack of response consistency. GPT-o3 dropped as well, from 89.44 to 71.06 (a trend of -18.4), with an average of 75.5, volatility of 68.7, and multiple warn integrity ratings.
These declines point to stability flaws that the consecutive rapid tests exposed. The YZ Index's stability dimension measures the standard deviation of scores, not accuracy itself. The volatility values reported for DeepSeek V4 Pro and Gemini 2.5 Pro mean that when the models answer similar questions multiple times, their scores fluctuate significantly, which makes it difficult to sustain high-level output.
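To make the distinction between spread and accuracy concrete, here is a minimal sketch in Python. The two seven-day series are hypothetical, invented for illustration, and are not YZ Index data. The function name `score_spread` is also an assumption, and the sketch does not reproduce how the YZ Index scales standard deviation into its reported volatility figure, which is not stated here.

```python
from statistics import mean, pstdev


def score_spread(daily_scores):
    """Population standard deviation of one model's daily scores."""
    return pstdev(daily_scores)


# Hypothetical seven-day series, invented for illustration only.
# Both average exactly 90, but one swings far more from day to day.
steady = [90.0, 91.0, 89.0, 90.0, 91.0, 89.0, 90.0]
erratic = [99.0, 98.0, 70.0, 72.0, 97.0, 96.0, 98.0]

if __name__ == "__main__":
    for name, series in (("steady", steady), ("erratic", erratic)):
        print(f"{name}: mean={mean(series):.2f} spread={score_spread(series):.2f}")
```

Two models with the same average can therefore differ sharply on a spread-based measure, which is why a stability dimension is tracked separately from the score itself.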
Rising Models: GPT-5.5 and Claude Sonnet 4.6 Steadily Rebound
In contrast to the declining models, GPT-5.5 and Claude Sonnet 4.6 performed strongly. GPT-5.5 rose from 87.41 to 98.88, a trend of +11.5, with an average of 81.5 and volatility of 69. Claude Sonnet 4.6 rose from 90.56 to 98.97, a trend of +8.4, with an average of 83.8 and volatility of 62.8. Doubao Pro and Grok 4 also posted moderate gains of +2.3 and +2.9 respectively, both finishing the last day close to 99.
Although these rising models also received warn integrity ratings, they recovered faster overall. The ratings for Claude Sonnet 4.6 and GPT-5.5 held mostly at pass over the last three days, suggesting improved adaptability to rapid test questions. Ernie 4.5 made a large leap from 61.25 to 84.39, a trend of +23.1, but its average was only 69, which indicates a still-weak foundation.
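The trend figures quoted above appear to be the last-day score minus the first-day score, rounded to one decimal place. That reading is an assumption drawn from the numbers themselves, not a documented YZ Index formula. A short Python sketch that checks the quoted values against it (the names `reported`, `trend`, and `mismatches` are illustrative):

```python
# (first-day score, last-day score, trend as quoted in the article)
reported = {
    "DeepSeek V4 Pro": (97.08, 66.88, -30.2),
    "Gemini 2.5 Pro": (96.63, 66.2, -30.4),
    "GPT-o3": (89.44, 71.06, -18.4),
    "GPT-5.5": (87.41, 98.88, 11.5),
    "Claude Sonnet 4.6": (90.56, 98.97, 8.4),
    "Ernie 4.5": (61.25, 84.39, 23.1),
}


def trend(first, last):
    """Assumed definition: last-day score minus first-day score, to 1 decimal."""
    return round(last - first, 1)


def mismatches(table, tol=0.05):
    """Names whose quoted trend differs from last - first by more than tol."""
    return [
        name
        for name, (first, last, quoted) in table.items()
        if abs((last - first) - quoted) > tol + 1e-9
    ]


if __name__ == "__main__":
    for name, (first, last, quoted) in reported.items():
        print(f"{name}: computed {trend(first, last):+.1f}, quoted {quoted:+.1f}")
    print("mismatches:", mismatches(reported))
```

Under this assumption, every quoted trend agrees with the first-day and last-day scores to within rounding.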
Integrity Rating Fluctuations Become the Biggest Signal
The most noteworthy signal this week is not the scores themselves but the changes in integrity ratings. DeepSeek V4 Pro, Gemini 2.5 Pro, Grok 4, and GPT-o3 all switched repeatedly between pass, warn, and fail. Gemini 3.1 Pro's trend was flat, yet it failed outright on day 3 and turned to warn on day 7. Because the integrity rating serves as an entry threshold, frequent fluctuations directly affect how credible a model's results are judged to be.
Among high-volatility models, Claude Opus 4.7 and GPT-5.5 posted stability scores of 69.9 and 69 respectively, indicating low response consistency. Although the current rapid test sample is small, seven consecutive days are enough to show how models behave under sustained, high-pressure questioning.
Prediction for Next Week's Full Evaluation
Based on this week's trends, DeepSeek V4 Pro and Gemini 2.5 Pro are likely to remain under pressure in next week's Full evaluation. If their consistency issues are not resolved, their core overall_display scores are likely to keep declining. GPT-5.5 and Claude Sonnet 4.6 are expected to further consolidate their advantages in the material constraint and code execution dimensions.
The fluctuations in the consecutive rapid tests have already drawn a dividing line ahead of the Full evaluation.
The gap between models is likely to keep widening. Rising models with stable integrity ratings will earn more trust, while models that decline repeatedly will need to show substantial improvement in the next phase.
Data source: YZ Index | Run #139
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article