The clearest finding from this week's 7-day Smoke test is that GPT-5.5, once a consistent leader, now averages just 74.6 on execution. Its final-day score of 63.89 is 23.1 points below its day-one score, with both material-constraint handling and code execution weakening.
Top Models Slide Collectively, Execution and Grounding Decline in Tandem
GPT-5.5, GPT-o3, and Grok 4 show trends of -23.1, -20.8, and -17.2 respectively. All three started above 84 points but fell to the 63–67 range by the final day. The data indicates that their grounding dimension deteriorated significantly during the continuous test, with the widest score fluctuations occurring on questions involving complex material constraints (10 questions per day). This contrasts with their reliance on long-context memory in the Full rating, suggesting that the Smoke test is becoming more sensitive to grounding.
Two Claude Models Reverse by 30+ Points, but Stability Becomes a Concern
Claude Opus 4.7 surged from 58.1 on day one to 90.21, a trend of +32.1; Claude Sonnet 4.6 was even more dramatic, rising from 56.44 to 90.66, a trend of +34.2. Both models entered a high plateau after day 4, but their volatility reached 40.8 and 48.3 respectively — far higher than Doubao Pro's 21.2. According to the YZ Index formula, Sonnet 4.6's stability score is as low as approximately 3.4, indicating extremely poor response consistency, with high-score and low-score days alternating repeatedly.
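The trend figures above match a simple final-day minus day-one difference, which the short sketch below reproduces. The YZ Index stability formula itself is not stated in this article. The `assumed_stability` function is a hypothetical linear form (100 minus twice the volatility, floored at zero) chosen only because it reproduces the roughly 3.4 quoted for Sonnet 4.6. It should not be read as the official formula, and both function names are illustrative.

```python
def trend(day_one: float, final_day: float) -> float:
    """Trend as final-day score minus day-one score, rounded to one decimal."""
    return round(final_day - day_one, 1)


def assumed_stability(volatility: float) -> float:
    """ASSUMPTION: hypothetical stability formula, not taken from the YZ Index.

    It is picked only because 100 - 2 * 48.3 matches the ~3.4 quoted above.
    """
    return round(max(0.0, 100.0 - 2.0 * volatility), 1)


# Day-one and final-day scores as quoted in the article.
scores = {
    "Claude Opus 4.7": (58.1, 90.21),
    "Claude Sonnet 4.6": (56.44, 90.66),
}

for name, (day_one, final_day) in scores.items():
    print(f"{name}: trend {trend(day_one, final_day):+.1f}")

print("Sonnet 4.6 stability under the assumed formula:", assumed_stability(48.3))
```

If the real formula differs, only `assumed_stability` needs to change; the trend arithmetic follows directly from the quoted scores.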
Highest-Volatility Models Cluster Together, and Low Stability Directly Undermines Credibility
Aside from Claude, Gemini 3.1 Pro (volatility 43.7), Wenxin Yiyan 4.5 (42.9), and Qwen3 Max (36.4) all belong to the high-volatility group. Gemini 3.1 Pro has an average of 76.5 but experienced a single-day drop of 20 points on day 3, followed by a slow recovery, showing extreme instability in its judgment dimension. DeepSeek V4 Pro, with a trend of +8.5 and volatility of just 17.9, is one of the few models that balances upward movement with relative stability.
Integrity Ratings Recovering from Warn/Fail Becomes a Key Signal
This week, 9 out of 11 models received a warn or fail at some point. Grok 4 briefly failed before returning to pass, while DeepSeek V4 Pro switched between warn and fail twice. By the end of the 7-day period, all models had returned to pass, indicating that the platform is tightening detection of hallucinations and factual errors. However, it also reveals that some models tend to "cram" during continuous rapid testing.
Next Week's Full Rating Prediction: Claude Under Pressure, DeepSeek and Doubao May Continue to Gain Share
Based on this week's Smoke trends, if Claude's two models cannot reduce volatility below 25 in the Full rating, their scores above 90 are unlikely to hold. GPT-5.5 needs to recover at least 15 points in grounding, or it will be further squeezed by Doubao Pro (average 86.7, volatility 21.2) and DeepSeek V4 Pro (average 82.7, volatility 17.9). If Qwen3 Max can bring volatility below 25, it could become the biggest dark horse of the week.
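The volatility threshold of 25 used in this prediction can be applied as a simple screen. The sketch below uses only the volatility figures quoted in this article. The helper names are illustrative, and treating a value of exactly 25 as not passing is an assumption.

```python
THRESHOLD = 25.0  # volatility ceiling named in the prediction above

# Volatility values as quoted in this article (models without a quoted value are omitted).
volatility = {
    "Claude Opus 4.7": 40.8,
    "Claude Sonnet 4.6": 48.3,
    "Gemini 3.1 Pro": 43.7,
    "Wenxin Yiyan 4.5": 42.9,
    "Qwen3 Max": 36.4,
    "Doubao Pro": 21.2,
    "DeepSeek V4 Pro": 17.9,
}


def below_threshold(vols: dict[str, float], limit: float = THRESHOLD) -> list[str]:
    """Models whose volatility is already strictly under the limit, sorted by name."""
    return sorted(name for name, v in vols.items() if v < limit)


def reduction_needed(vols: dict[str, float], limit: float = THRESHOLD) -> dict[str, float]:
    """Points of volatility each remaining model would have to shed to reach the limit."""
    return {name: round(v - limit, 1) for name, v in vols.items() if v >= limit}


print("Already under 25:", below_threshold(volatility))
for name, gap in reduction_needed(volatility).items():
    print(f"{name}: needs to cut volatility by {gap}")
```

On the quoted figures, only Doubao Pro and DeepSeek V4 Pro clear the screen, which is in line with the article's expectation that they gain ground.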
The Smoke test has already sounded an early alarm: scores can spike in the short term, but stability and integrity are the long-term prerequisites.
Data source: YZ Index | Run #152
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article