GPT-5.5 Plunges 23 Points, Two Claude Models Surge 34 Points: 7-Day Smoke Data Reveals Real Trends

Jun 7, 2026 | Winzheng Index

Claude Opus 4.7 | Stability | Smoke Evaluation | Integrity Rating Changes | Weekly Trend Forecast

Top Models Slide Collectively, Execution and Grounding Decline in Tandem

The most direct finding from this week's 7-day Smoke test is that GPT-5.5, once a consistent leader, now has an average execution score of just 74.6, with a final-day score of 63.89. That is a drop of 23.1 points from day one, with both grounding (material constraints) and code execution weakening.

Two Claude Models Reverse by 30+ Points, but Stability Becomes a Concern

Claude Opus 4.7 surged from 58.1 on day one to 90.21, a trend of +32.1; Claude Sonnet 4.6 was even more dramatic, rising from 56.44 to 90.66, a trend of +34.2. Both models entered a high plateau after day 4, but their volatility reached 40.8 and 48.3 respectively, far higher than Doubao Pro's 21.2. According to the YZ Index formula, Sonnet 4.6's stability score is only about 3.4, indicating very poor response consistency, with high-score and low-score days alternating repeatedly.
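The arithmetic behind these figures can be checked with a short sketch. The trend values match a simple final-day minus day-one difference. The article does not state the YZ Index stability formula, so `assumed_stability` below is purely an assumption: a linear form, 100 minus twice the volatility, that happens to reproduce the quoted 3.4 for Sonnet 4.6. The function names are hypothetical, and the published formula may differ.

```python
# Day-one and final-day scores as quoted in the article.
SCORES = {
    "Claude Opus 4.7": (58.1, 90.21),
    "Claude Sonnet 4.6": (56.44, 90.66),
}

# Volatility figures as quoted in the article.
VOLATILITY = {
    "Claude Opus 4.7": 40.8,
    "Claude Sonnet 4.6": 48.3,
    "Doubao Pro": 21.2,
}


def trend(first_day: float, last_day: float) -> float:
    """Final-day score minus day-one score, rounded to one decimal."""
    return round(last_day - first_day, 1)


def assumed_stability(volatility: float) -> float:
    """ASSUMPTION, not the published YZ Index formula:
    100 - 2 * volatility, clamped to the range [0, 100]."""
    raw = 100.0 - 2.0 * volatility
    return round(max(0.0, min(100.0, raw)), 1)


if __name__ == "__main__":
    for model, (first, last) in SCORES.items():
        print(f"{model}: trend {trend(first, last):+.1f}")
    for model, vol in VOLATILITY.items():
        print(f"{model}: assumed stability {assumed_stability(vol)}")
```

If the assumed form is close to the real one, a model needs volatility under 50 to score above zero on stability at all, which would explain why Sonnet 4.6 sits near the floor despite finishing above 90.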

Highest Volatility Models Cluster, Low Stability Directly Impacts Credibility

Aside from the two Claude models, Gemini 3.1 Pro (volatility 43.7), Wenxin Yiyan 4.5 (42.9), and Qwen3 Max (36.4) all belong to the high-volatility group. Gemini 3.1 Pro has an average of 76.5 but dropped 20 points in a single day on day 3 and then recovered only slowly, showing extreme instability in its judgment dimension. DeepSeek V4 Pro, with a trend of +8.5 and volatility of just 17.9, is one of the few models that balances upward movement with relative stability.

Integrity Ratings Recovering from Warn/Fail Becomes a Key Signal

This week, 9 of 11 models received a warn or fail rating at some point. Grok 4 briefly failed before returning to pass, while DeepSeek V4 Pro switched between warn and fail twice. By the end of the 7-day period, all models had returned to pass. This pattern indicates that the platform is tightening detection of hallucinations and factual errors, but it also reveals that some models tend to "cram" during continuous rapid testing.

Next Week's Full Rating Prediction: Claude Under High Pressure, DeepSeek and Doubao May Continue to Gain Share

Based on this week's Smoke trends, if the two Claude models cannot reduce volatility below 25 in the Full rating, their scores above 90 are unlikely to hold. GPT-5.5 needs to recover at least 15 points in grounding, or it will be further squeezed by Doubao Pro (average 86.7, volatility 21.2) and DeepSeek V4 Pro (average 82.7, volatility 17.9). If Qwen3 Max can bring volatility below 25, it could become the biggest dark horse of the week.
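As a rough illustration of the volatility threshold used in this forecast, the sketch below applies the 25-point ceiling to every volatility figure quoted in this article. The article names the threshold only for the two Claude models and Qwen3 Max, so applying it to the other models is an assumption here, and the helper names are hypothetical.

```python
# Volatility ceiling named in the forecast.
VOLATILITY_CEILING = 25.0

# Volatility figures as quoted in the article.
VOLATILITY = {
    "Claude Opus 4.7": 40.8,
    "Claude Sonnet 4.6": 48.3,
    "Gemini 3.1 Pro": 43.7,
    "Wenxin Yiyan 4.5": 42.9,
    "Qwen3 Max": 36.4,
    "Doubao Pro": 21.2,
    "DeepSeek V4 Pro": 17.9,
}


def below_ceiling(table: dict[str, float],
                  ceiling: float = VOLATILITY_CEILING) -> list[str]:
    """Names of models whose volatility is already under the ceiling."""
    return sorted(name for name, vol in table.items() if vol < ceiling)


def gap_to_ceiling(vol: float, ceiling: float = VOLATILITY_CEILING) -> float:
    """Points of volatility a model must shed to reach the ceiling (0 if under)."""
    return round(max(0.0, vol - ceiling), 1)


if __name__ == "__main__":
    print("Already under the ceiling:", below_ceiling(VOLATILITY))
    for name, vol in VOLATILITY.items():
        print(f"{name}: needs to shed {gap_to_ceiling(vol)} points")
```

On these figures only Doubao Pro and DeepSeek V4 Pro are already under the ceiling, while Qwen3 Max would need to shed 11.4 points of volatility and Sonnet 4.6 would need 23.3.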

The Smoke test has already sounded an early alarm: scores can spike in the short term, but stability and integrity are the long-term prerequisites.

Data source: YZ Index | Run #152
