Gemini Main Ranking Plummets 23 Points, Claude Sonnet 4.6 Tops Smoke Quick Test with 97.5 Points

May 20, 2026 | Winzheng Index

Tags: Claude Sonnet 4.6, Material Constraints, Gemini Plunge, Integrity Rating, Smoke Quick Test

Today's 10-question Smoke quick test hit the Gemini series hard. Gemini 3.1 Pro's main leaderboard score fell 23.2 points, from yesterday's 97.2 to 74.0, and Gemini 2.5 Pro dropped 22.6 points, with both models slipping on execution and on material constraints.

Claude Duo Holds Top Two Spots, Nearly Perfect Execution Scores

Claude Sonnet 4.6 and Opus 4.7 take the top two spots with 97.5 and 96.51 points respectively, and both score 97.5 on the execution dimension. Sonnet 4.6 is equally strong on code execution and material constraints, which keeps it in first place under the weighted formula: main score = 0.55×execution + 0.45×constraints.
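As a minimal sketch of the weighted formula quoted above, the snippet below computes a main score from the two dimension scores and inverts the formula to recover the constraint score that a reported main score implies. The function names are hypothetical, and the sketch assumes the main score is the plain weighted sum with no further adjustment, which the source does not confirm.

```python
EXECUTION_WEIGHT = 0.55   # weights as quoted in the article
CONSTRAINT_WEIGHT = 0.45


def main_score(execution: float, constraints: float) -> float:
    """Weighted main score from the two dimension scores."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraints


def implied_constraints(main: float, execution: float) -> float:
    """Invert the formula: constraint score implied by main and execution."""
    return (main - EXECUTION_WEIGHT * execution) / CONSTRAINT_WEIGHT


# Reported figures: both Claude models score 97.5 on execution.
sonnet_constraints = implied_constraints(97.5, 97.5)
opus_constraints = implied_constraints(96.51, 97.5)

print(round(sonnet_constraints, 1))  # implied, not a published number
print(round(opus_constraints, 1))    # implied, not a published number
```

Under that assumption the gap between the two Claude models comes entirely from the constraint dimension, since their execution scores are identical.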

Chinese Models Rise Together, Qwen and Doubao See Impressive Gains

Doubao Pro ranks third with 96.06 points, while Qwen3 Max's main leaderboard score surged 26.2 points as its execution score climbed from yesterday's low to 96. Both models still trail Claude by 3 to 5 points on material constraints, but their execution capabilities have reached the top tier.

Gemini and Wenxin Show Concentrated Anomalies

Gemini 3.1 Pro's constraint score dropped from yesterday's 93.8 to 86.5, the second consecutive day of significant fluctuation. Wenxin Yiyan 4.5 was marked Fail outright, its integrity rating falling from pass to fail, which is rare in the history of Smoke evaluations.
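The Gemini 3.1 Pro figures can be cross-checked against the 0.55×execution + 0.45×constraints formula. The sketch below back-solves the execution score implied by each day's main and constraint scores. The helper name is hypothetical, and the calculation assumes the main score is the plain weighted sum, so the results are inferences rather than published numbers.

```python
def implied_execution(main: float, constraints: float) -> float:
    """Execution score implied by main = 0.55*execution + 0.45*constraints."""
    return (main - 0.45 * constraints) / 0.55


# Yesterday: main 97.2, constraints 93.8 (both reported).
yesterday_execution = implied_execution(97.2, 93.8)

# Today: main fell 23.2 points to 74.0, constraints fell to 86.5.
today_main = 97.2 - 23.2
today_execution = implied_execution(today_main, 86.5)

# Share of the main-score drop attributable to the constraint dimension.
constraint_contribution = 0.45 * (93.8 - 86.5)

print(round(yesterday_execution, 1))
print(round(today_execution, 1))
print(round(constraint_contribution, 2))
```

If the assumption holds, the 7.3-point constraint decline accounts for only about 3.3 points of the 23.2-point drop, which would leave most of the fall to the execution dimension.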

GPT-o3 also drew a yellow warning: its integrity rating changed from pass to warn, and its constraint score was only 83.3. DeepSeek V4 Pro and GPT-5.5 have entered the warn zone as well, which suggests that material constraints have become the key bottleneck separating models by real-world reliability.
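The rating changes described above can be tracked mechanically. The following is a minimal sketch, assuming the three ratings are ordered pass < warn < fail; that ordering, the function name, and the dictionary layout are illustrative assumptions, not the Smoke pipeline's actual logic. Previous-day ratings for DeepSeek V4 Pro and GPT-5.5 are not stated in the article, so those models are left out of the comparison.

```python
# Assumed severity ordering of the integrity ratings.
RATING_ORDER = {"pass": 0, "warn": 1, "fail": 2}


def find_downgrades(previous: dict, current: dict) -> dict:
    """Return models whose rating worsened, as {model: (old, new)}."""
    return {
        model: (previous[model], rating)
        for model, rating in current.items()
        if model in previous
        and RATING_ORDER[rating] > RATING_ORDER[previous[model]]
    }


# Only transitions the article states explicitly are listed as previous.
previous = {"Wenxin Yiyan 4.5": "pass", "GPT-o3": "pass"}
current = {
    "Wenxin Yiyan 4.5": "fail",
    "GPT-o3": "warn",
    "DeepSeek V4 Pro": "warn",  # prior rating not given in the article
    "GPT-5.5": "warn",          # prior rating not given in the article
}

downgraded = find_downgrades(previous, current)
for model, (old, new) in downgraded.items():
    print(f"{model}: {old} -> {new}")
```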

Execution scores can be boosted by training, but constraint scores require long-term alignment and engineering validation.

Today's data again shows Claude maintaining the highest consistency in lightweight quick tests, while the Gemini series may be going through a painful period of version iteration. The drop in Wenxin Yiyan's integrity rating warrants even closer monitoring.

If Gemini cannot stop its decline in the next Smoke evaluation, industry expectations for the series' usability will be further lowered.

Data source: YZ Index | Run #124

Related Reviews

Winzheng Index YZ Index Smoke Weekly: ERNIE Bot 4.5 Drops 37.2 Points, Multiple Models Fluctuate Over 28

Winzheng Index Qwen3 Max Smoke Evaluation Main Score Plummets 12 Points, Integrity Rating Changes from Pass to Fail

Winzheng Index Smoke Review: All 10 Models Score Full Marks in Code Execution, Grounding Gap Widens Ranking

Winzheng Index Claude Sonnet 4.6 Leads with 97.53 Points, Material Constraints Drag Wenxin Yiyan 40 Points Behind