Two Models Plunge by Double Digits, Claude Still Near a Perfect Score

The most striking news today is not Claude's victory, but two leading models stalling simultaneously: GPT-5.5 fell 28 points on the main leaderboard, and DeepSeek V4 Pro fell 19.4.

At 3:00 AM on May 16, the Winzheng Index Smoke lightweight evaluation completed a 10-question rapid test across 11 mainstream models. This round examines only two auditable core dimensions: code execution and material constraint. The main board formula is core_overall = 0.55 × code execution + 0.45 × material constraint, so once execution capability slips, the main board score is dragged down quickly.
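The weighting is simple enough to check by hand. The sketch below is a minimal illustration, not published code: the function name and the score table are assembled from the figures quoted in this article, and it reproduces today's posted totals to within rounding.

```python
# Main board formula as published by the Winzheng Index:
# core_overall = 0.55 * code_execution + 0.45 * material_constraint
def core_overall(code_execution: float, material_constraint: float) -> float:
    return 0.55 * code_execution + 0.45 * material_constraint

# (execution, material constraint) pairs as quoted in this article
scores = {
    "Claude Sonnet 4.6": (100, 96.3),   # posted main board: 98.34
    "Claude Opus 4.7":   (100, 95.0),   # posted main board: 97.75
    "Doubao Pro":        (100, 85.5),   # posted main board: 93.48
    "Gemini 3.1 Pro":    (100, 68.8),   # posted main board: 85.96
    "GPT-o3":            (100, 65.5),   # posted main board: 84.48
    "Qwen3 Max":         (87.5, 82.8),  # posted main board: 85.39
    "Grok 4":            (50, 45.0),    # posted main board: 47.75
}

for model, (execution, constraint) in scores.items():
    print(f"{model}: {core_overall(execution, constraint):.2f}")
```

Two models are left out deliberately: GPT-5.5 and DeepSeek V4 Pro, whose material constraint components are not published, and Gemini 2.5 Pro, whose posted total does not match the formula, a point taken up below. Note also what the 55/45 split implies: an execution failure cannot be recovered through constraint alone, since even a perfect material constraint score caps a model with 50 in execution at 0.55 × 50 + 0.45 × 100 = 72.5.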

The first tier is no longer "leading" but "dominating"

Claude Sonnet 4.6 takes first place with 98.34, scoring 100 in code execution, 96.3 in material constraint, and passing the integrity rating. Claude Opus 4.7 follows closely with a main board score of 97.75, also 100 in execution and 95 in constraint. The gap between them is only 0.59 points, but their lead over the third-place model, Doubao Pro, has widened to over 4 points.

This is not just a ranking advantage; it is a structural one. Both Claude models are nearly flawless on the two questions that matter most: "can it run" and "does it follow the material." Smoke has only 10 questions, so the volume is light, but that is precisely why a rapid test exposes a model's engineering reliability in its default state. Claude's performance today shows that it does not depend on bursts on particular question types; it holds its ceiling through a low error rate.

Today's top three on the main board: Claude Sonnet 4.6 at 98.34, Claude Opus 4.7 at 97.75, and Doubao Pro at 93.48. Doubao Pro, in third place, scored 100 in execution but only 85.5 in material constraint, which is the main source of its gap with Claude.

The real cliff is in execution: GPT-5.5 and DeepSeek both drop to 50

The most concerning performance today comes from GPT-5.5 and DeepSeek V4 Pro. GPT-5.5's main board score is 56.08, a drop of 28 points from yesterday. The core reason is straightforward: its code execution score fell 50 points from yesterday's level, leaving only 50 today. DeepSeek V4 Pro's main board score is 54.64, down 19.4 points, with execution also dropping to 50.

In the Winzheng Index main board, code execution carries a 55% weight. This is not because of a bias toward programmers, but because execution questions have the least room for ambiguity: if the result works, it works; if it doesn't, no amount of explanation helps. The drops seen in GPT-5.5 and DeepSeek today indicate that the problem lies not in "response style," but in a breakdown of verifiable output.
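The weight also explains the size of the fall. By the published formula, a drop in execution from 100 to 50 costs 0.55 × 50 = 27.5 main board points on its own, which accounts for nearly all of GPT-5.5's 28-point decline. The small remainder implies a slip of roughly one point in material constraint as well: that component is not published, but back-solving the formula with today's 56.08 total and 50 in execution gives (56.08 − 0.55 × 50) / 0.45 ≈ 63.5, against roughly 64.6 yesterday. These derived figures are inferences from the formula, not posted scores.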

Such fluctuations usually have three possible causes:
  • a model routing change, so users are served a different capability tier;
  • an adjustment to safety or tool policies that causes execution tasks to be handled more conservatively;
  • a recent server-side update that introduced a regression.
Whichever it is, the result is hostile to developers, because what developers fear most is not slowness, but something that worked yesterday failing today.

Gemini 3.1 Pro wins on execution, loses on material constraint

Gemini 3.1 Pro scores 85.96 on the main board today, with 100 in execution but only 68.8 in material constraint. GPT-o3 scores 84.48 on the main board, with 100 in execution and 65.5 in material constraint. Their problem is similar: they perform well on coding questions, but when strictly required to follow the material and avoid extrapolation, they start losing points.

This serves as a reminder for enterprise users: if your use case involves code generation, script fixing, or structured processing, Gemini 3.1 Pro and GPT-o3 remain competitive. However, if the scenario involves compliance Q&A, research report summarization, or contract clause extraction, a low material constraint score amplifies risk. A model being "smart" does not mean it "follows the rules."

  • Qwen3 Max scores 85.39 on the main board, with 87.5 in execution, 82.8 in material constraint, and a warn integrity rating. Its performance is balanced, but the warn flag merits continued observation.
  • Gemini 2.5 Pro scores 74 on the main board, with 100 in execution and 74.3 in material constraint, but its integrity rating is fail. The issue is not a lack of capability but a failure to clear the integrity threshold (see the arithmetic after this list).
  • Grok 4 scores 47.75 on the main board. Although it surged 36.5 points from yesterday, it still ranks at the bottom with 50 in execution and 45 in material constraint.
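That Gemini 2.5 Pro line is worth checking against the formula. Applying the published weighting to its components gives 0.55 × 100 + 0.45 × 74.3 ≈ 88.4, well above the posted 74, while every other model with both components published matches the weighted sum to within rounding. The natural inference, and it is an inference rather than a documented rule, is that a fail integrity rating carries an explicit penalty of roughly 14 points rather than acting as a mere flag.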

An anomaly with Grok: the metrics themselves must also be audited

Today's anomaly log states "Grok 4: Integrity rating downgraded to Fail," yet the same dataset's ranking for today shows pass, and the yesterday-to-today comparison reads fail→pass. These two records conflict. Following the auditable detail pages, this article adopts the reading that today's integrity rating is pass, improved from yesterday. The conflict itself is worth noting: if evaluation reports are to become an industry benchmark, the data annotations must be audited as rigorously as the models.

Grok 4's 36.5-point main board surge sounds dramatic, but with today's execution at 50 and material constraint at 45, it still ranks last. The so-called rebound is less a breakthrough in capability than a climb from an unusually low point back into the observable range. For buyers, such a model should not enter the core production path on the strength of a single day's jump.

Conclusion: In 2026, model competition is about making fewer mistakes

Today's Smoke rapid test sends a clear signal: the gap among top-tier models no longer comes mainly from "whether they can answer," but from "whether they can consistently deliver verifiable results under constraints." The Claude duo holds near-perfect totals through 100 in execution and high material constraint. Doubao Pro proves that a domestic model can take full marks in execution. GPT-5.5 and DeepSeek V4 Pro remind the industry that a flagship title is no shield against execution regression.

My judgment is straightforward: In the next three months, enterprise model selection will shift from "who is the smartest" to "who stumbles the least." The second half of the model war rewards not inspiration, but deliverability.


Data source: YZ Index | Run #118