The real shock this week was not who “surged,” but that after a round of generational upgrades, the strong got stronger while the weak were simply left behind.

In 2026-W20, the YZ Index covered 11 models, with the main ranking still based on only two auditable dimensions: code execution and material constraint. The formula is Main Ranking = 0.55 × Code Execution + 0.45 × Material Constraint. In other words, what truly counts is whether a model can write correct code and stay faithful to the provided materials.
No. 1 Has Not Changed, but the Champion Is Not Safe
Claude Sonnet 4.6 remains in first place with 83.54, scoring 86.60 in code execution and 79.80 in material constraint. Its advantage is not that one metric is off the charts, but that both stay at a high level: it can handle code well and does not drift when reading materials.
But second-place 豆包 Pro has already reached 82.63, only 0.91 points behind. More importantly, 豆包 Pro’s code execution score is 88.30, higher than Claude Sonnet 4.6’s 86.60. This shows that in pure coding tasks, 豆包 Pro is no longer just part of a “domestic alternative” narrative; it has genuinely entered the top tier.
The point of conflict within the leading group this week is clear: Claude Sonnet 4.6 wins on material constraint, while 豆包 Pro wins on code execution.
The gap comes from material constraint: Claude Sonnet 4.6 scores 79.80, while 豆包 Pro scores 75.70, a difference of 4.10 points. Since material constraint accounts for 45% of the main ranking weight, those 4 points are enough to keep 豆包 Pro outside the championship spot.
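The published weighting can be checked directly against the component scores quoted above. A minimal sketch (the dictionary layout and function name are mine; the weights and scores are the article's own):

```python
# Weights from the YZ Index main-ranking formula:
# Main Ranking = 0.55 x Code Execution + 0.45 x Material Constraint
WEIGHTS = {"code_execution": 0.55, "material_constraint": 0.45}

# Component scores for the top two models, as published this week.
scores = {
    "Claude Sonnet 4.6": {"code_execution": 86.60, "material_constraint": 79.80},
    "豆包 Pro": {"code_execution": 88.30, "material_constraint": 75.70},
}

def main_ranking(model_scores):
    """Weighted sum of the two auditable dimensions, rounded to 2 decimals."""
    return round(sum(WEIGHTS[k] * v for k, v in model_scores.items()), 2)

for name, s in scores.items():
    print(f"{name}: {main_ranking(s)}")
```

Running this reproduces the published totals of 83.54 and 82.63, confirming that the 4.10-point material-constraint gap is what keeps the overall margin at 0.91 despite 豆包 Pro’s lead in code execution.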
The Upgrade List Is Lively, but Do Not Mistake “Joining” for “Improvement”
In this week’s change list, 文心一言 4.5 shows Main Ranking ↑72, DeepSeek V4 Pro ↑65.2, Qwen3 Max ↑64.9, Gemini 3.1 Pro ↑63.6, Claude Opus 4.7 ↑62.5, GPT-5.5 ↑59.6, and Grok 4 ↑41.5. Conversely, DeepSeek V3, DeepSeek R1, 文心一言 4.0, Grok 3, Qwen Max, Claude Opus 4.6, and GPT-4o all show declines.
This must be made clear: these large changes mainly come from first-time inclusion in, or exit from, the evaluation, not from the same model suddenly surging or collapsing this week. Interpreting “exit from evaluation” as a performance decline is a misreading; interpreting “first-time inclusion” as an improvement within the week is also a misreading.
What is truly worth watching is where the new lineup stands in the current ranking. Claude Opus 4.7 ranks third with 81.12, Gemini 3.1 Pro ranks fourth with 79.24, and Gemini 2.5 Pro ranks fifth with 78.45. 文心一言 4.5 scores 78.17, DeepSeek V4 Pro scores 77.73, and Qwen3 Max scores 77.21; the three are clustered within 1 point, forming the second group.
GPT-o3’s Material Constraint Is the Hardest Real Change This Week
Among comparable changes, the one most worth watching is GPT-o3: material constraint +20.9. Its current main ranking score is 75.69, with 77.80 in code execution and 73.10 in material constraint, placing it ninth. Its ranking is not particularly high, but the substantial repair in material constraint is important, because this ability directly affects whether a model can answer based on given evidence rather than filling in by intuition.
If GPT-o3 can raise code execution in tandem later on, it has a chance to move from the third group toward the second group. For now, its engineering judgment (side ranking, AI-assisted evaluation) stands at 51.30; whatever one makes of that side-ranking showing, the main ranking remains constrained by the two hard metrics of code and material.
Grok 4’s 49.20 Is Not a Minor Mistake
The most glaring data point this week is Grok 4: main ranking 49.20, code execution 53.70, material constraint 43.70, ranking 11th. It trails 10th-place GPT-5.5, which scores 73.20, by a full 24.00 points. This is not “slightly weaker”; it is a cliff-like gap.
Even more troublesome is that Grok 4 is low on both main-ranking dimensions: code execution has not held up, and material constraint is an even bigger drag. For a model intended to enter serious production environments, a material constraint score of 43.70 means it carries high risk when completing tasks based on provided materials.
Side-Ranking Signals: Sonnet and 豆包 Are Shoring Up Weaknesses
This week, Claude Sonnet 4.6’s engineering judgment (side ranking, AI-assisted evaluation) rose by +10.2, while 豆包 Pro’s engineering judgment (side ranking, AI-assisted evaluation) rose by +10.1. This shows that leading models are not only optimizing for the main ranking, but also improving their ability to make trade-offs in complex tasks. However, it must be emphasized that engineering judgment is a side ranking and is not included in the main-ranking calculation; it cannot be used to replace code execution and material constraint.
Gemini 2.5 Pro’s code execution -5.4 is a warning sign. Its current main ranking score is still 78.45, but code execution has fallen to 79.80. If material constraint cannot continue to provide support later, its position in the second group will be under sustained pressure from 文心一言 4.5, DeepSeek V4 Pro, and Qwen3 Max.
This Week’s Conclusion: The Focus of Competition Has Shifted from “Who Can Talk” to “Who Makes Fewer Mistakes”
This week’s ranking reveals three signals: first, Claude Sonnet 4.6 is still the strongest overall, but 豆包 Pro has narrowed the championship gap to within 1 point; second, GPT-o3’s repair in material constraint deserves continued observation; third, Grok 4 currently lacks the foundation to compete with the mainstream first and second groups.
In addition, stability is not included in the main ranking. It measures the consistency of a model’s responses across multiple similar questions and is calculated based on the standard deviation of scores; it is not accuracy. Treating the stability score as answer accuracy is a misreading of what the evaluation means.
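The article does not publish the exact stability formula, only that it is derived from the standard deviation of scores across multiple similar questions. A hypothetical sketch of that kind of consistency measure (the mapping from standard deviation to a 0–100 scale is my assumption, not the YZ Index definition):

```python
import statistics

def stability(run_scores, max_std=10.0):
    # Hypothetical consistency measure: lower dispersion across repeated
    # similar questions -> higher stability. The linear 0-100 scaling and
    # the max_std cap are assumptions, not the published YZ Index formula.
    sd = statistics.pstdev(run_scores)
    return round(max(0.0, 100.0 * (1 - min(sd, max_std) / max_std)), 1)

consistent = [78, 79, 78, 80, 79]  # tight cluster: answers barely vary
erratic = [60, 95, 70, 88, 55]     # wide spread: large run-to-run swings

print(stability(consistent))  # high
print(stability(erratic))     # low
```

The point the sketch illustrates is the one the article makes: a model can be perfectly consistent (low standard deviation, high stability) while being consistently wrong, so stability must not be read as accuracy.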
What is most worth watching in the coming week is not who shouts the loudest, but who can make fewer mistakes in both code execution and material constraint; the model war has entered the stage where error rate determines the outcome.
Data source: YZ Index | Run #112
© 2026 Winzheng.com 赢政天下 | Please credit the source and link to the original when reposting.