11 Models in Transition: Grok 4 Tops the Charts, DeepSeek Series Exits En Masse

Jun 8, 2026 | Winzheng Index

Grok 4 | Code Execution | New Model Debut | Main Ranking | Engineering Judgment

The clearest signal from this week's YZ Index v6 main ranking is turnover: older models exit en masse while new models arrive all at once. Among the seven debut models, Qwen3 Max (80.9 points), Grok 4 (79.0 points), and ERNIE Bot 4.5 (79.0 points) enter the top tier directly, pushing seven older models (DeepSeek V3, DeepSeek R1, ERNIE 4.0, Grok 3, Qwen Max, Claude Opus 4.6, and GPT-4o) out of the evaluation pool in one go.

New Models Debut with High Scores, Older Models Exit Faster Than Expected

Core formula: core_overall = 0.55 × Code Execution + 0.45 × Material Constraints. This week, new models generally scored between 87 and 94 on Code Execution: Doubao Pro scored 94.60, Grok 4 scored 93.90, and Qwen3 Max scored 89.70, all well above the departing GPT-4o (59.8) and Claude Opus 4.6 (61.6). The same holds for Material Constraints, where Claude Opus 4.7 reached 87.50, far surpassing the older Claude.
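As a minimal sketch, the core formula can be written as a small Python function. The function and constant names here are illustrative and do not come from the YZ Index itself; the sub-scores are the ones this article reports for Grok 4 and Claude Opus 4.7.

```python
# Weights taken from the article's core formula; the names are illustrative.
CODE_EXECUTION_WEIGHT = 0.55
MATERIAL_CONSTRAINTS_WEIGHT = 0.45


def core_overall(code_execution: float, material_constraints: float) -> float:
    """Weighted main-ranking score from the two core dimensions."""
    return (CODE_EXECUTION_WEIGHT * code_execution
            + MATERIAL_CONSTRAINTS_WEIGHT * material_constraints)


# (Code Execution, Material Constraints) sub-scores reported in the article.
reported = {
    "Grok 4": (93.90, 85.00),
    "Claude Opus 4.7": (90.30, 87.50),
}

for model, (code, material) in reported.items():
    print(f"{model}: {core_overall(code, material):.3f}")
# Grok 4: 89.895
# Claude Opus 4.7: 89.040
```

Rounded to two decimals, these values match the published main scores of 89.90 and 89.04.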

This is not incremental iteration but a generation gap between versions. At the end of 2025, older models were generally stuck in the 70–75 range on Material Constraints; new models have raised the ceiling to 85+, invalidating the old rankings within a single week.

The Real Foundation Behind Grok 4's Top Spot

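Grok 4, currently ranked first, has a main score of 89.90, with Code Execution at 93.90, Material Constraints at 85.00, and Engineering Judgment at 82.10. It trails only Doubao Pro in Code Execution but leads Doubao Pro by 3.4 points in Material Constraints. At a weight of 0.45, that lead is worth 1.53 points on the main score, which outweighs Doubao Pro's 0.7-point Code Execution edge (worth 0.385 points at a weight of 0.55) and pushes Doubao Pro down to third place.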

Claude Opus 4.7 follows closely with 89.04 points. Its Material Constraints score of 87.50 is currently the highest, and its Engineering Judgment score of 93.10 (a side ranking with AI-assisted evaluation) is also the strongest. However, its Code Execution score of 90.30 trails Grok 4 by 3.6 points, leaving it in second place, 0.86 points behind.
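The head-to-head gaps above can be split into per-dimension contributions, each equal to the weight times the score difference. The sketch below is illustrative: the dictionary keys and function name are hypothetical, and Doubao Pro's Material Constraints score of 81.60 is not quoted in the article but derived from Grok 4's stated 3.4-point lead.

```python
WEIGHTS = {"code_execution": 0.55, "material_constraints": 0.45}

# Sub-scores from the article. Doubao Pro's material_constraints value is
# derived (85.00 - 3.4), not quoted directly.
SCORES = {
    "Grok 4": {"code_execution": 93.90, "material_constraints": 85.00},
    "Doubao Pro": {"code_execution": 94.60, "material_constraints": 81.60},
    "Claude Opus 4.7": {"code_execution": 90.30, "material_constraints": 87.50},
}


def gap_contributions(leader: str, rival: str) -> dict[str, float]:
    """Per-dimension contribution to the main-score gap (leader minus rival)."""
    return {
        dim: weight * (SCORES[leader][dim] - SCORES[rival][dim])
        for dim, weight in WEIGHTS.items()
    }


for rival in ("Doubao Pro", "Claude Opus 4.7"):
    parts = gap_contributions("Grok 4", rival)
    total = sum(parts.values())
    detail = ", ".join(f"{dim} {value:+.3f}" for dim, value in parts.items())
    print(f"Grok 4 vs {rival}: {detail}, total {total:+.3f}")
# Grok 4 vs Doubao Pro: code_execution -0.385, material_constraints +1.530, total +1.145
# Grok 4 vs Claude Opus 4.7: code_execution +1.980, material_constraints -1.125, total +0.855
```

Read this way, Grok 4 beats Doubao Pro on Material Constraints and beats Claude Opus 4.7 on Code Execution; in both cases the winning dimension more than covers the deficit in the other.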

Side Ranking Signal: Significant Divergence in Task Expression

GPT-o3's Task Expression score surged by 62.5 points in a single week, Claude Sonnet 4.6 rose by 57.8 points, and Gemini 2.5 Pro gained 54.6 points. These gains far exceed the changes in the main ranking, indicating that models still have room for rapid iteration in instruction following and multi-turn conversation consistency.

It is worth noting that the Stability dimension (calculated from the standard deviation of scores) is not directly reflected in this week's main ranking. Fluctuations across repeated answers to similar questions still need continuous tracking: a model with a Stability score of 31.7 may show output drift in actual deployment.
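The article says only that Stability is based on score standard deviation and does not give the index's formula. The sketch below therefore shows just the underlying dispersion statistic, not the Stability score itself; the function name and both sets of run scores are hypothetical.

```python
from statistics import mean, pstdev


def score_spread(run_scores: list[float]) -> tuple[float, float]:
    """Return (mean, population standard deviation) for repeated-run scores."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to measure spread")
    return mean(run_scores), pstdev(run_scores)


# Hypothetical scores for the same question set across five runs
# (illustrative values, not YZ Index data).
steady = [88.0, 89.0, 88.5, 89.5, 88.0]
drifting = [92.0, 71.0, 85.0, 64.0, 90.0]

for label, runs in (("steady", steady), ("drifting", drifting)):
    avg, spread = score_spread(runs)
    print(f"{label}: mean {avg:.2f}, std dev {spread:.2f}")
```

A larger standard deviation across repeated runs is the kind of fluctuation the paragraph above warns about; how the index maps it onto a 0–100 Stability score is not stated in the source.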

Who Will Be the Variable Next Week

Among the seven new models, GPT-5.5 and ERNIE Bot 4.5 currently rank 10th and 11th, with Code Execution scores of 81.90 and 78.00 respectively, leaving room for improvement of 5–8 points. If they maintain their iteration pace next week, the top five positions in the main ranking will be squeezed further.

After the mass exit of older models, the "generation gap" in the evaluation pool has been flattened in one go. Future rankings will depend more on weekly increments than on historical accumulation.

New models debut at the top and older models are wiped out within a week: in 2026, AI rankings have entered the stage where "weekly updates determine survival."

Data source: YZ Index | Run #154
