Key finding from this week’s YZ Index v6 main leaderboard: six legacy models exited at once, five new models entered simultaneously, and the top ten rankings underwent a broad replacement in a single week.
Exits and Entries: Data Directly Reflects Iteration Speed
DeepSeek V3’s main leaderboard score dropped from 75.1 in v5 to zero. DeepSeek R1, ERNIE Bot 4.0, Grok 3, Qwen Max, Claude Opus 4.6, and GPT-4o all exited evaluation this week. In the same period, seven new models entered the main leaderboard: Qwen3 Max at 68.5, ERNIE Bot 4.5 at 67.0, DeepSeek V4 Pro at 65.3, Gemini 3.1 Pro at 65.2, Grok 4 at 64.9, Claude Opus 4.7 at 63.9, and GPT-5.5 at 62.9.
This approach of "clearing old scores to zero and scoring new ones from scratch" directly lowered the main leaderboard average by approximately 4.8 points, indicating that the evaluator is retiring older entries in favor of higher-version or newly trained models.
Code Execution Remains the Key Deciding Factor
Claude Sonnet 4.6, currently the top-ranked model on the main leaderboard, scores 86.80 in code execution, while 豆包 Pro surpasses it at 89.80 to lead the code execution dimension. DeepSeek V4 Pro scores 86.70 and Grok 4 scores 86.80, both close to 豆包 Pro, so the new models show no gap in code capability.
The material constraint dimension, however, shows clear divergence. GPT-o3 improved by 18.1 points in a single week, indicating targeted optimization in instruction following and context consistency. Conversely, 豆包 Pro dropped 5.7 points and Gemini 2.5 Pro fell by 5 points, suggesting that some models have regressed on long-context or multi-turn constraints.
Engineering Judgment Side Leaderboard (AI-Assisted Evaluation) Sees Minor Fluctuations
Claude Sonnet 4.6 scores 52.90 in engineering judgment, and Claude Opus 4.7 scores 55.80, with both Claude models maintaining the lead on the engineering judgment side leaderboard. Qwen3 Max, entering for the first time, immediately scored 45.20, tying with Grok 4, indicating that the new model has already approached the top tier in requirement decomposition and solution feasibility.
Main Leaderboard Real Rankings and Weight Impact
The main leaderboard score is calculated as core_overall = 0.55 × code execution + 0.45 × material constraints. Although 豆包 Pro has the highest code execution score, its material constraint score of 70.80 drags it down to a main leaderboard score of 81.25, second only to Claude Sonnet 4.6’s 83.02. Gemini 2.5 Pro scores 79.04 on the main leaderboard, with its material constraint score of 71.50 as its main weakness.
New entrant Qwen3 Max scores 78.98 on the main leaderboard, with code execution at 85.50 and material constraints at 71.00. Its overall performance already surpasses GPT-5.5 and DeepSeek V4 Pro, placing it in the top six on debut, a clear disruptive impact.
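The weighting above can be checked with a few lines of Python. This is a minimal sketch, not the YZ Index's own scoring code: the function name `core_overall`, the use of `Decimal`, and half-up rounding to two decimals are assumptions made for illustration, while the weights and dimension scores are the ones quoted in this article.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights as stated in the article's core_overall formula.
CODE_WEIGHT = Decimal("0.55")
MATERIAL_WEIGHT = Decimal("0.45")


def core_overall(code_execution: str, material_constraints: str) -> Decimal:
    """Weighted main leaderboard score, rounded to two decimals.

    Half-up rounding is an assumption; the article does not state a rounding rule.
    """
    raw = (CODE_WEIGHT * Decimal(code_execution)
           + MATERIAL_WEIGHT * Decimal(material_constraints))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Dimension scores quoted in the article: (code execution, material constraints).
quoted_scores = {
    "豆包 Pro": ("89.80", "70.80"),
    "Qwen3 Max": ("85.50", "71.00"),
}

for model, (code, material) in quoted_scores.items():
    print(f"{model}: {core_overall(code, material)}")
```

Under these assumptions the sketch reproduces the published main leaderboard scores of 81.25 for 豆包 Pro and 78.98 for Qwen3 Max.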
The collective exit of legacy models does not signal failure; it reflects vendors concentrating resources on next-generation products. The new model’s debut score of 68.5 means the next round of main leaderboard competition will be even fiercer.
From the current rankings, Claude Sonnet 4.6 and 豆包 Pro remain firmly in the top two, but Grok 4, Claude Opus 4.7, Gemini 2.5 Pro, and Qwen3 Max have already formed a second tier, with gaps among them within 3 points.
Next week, it is worth closely tracking whether GPT-o3’s material constraint improvement of 18.1 points can be sustained, and whether 豆包 Pro can pull its material constraints back above 72 points. If both happen simultaneously, the top three positions on the main leaderboard will be reshuffled again.
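The 豆包 Pro half of that scenario can be worked through with the published formula. The sketch below is a what-if under stated assumptions, not a forecast: it assumes 豆包 Pro's code execution stays at 89.80, that Claude Sonnet 4.6 holds its 83.02, and that scores round half-up to two decimals. GPT-o3 is left out because its dimension scores are not quoted in this article, and all variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights from the article: core_overall = 0.55 * code + 0.45 * material.
CODE_WEIGHT = Decimal("0.55")
MATERIAL_WEIGHT = Decimal("0.45")
TWO_PLACES = Decimal("0.01")

# Figures quoted in the article.
doubao_code = Decimal("89.80")      # 豆包 Pro code execution (assumed unchanged)
leader_score = Decimal("83.02")     # Claude Sonnet 4.6 main score (assumed unchanged)
target_material = Decimal("72")     # threshold named in the article

# Main leaderboard score if material constraints recover to exactly 72.
projected = (CODE_WEIGHT * doubao_code
             + MATERIAL_WEIGHT * target_material).quantize(
    TWO_PLACES, rounding=ROUND_HALF_UP)

# Material constraint score needed to match the current leader.
required_material = ((leader_score - CODE_WEIGHT * doubao_code)
                     / MATERIAL_WEIGHT).quantize(
    TWO_PLACES, rounding=ROUND_HALF_UP)

print(f"Projected score at 72: {projected}")
print(f"Material score needed to reach {leader_score}: {required_material}")
```

Under these assumptions, a material constraint score of exactly 72 gives 豆包 Pro 81.79 on the main leaderboard, and matching Claude Sonnet 4.6's 83.02 would take roughly 74.73.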
Under the rules of YZ Index v6, only two auditable dimensions, code execution and material constraints, determine main leaderboard rankings; the rest are side leaderboards or operational signals. Vendors that want to improve their rankings quickly must make progress in both dimensions at once, because single-point breakthroughs are no longer enough to change the landscape.
Data source: YZ Index | Run #122
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.