The clearest signal from this week's YZ Index v6 main ranking is turnover: older models exit en masse while new models arrive all at once. Of the seven debut models, Qwen3 Max (80.9 points), Grok 4 (79.0 points), and 文心一言 4.5 (79.0 points) enter the top tier directly, pushing seven older models (DeepSeek V3, R1, 文心 4.0, Grok 3, Qwen Max, Claude Opus 4.6, and GPT-4o) out of the evaluation pool in one go.
New Models Debut with High Scores, Older Models Exit Faster Than Expected
Core formula: core_overall = 0.55 × code execution (代码执行) + 0.45 × material constraint (材料约束). This week, new models generally scored in the 87–94 range on code execution. 豆包 Pro scored 94.60, Grok 4 scored 93.90, and Qwen3 Max scored 89.70, all well above the departing GPT-4o (59.8) and Claude Opus 4.6 (61.6). The same holds for material constraint: Claude Opus 4.7 reached 87.50, far surpassing the older Claude.
This is not incremental iteration but a direct reflection of a generation gap between versions. At the end of 2025, older models were generally stuck in the 70–75 range on material constraint, while the new models have raised the ceiling to 85+ in one step, invalidating the old rankings within a single week.
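The core formula above can be checked by hand. The following is a minimal sketch, assuming sub-scores are on a 0–100 scale and that published main scores are rounded to two decimals. The function and constant names are illustrative and are not part of YZ Index.

```python
# Weights taken from the core formula quoted in this article.
WEIGHT_CODE_EXECUTION = 0.55
WEIGHT_MATERIAL_CONSTRAINT = 0.45


def core_overall(code_execution: float, material_constraint: float) -> float:
    """Weighted main score; inputs are assumed to be on a 0-100 scale."""
    for score in (code_execution, material_constraint):
        if not 0.0 <= score <= 100.0:
            raise ValueError(f"score out of range: {score}")
    return (WEIGHT_CODE_EXECUTION * code_execution
            + WEIGHT_MATERIAL_CONSTRAINT * material_constraint)


# Claude Opus 4.7's sub-scores as reported in this article.
claude_opus_47 = core_overall(code_execution=90.30, material_constraint=87.50)
print(f"{claude_opus_47:.2f}")  # 89.04
```

The result agrees with the 89.04 main score the article reports for Claude Opus 4.7.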
The Real Foundation Behind Grok 4's Top Spot
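Currently ranked first, Grok 4 has a main score of 89.90, with code execution at 93.90, material constraint at 85.00, and engineering judgment (工程判断) at 82.10. It trails only 豆包 Pro in code execution, but leads 豆包 Pro by 3.4 points in material constraint. At the 0.45 weight, that lead is worth 1.53 points in the main score, which more than offsets Grok 4's 0.7-point deficit in code execution (0.385 points at the 0.55 weight) and pushes 豆包 Pro to third place.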
Claude Opus 4.7 follows closely with 89.04 points. Its material constraint score of 87.50 is currently the highest, and its engineering judgment score (side ranking, AI-assisted evaluation) of 93.10 is also the strongest. However, its code execution score of 90.30 trails Grok 4 by 3.6 points, leaving it in second place, 0.86 points behind.
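The gaps among the top three can be broken down into per-dimension contributions. The sketch below is a worked check under stated assumptions, not YZ Index's own tooling. 豆包 Pro's material constraint score of 81.60 is not stated directly in the article; it is derived here from Grok 4's 85.00 minus the reported 3.4-point lead. The helper name `weighted_gap` is hypothetical.

```python
WEIGHTS = {"code_execution": 0.55, "material_constraint": 0.45}

# Sub-scores as reported in the article. The material_constraint value for
# 豆包 Pro is inferred: Grok 4's 85.00 minus the stated 3.4-point lead.
SCORES = {
    "Grok 4": {"code_execution": 93.90, "material_constraint": 85.00},
    "Claude Opus 4.7": {"code_execution": 90.30, "material_constraint": 87.50},
    "豆包 Pro": {"code_execution": 94.60, "material_constraint": 81.60},
}


def weighted_gap(leader: str, rival: str) -> dict[str, float]:
    """Per-dimension contribution to (leader - rival) in the main score."""
    gap = {
        dim: WEIGHTS[dim] * (SCORES[leader][dim] - SCORES[rival][dim])
        for dim in WEIGHTS
    }
    gap["net"] = sum(gap.values())  # summed before the "net" key exists
    return gap


for rival in ("Claude Opus 4.7", "豆包 Pro"):
    parts = weighted_gap("Grok 4", rival)
    print(rival, {dim: round(value, 3) for dim, value in parts.items()})
```

Under these inputs, Grok 4's net lead works out to roughly 0.855 points over Claude Opus 4.7 and roughly 1.145 points over 豆包 Pro. In both cases the outcome hinges on how the 0.55 and 0.45 weights trade off the two dimensions.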
Side Ranking Signal: Significant Divergence in Task Expression
GPT-o3's task expression (任务表达) score surged by 62.5 points in a single week, Claude Sonnet 4.6 rose by 57.8 points, and Gemini 2.5 Pro gained 54.6 points. These gains far exceed the changes in the main ranking, indicating that models still have room for rapid iteration in instruction following and multi-turn conversation consistency.
Note that the stability (稳定性) dimension, which is calculated from the standard deviation of scores, is not directly reflected in the main ranking this week. Fluctuations in repeated answers to similar questions still need continuous tracking: a model with a stability score of 31.7 may exhibit output drift in real deployment.
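The article does not say how the standard deviation is mapped onto the published stability score, so the sketch below stops at the standard deviation itself. The repeated-run scores are hypothetical placeholders, not YZ Index data, and the function name `score_spread` is illustrative.

```python
import statistics


def score_spread(repeated_scores: list[float]) -> float:
    """Population standard deviation of scores from repeated runs of one task."""
    if len(repeated_scores) < 2:
        raise ValueError("need at least two runs to measure spread")
    return statistics.pstdev(repeated_scores)


# Hypothetical scores for the same prompt answered four times by two models.
steady_model = [80.0, 82.0, 78.0, 80.0]
drifting_model = [95.0, 60.0, 88.0, 77.0]

print(f"steady:   {score_spread(steady_model):.2f}")
print(f"drifting: {score_spread(drifting_model):.2f}")
```

Both hypothetical models average 80, so a mean-based main score would not tell them apart. Only the spread shows the second model's run-to-run drift.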
Who Will Be the Variable Next Week
Among the seven new models, GPT-5.5 and 文心一言 4.5 currently rank 10th and 11th, with code execution scores of 81.90 and 78.00 respectively, leaving 5–8 points of room for improvement. If they maintain their iteration pace next week, the top five positions in the main ranking will be squeezed further.
With the older models gone, the "generation gap" in the evaluation pool has been flattened. Future rankings will depend more on weekly gains than on historical accumulation.
New models debut at the top and older models are wiped out within a week. By 2026, AI rankings have entered a stage where "weekly updates determine survival."
Data source: YZ Index | Run #154
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article.