Today's Smoke lightweight evaluation upended expectations that mainstream models would perform steadily. Doubao Pro took first place with 91.23 points, scoring a perfect 100 in code execution and 80.5 in material constraint and passing the integrity rating. It was the only model with zero errors in the code section of the 10-question quick test.
Execution Scores Weak Across the Board; Test Difficulty May Have Risen
The remaining models performed poorly on code execution: Gemini 3.1 Pro had the second-highest execution score at 57.2, while others, including Claude Sonnet 4.6, Grok 4, Qwen3 Max, and GPT-5.5, all stalled at 50 points, and Gemini 2.5 Pro and 文心一言 4.5 scored 0. This is not a simple ranking shift but a cliff-like drop in execution scores.
Compared with yesterday's data, Gemini 2.5 Pro plunged 54.3 points on the main leaderboard, DeepSeek V4 Pro dropped 36.2 points, 文心一言 4.5 fell 36.7 points, and Grok 4 and Qwen3 Max declined by 34.7 and 34.3 points respectively. That so many execution scores halved or fell to zero on the same day suggests that today's 10 code tasks were significantly harder, rather than that the models themselves suddenly failed.
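The reasoning above can be sketched as a simple check: if nearly every model drops sharply on the same day, a change in the test is a more likely explanation than simultaneous model failures. The snippet below is a minimal illustration in Python that uses only the five day-over-day changes quoted above. The function name `looks_like_difficulty_shift` and both thresholds are hypothetical choices for illustration, not part of the YZ Index methodology.

```python
from statistics import median

# Day-over-day main-leaderboard changes quoted in this article (points).
deltas = {
    "Gemini 2.5 Pro": -54.3,
    "DeepSeek V4 Pro": -36.2,
    "文心一言 4.5": -36.7,
    "Grok 4": -34.7,
    "Qwen3 Max": -34.3,
}


def looks_like_difficulty_shift(deltas, drop_threshold=-20.0, min_share=0.8):
    """Hypothetical heuristic: flag a test-side change when most models
    drop by at least |drop_threshold| points on the same day.

    Both thresholds are assumptions chosen for illustration only.
    """
    values = list(deltas.values())
    # Share of models whose decline is at least as large as the threshold.
    share = sum(1 for d in values if d <= drop_threshold) / len(values)
    return share >= min_share, share, median(values)


shifted, share, med = looks_like_difficulty_shift(deltas)
print(f"large drops: {share:.0%} of models, median change: {med} points")
print("consistent with a harder test" if shifted else "no collective drop")
```

Because the input covers only the models whose declines are quoted here, the result describes those five models and not the full leaderboard.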
Material Constraints Relatively Stable, but Anomalies Emerge
On the material constraint dimension, most models remained in the 70–81 point range, with Claude Sonnet 4.6 leading at 81 points, followed closely by Gemini 3.1 Pro and Claude Opus 4.7. However, Claude Opus 4.7's constraint score plunged 17.6 points in a single day, indicating noticeable fluctuations even for models with strong constraint capabilities.
In terms of integrity ratings, Gemini 2.5 Pro, 文心一言 4.5, and Qwen3 Max shifted from warn or fail to pass, suggesting some improvement in compliant output. However, this did not offset the massive losses in the execution dimension.
Industry Insight: Code Ability Becomes a New Dividing Line
Doubao Pro's perfect execution score is consistent with continued optimization for engineering tasks. The inconsistency of other leading models in complex code scenarios exposes the limitations of current training and alignment strategies. Today's evaluation felt more like a stress test, revealing the vulnerability of most models under real-world engineering constraints.
Overall, Doubao Pro has established a clear generational advantage. If other models want to catch up, they must make targeted breakthroughs in the robustness of code execution; otherwise, the gap on the main leaderboard will continue to widen.
Code execution is no longer a bonus; it is the battlefield that determines survival.
Data source: YZ Index | Run #127
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article