In YZ Index's June 17, 2026 test of 11 models, Qwen3 Max's material constraint score fell from 100 points the previous day to 71.1 points, leaving its main leaderboard score at just 73.25 points and making it the day's most prominent anomaly.
Structural Differences in Execution and Constraint Determine Rankings
Claude Opus 4.7 scored a perfect 100 points in both code execution and material constraint, securing 100 points on the main leaderboard. The main score weights execution at 0.55 and constraint at 0.45, so 0.55×100 + 0.45×100 = 100 gave it an uncontested lead. Gemini 2.5 Pro, Gemini 3.1 Pro, and GPT-5.5 all scored 98.83 points on the main leaderboard, each with a perfect execution score of 100 points and a constraint score of 97.4 points, a highly consistent structure.
GPT-o3, Claude Sonnet 4.6, and DeepSeek V4 Pro all scored 100 points in execution, with constraint scores of either 94.8 or 94 points, which puts their main leaderboard scores at 97.66 or 97.3 points. 豆包 Pro, however, showed the reverse structure: 91.7 points in execution and 100 points in constraint, for a main leaderboard score of 95.44 points, which shows how much the material constraint weighting contributes to the final score.
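The main leaderboard figures quoted above can be reproduced from the two component scores. The following is a minimal sketch, assuming the 0.55/0.45 weighting quoted above applies to every model and that results are rounded half-up to two decimals. The function name `main_score` and the `reported` table are illustrative and not part of YZ Index.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the formula quoted in the article.
EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def main_score(execution: str, constraint: str) -> Decimal:
    """Weighted main leaderboard score.

    Inputs are decimal strings so the arithmetic is exact; rounding
    half-up to two decimals is an assumption about how scores are published.
    """
    raw = EXECUTION_WEIGHT * Decimal(execution) + CONSTRAINT_WEIGHT * Decimal(constraint)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (execution, constraint) pairs as reported in the article.
reported = {
    "Claude Opus 4.7": ("100", "100"),
    "Gemini 2.5 Pro": ("100", "97.4"),
    "豆包 Pro": ("91.7", "100"),
}

for model, (execution, constraint) in reported.items():
    print(f"{model}: {main_score(execution, constraint)}")
```

Decimal arithmetic is used here instead of floats because 豆包 Pro's unrounded score is exactly 95.435, a value that binary floating point could round in either direction.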
Yesterday's Comparison Reveals Signs of Execution Recovery
Gemini 2.5 Pro and Gemini 3.1 Pro each gained 53.8 points on the main leaderboard, with execution scores jumping to 100 points from an unreported baseline yesterday. GPT-5.5 gained 28.8 points on the main leaderboard, with execution rising to 100 points. DeepSeek V4 Pro gained 27.3 points, with execution also rising to 100 points. GPT-o3 gained 25.2 points, with execution rising to 100 points even though its constraint score dropped by 5.2 points.
These gains are driven primarily by perfect execution scores, indicating that several models had clear shortcomings in code execution tasks yesterday and have since recovered.
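The reported gains can be turned back into yesterday's main scores, and for GPT-o3 into a rough estimate of yesterday's execution score. This is only a sketch under stated assumptions: that a "gain" is today's main score minus yesterday's, that the same 0.55/0.45 weighting applied yesterday, and that GPT-o3's constraint score today is 94.8 (main score 97.66), which the article does not state explicitly. The function names are illustrative.

```python
from decimal import Decimal

EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def yesterday_main(today_main: str, gain: str) -> Decimal:
    """Yesterday's main score, assuming gain = today - yesterday."""
    return Decimal(today_main) - Decimal(gain)


def implied_execution(main: Decimal, constraint: Decimal) -> Decimal:
    """Solve main = 0.55 * execution + 0.45 * constraint for execution."""
    return (main - CONSTRAINT_WEIGHT * constraint) / EXECUTION_WEIGHT


# Gemini 2.5 Pro: 98.83 today after a 53.8-point gain.
gemini_yesterday = yesterday_main("98.83", "53.8")

# GPT-o3: ASSUMES today's constraint is 94.8, so yesterday's was 94.8 + 5.2.
o3_yesterday = yesterday_main("97.66", "25.2")
o3_constraint_yesterday = Decimal("94.8") + Decimal("5.2")
o3_execution_yesterday = implied_execution(o3_yesterday, o3_constraint_yesterday)

print(gemini_yesterday, o3_yesterday, round(o3_execution_yesterday, 2))
```

Under these assumptions GPT-o3's implied execution score yesterday comes out close to 50 points, which fits the reading that its gain came from execution recovery. Because the published gains are rounded to one decimal, the estimate is approximate.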
Anomalous Signals Point to Constraint Volatility
Qwen3 Max's material constraint score fell by 28.9 points, pulling its main leaderboard score down to 73.25 points from what was presumably a higher level yesterday. 文心一言 4.5 dropped 10.4 points on the main leaderboard, with an execution score of only 50 points against a constraint score of 97.4 points. The calculation 0.55×50 + 0.45×97.4 = 71.33 placed it at the bottom.
Grok 4 had an execution score of 66.7 points and a constraint score of 96.7 points, resulting in a main leaderboard score of 80.2 points, with its execution shortfall significantly dragging down its overall performance. These figures suggest that a sudden decline in material constraint is harder to recover from quickly than execution volatility.
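For the two models above whose execution and constraint scores are both reported, the main-score shortfall can be split by component. The sketch below assumes the same 0.55/0.45 weighting and measures each loss against a perfect 100. The helper `points_lost` is hypothetical and only illustrates the arithmetic.

```python
from decimal import Decimal

EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")
PERFECT = Decimal("100")


def points_lost(execution: str, constraint: str) -> tuple[Decimal, Decimal]:
    """Main-score points lost to each component, relative to a perfect 100."""
    execution_loss = EXECUTION_WEIGHT * (PERFECT - Decimal(execution))
    constraint_loss = CONSTRAINT_WEIGHT * (PERFECT - Decimal(constraint))
    return execution_loss, constraint_loss


# (model, execution, constraint) as reported in the article.
for model, execution, constraint in [
    ("文心一言 4.5", "50", "97.4"),
    ("Grok 4", "66.7", "96.7"),
]:
    execution_loss, constraint_loss = points_lost(execution, constraint)
    total = PERFECT - execution_loss - constraint_loss
    print(f"{model}: execution -{execution_loss}, constraint -{constraint_loss}, main {total}")
```

For both models the execution term accounts for nearly all of the shortfall: 27.5 of the 28.67 points lost for 文心一言 4.5, and 18.315 of the 19.8 points lost for Grok 4.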
A perfect execution score of 100 has become standard for mainstream models, and the variance in constraint scores is emerging as the new differentiator.
Today's Smoke data once again confirms that when execution scores converge, the stability of material constraint directly determines the final ranking on the main leaderboard.
Data source: YZ Index | Run #184
© 2026 Winzheng.com 赢政天下 | When republishing, please credit the source and include a link to the original article.