In the YZ Index Smoke lightweight evaluation of 11 models on July 2, 2026, Gemini 3.1 Pro took first place on the main leaderboard with 82.97 points (Execution 75, Material Constraint 92.7), and 豆包 Pro ranked second with 81.98 points (Execution 75, Material Constraint 90.5). The two models tied for the highest Execution score.
Structural Differences in Execution and Constraint Determine Rankings
The main leaderboard score is calculated as 0.55 × Code Execution + 0.45 × Material Constraint. With an Execution score of 75, Gemini 3.1 Pro and 豆包 Pro each earned an Execution contribution of 41.25 points, well ahead of third-place Claude Opus 4.7 (Execution 58.3, Execution contribution 32.065). Claude Opus 4.7 scored 97 on Material Constraint, a contribution of 43.65 points, but its weaker Execution held its main leaderboard score to 75.72.
DeepSeek V4 Pro ranked fourth with Execution 61.1, Constraint 89.5, and a main leaderboard score of 73.88. GPT-o3 ranked fifth with Execution 50, Constraint 93.5, and a main leaderboard score of 69.58. Under the formula, every 10-point gain in Execution adds 5.5 points to the main leaderboard score, while a 10-point gain in Constraint adds only 4.5 points, which made Execution the more decisive dimension under that day's weighting.
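The weighted formula and the five scores quoted above can be checked with a short script. This is a minimal sketch, not the YZ Index's own scoring code: the names `main_score` and `REPORTED` are hypothetical, and rounding half-up to two decimals is an assumption that happens to reproduce the published figures.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the article: 0.55 x Execution + 0.45 x Constraint.
W_EXEC = Decimal("0.55")
W_CONS = Decimal("0.45")


def main_score(execution: str, constraint: str) -> Decimal:
    """Weighted main score, rounded half-up to 2 decimals (assumed rounding rule)."""
    raw = W_EXEC * Decimal(execution) + W_CONS * Decimal(constraint)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (Execution, Constraint, reported main score) as quoted in the article.
REPORTED = {
    "Gemini 3.1 Pro": ("75", "92.7", "82.97"),
    "豆包 Pro": ("75", "90.5", "81.98"),
    "Claude Opus 4.7": ("58.3", "97", "75.72"),
    "DeepSeek V4 Pro": ("61.1", "89.5", "73.88"),
    "GPT-o3": ("50", "93.5", "69.58"),
}

if __name__ == "__main__":
    for name, (execution, constraint, reported) in REPORTED.items():
        computed = main_score(execution, constraint)
        print(f"{name}: computed {computed}, reported {reported}")
```

Decimal arithmetic is used here because several raw totals end in exactly 5 at the third decimal (for example 82.965), where binary floating point can round in either direction.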
Analysis of the Day-over-Day Execution Score Decline
Compared with the previous day, Claude Sonnet 4.6's Execution score dropped by 44.5 points and its main leaderboard score by 25.4 points; GPT-5.5's Execution dropped by 39.5 points and its main leaderboard score by 22 points; DeepSeek V4 Pro's Execution dropped by 33.4 points and its main leaderboard score by 20 points. These declines cut each model's Execution contribution directly and pushed the models down the main leaderboard.
Qwen3 Max's Execution dropped by 31.2 points and its Constraint by 9.1 points, for a main leaderboard decline of 21.3 points. Gemini 2.5 Pro's Execution dropped by 26.3 points and its Constraint by 13 points, for a main leaderboard decline of 20.3 points. Among the models whose Execution declined, Claude Sonnet 4.6 and GPT-5.5 held their Constraint scores at 92.7 and 90.5 respectively, which suggests that the Constraint dimension is relatively stable and that Execution was the key variable in that day's rankings.
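Because the main score is linear in both dimensions, a day-over-day decline splits cleanly into an Execution part and a Constraint part. The sketch below applies the article's weights to the two models for which both drops are reported; `main_delta` is a hypothetical helper name, and small gaps against the published totals are expected since the inputs are rounded to one decimal.

```python
# Weights taken from the article's main leaderboard formula.
W_EXEC = 0.55
W_CONS = 0.45


def main_delta(exec_drop: float, cons_drop: float) -> tuple[float, float, float]:
    """Return (Execution part, Constraint part, total) of a main-score decline."""
    exec_part = W_EXEC * exec_drop
    cons_part = W_CONS * cons_drop
    return exec_part, cons_part, exec_part + cons_part


# (Execution drop, Constraint drop, reported main-score drop) from the article.
DROPS = {
    "Qwen3 Max": (31.2, 9.1, 21.3),
    "Gemini 2.5 Pro": (26.3, 13.0, 20.3),
}

if __name__ == "__main__":
    for name, (exec_drop, cons_drop, reported) in DROPS.items():
        exec_part, cons_part, total = main_delta(exec_drop, cons_drop)
        share = exec_part / total  # fraction of the decline due to Execution
        print(f"{name}: execution {exec_part:.2f} + constraint {cons_part:.2f} "
              f"= {total:.2f} (reported {reported}), execution share {share:.0%}")
```

For both models the Execution part accounts for the larger share of the decline, which is consistent with the paragraph's reading that Execution drove the day's movement.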
Execution Weakness of High-Constraint Models
Claude Opus 4.7, GPT-o3, Claude Sonnet 4.6, GPT-5.5, and Grok 4 all had Constraint scores above 91.7, but their Execution scores clustered in the 47.9–58.3 range. Their Constraint advantage did not translate into higher main leaderboard scores, which indicates that in the 10-question Smoke quick test, code execution tasks affect the final ranking more directly.
The bottom-ranked 文心一言 4.5 scored only 20.8 on Execution; despite a Constraint score of 86.9, its main leaderboard score was 50.55. Qwen3 Max scored 33.3 on Execution and 86.9 on Constraint, for a main leaderboard score of 57.42. Low Execution scores directly capped how high these models could place.
No abnormal signals were recorded that day. The Execution declines may reflect how well specific models handled that day's questions, but the data shows only score changes and includes no question-level detail.
Gemini 3.1 Pro and 豆包 Pro, both with Execution scores of 75, show that a Material Constraint score near 90 has become the baseline, while each additional 10 points of Code Execution is the decisive increment for main leaderboard ranking.
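The cap described above can be made concrete by asking what a model's main score would be if its Constraint score were perfect. This is a rough sketch under the assumption that both dimensions are scored on a 0–100 scale, which the article does not state explicitly; `main_ceiling` and `exec_needed` are hypothetical helper names.

```python
# Weights taken from the article's main leaderboard formula.
W_EXEC = 0.55
W_CONS = 0.45


def main_ceiling(execution: float, max_constraint: float = 100.0) -> float:
    """Best possible main score at a fixed Execution score (assumes a 0-100 scale)."""
    return W_EXEC * execution + W_CONS * max_constraint


def exec_needed(target_main: float, constraint: float) -> float:
    """Execution score required to reach target_main at a given Constraint score."""
    return (target_main - W_CONS * constraint) / W_EXEC


if __name__ == "__main__":
    # Execution scores reported in the article for the two lowest-ranked examples.
    for name, execution in {"文心一言 4.5": 20.8, "Qwen3 Max": 33.3}.items():
        print(f"{name}: ceiling with perfect Constraint = {main_ceiling(execution):.2f}")
    # Execution a model with Constraint 86.9 would need to match GPT-o3's 69.58.
    print(f"Execution needed at Constraint 86.9: {exec_needed(69.58, 86.9):.1f}")
```

Under that scale assumption, neither model could reach fifth-place GPT-o3's 69.58 through Constraint gains alone; only a higher Execution score would close the gap.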
Data source: YZ Index | Run #208
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article