In the Smoke lightweight evaluation of July 1, 2026, Claude Opus 4.7 ranked first on the main leaderboard with a score of 94.82, backed by a balanced profile: 94.5 in code execution and 95.2 in material constraint.
Top Three Show Closely Aligned Execution and Constraint Scores
Claude Opus 4.7 and Claude Sonnet 4.6 both scored 94.5 in code execution, with constraint scores of 95.2 and 94.8 respectively, leaving a gap of only 0.18 points between them on the main leaderboard. DeepSeek V4 Pro also scored 94.5 in execution, but its constraint score of 93 held its main leaderboard score to 93.83, 0.81 points behind second place.
GPT-5.5 scored 89.5 in execution and 91.2 in constraint, for a main leaderboard score of 90.27, a profile in which constraint slightly exceeds execution.
Clear Divergence: High Constraint, Low Execution
Grok 4 achieved a perfect constraint score of 100, but its execution was only 68.6, resulting in a main leaderboard score of 82.73. Gemini 2.5 Pro scored 97 in constraint and 64.5 in execution, with a main leaderboard score of 79.13. Qwen3 Max scored 96 in constraint and 64.5 in execution, with a main leaderboard score of 78.68.
豆包 Pro (Doubao Pro) scored 95.2 in constraint and 44.5 in execution, with a main leaderboard score of 67.32. Gemini 3.1 Pro scored 94.8 in constraint and 43 in execution, with a main leaderboard score of 66.31. 文心一言 4.5 (ERNIE Bot 4.5) scored 95.2 in constraint and 41.7 in execution, with a main leaderboard score of 65.78.
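The main leaderboard scores quoted above are consistent with a fixed weighted average of the two sub-scores. The sketch below assumes weights of 0.55 for execution and 0.45 for constraint. These weights are inferred from the published figures and are not stated by the source. The half-up rounding and the function name `main_score` are likewise assumptions made for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumed weights, inferred from the published figures (not stated by the source).
W_EXECUTION = Decimal("0.55")
W_CONSTRAINT = Decimal("0.45")


def main_score(execution: str, constraint: str) -> Decimal:
    """Weighted average of the two sub-scores, rounded half-up to 2 decimals."""
    raw = W_EXECUTION * Decimal(execution) + W_CONSTRAINT * Decimal(constraint)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (execution, constraint, reported main score), as quoted in the text.
REPORTED = {
    "Claude Opus 4.7": ("94.5", "95.2", "94.82"),
    "DeepSeek V4 Pro": ("94.5", "93", "93.83"),
    "GPT-5.5": ("89.5", "91.2", "90.27"),
    "Grok 4": ("68.6", "100", "82.73"),
    "Gemini 2.5 Pro": ("64.5", "97", "79.13"),
    "Qwen3 Max": ("64.5", "96", "78.68"),
    "豆包 Pro": ("44.5", "95.2", "67.32"),
    "Gemini 3.1 Pro": ("43", "94.8", "66.31"),
    "文心一言 4.5": ("41.7", "95.2", "65.78"),
}

# Collect any model whose recomputed score differs from the reported one.
mismatches = {
    name: (str(main_score(e, c)), reported)
    for name, (e, c, reported) in REPORTED.items()
    if main_score(e, c) != Decimal(reported)
}
print(mismatches)
```

Under these assumed weights, `main_score("94.5", "94.8")` for Claude Sonnet 4.6 gives 94.64, which matches the 0.18-point gap to first place. Agreement with the published numbers does not confirm that this is the actual scoring formula.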
Abnormal Fluctuations Compared to Yesterday
Gemini 3.1 Pro fell 32.2 points on the main leaderboard, with execution dropping 57 points. 豆包 Pro dropped 18.6 points on the main leaderboard, with execution declining 38.8 points. Grok 4 dropped 15.3 points on the main leaderboard, with execution falling 31.4 points.
Claude Sonnet 4.6 rose 12.1 points on the main leaderboard, with execution increasing 19.5 points. Claude Opus 4.7 rose 10.8 points on the main leaderboard, with execution increasing 21.7 points. Both Claude models consolidated their top-two positions through a rebound in execution scores.
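The previous day's scores are not quoted directly, but they can be backed out from today's figures and the reported changes. The snippet below is a minimal sketch of that subtraction. The helper name `previous` is hypothetical, and because the changes are reported to one decimal place, the implied main scores are approximate.

```python
from decimal import Decimal

# (today's main, change in main, today's execution, change in execution),
# as quoted in the text. Claude Sonnet 4.6's main score of 94.64 is implied
# by the 0.18-point gap to first place rather than stated directly.
CHANGES = {
    "Gemini 3.1 Pro": ("66.31", "-32.2", "43", "-57"),
    "豆包 Pro": ("67.32", "-18.6", "44.5", "-38.8"),
    "Grok 4": ("82.73", "-15.3", "68.6", "-31.4"),
    "Claude Sonnet 4.6": ("94.64", "12.1", "94.5", "19.5"),
    "Claude Opus 4.7": ("94.82", "10.8", "94.5", "21.7"),
}


def previous(today: str, change: str) -> Decimal:
    """Implied previous-day value: today's value minus the reported change."""
    return Decimal(today) - Decimal(change)


implied = {
    name: (previous(main, d_main), previous(execution, d_exec))
    for name, (main, d_main, execution, d_exec) in CHANGES.items()
}
for name, (prev_main, prev_exec) in implied.items():
    print(f"{name}: main {prev_main}, execution {prev_exec}")
```

On these figures, Gemini 3.1 Pro and Grok 4 both work out to an implied execution score of 100 on the previous day, while the two Claude models work out to roughly 73 to 75.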
Ranking Pressure from Structural Imbalance
When constraint scores approach or exceed 95 points, execution becomes the key variable determining main leaderboard rankings. In this run, every model with execution below 65 points finished below 80 on the main leaderboard, even with a constraint score near or above 95.
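To illustrate how much execution is needed to clear a given main leaderboard score, the sketch below solves a weighted average for execution. The 0.55 and 0.45 weights are an assumption inferred from the published scores and are not stated by the source. The function name `required_execution` is hypothetical.

```python
# Assumed weights, inferred from the published figures (not stated by the source).
W_EXECUTION = 0.55
W_CONSTRAINT = 0.45


def required_execution(target_main: float, constraint: float) -> float:
    """Execution score needed to reach target_main at a given constraint score."""
    return (target_main - W_CONSTRAINT * constraint) / W_EXECUTION


# Execution needed to reach a main score of 80 at several constraint levels.
for constraint in (95.0, 97.0, 100.0):
    needed = required_execution(80.0, constraint)
    print(f"constraint {constraint}: execution >= {needed:.2f}")
```

Under these assumed weights, a model with a constraint score of 95 would need an execution score of roughly 67.7 to reach 80 on the main leaderboard. Even a perfect constraint score of 100 would still require execution of about 63.6.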
文心一言 4.5 has an integrity rating of warn, while the remaining 10 models all have a rating of pass, indicating that most models maintain basic compliance in the material constraint dimension.
The balance between execution and constraint, rather than a perfect score on a single dimension, determines the final ordering of the Smoke leaderboard.
Data source: YZ Index | Run #206
© 2026 Winzheng.com 赢政天下 | When republishing, please credit the source and include a link to the original article.