In the YZ Index Smoke lightweight evaluation of June 27, 2026, Claude Opus 4.7 ranked first on the main leaderboard with 97.12 points, scoring a perfect 100 in code execution and 93.6 in material constraint.
Perfect Execution with a Constraint Shortcoming
Claude Opus 4.7 scored 100 points in code execution and 93.6 points in material constraint. Under the formula 0.55 × execution + 0.45 × constraint, that gives a main leaderboard score of 97.12. Claude Sonnet 4.6 also scored 100 points in execution, with a constraint score of 92.1, for a main leaderboard score of 96.45. Both models have hit the ceiling in the execution dimension, so their constraint scores of 93.6 and 92.1 are the only thing holding their overall scores below 100.
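The weighting can be checked with a few lines of Python. This is a minimal sketch, not the YZ Index's own scoring code: the function name `main_score` is hypothetical, and rounding half-up to two decimals is an assumption chosen because it reproduces the published figures.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the formula quoted in the article.
EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def main_score(execution: str, constraint: str) -> Decimal:
    """Weighted main-leaderboard score, rounded half-up to two decimals.

    The rounding rule is an assumption; the article only gives the weights.
    Scores are passed as strings so Decimal keeps them exact.
    """
    raw = EXECUTION_WEIGHT * Decimal(execution) + CONSTRAINT_WEIGHT * Decimal(constraint)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(main_score("100", "93.6"))  # Claude Opus 4.7   -> 97.12
print(main_score("100", "92.1"))  # Claude Sonnet 4.6 -> 96.45
```

Decimal arithmetic is used here because the Sonnet 4.6 result sits exactly on a rounding boundary (96.445), where binary floating point could round either way.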
A Different Combination of Execution and Constraint in the Mid-tier Group
Three models, 豆包 Pro (Doubao Pro), Gemini 3.1 Pro, and GPT-5.5, tied at 83.37 points on the main leaderboard, each scoring 75 points in execution and 93.6 points in constraint. They match Claude Opus 4.7 in material constraint but trail it by 25 points in code execution, which at a weight of 0.55 translates into a 13.75-point deficit on the main leaderboard.
DeepSeek V4 Pro scored 82.16 points on the main leaderboard, with 75 points in execution and 90.9 in constraint. GPT-o3 scored 81.84 points, with 75 points in execution and 90.2 in constraint. Both models match the mid-tier group's execution score of 75 but fall short of its 93.6 in constraint, which leaves them further behind the top five.
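The gap between any model and the leader can be split into an execution part and a constraint part using the same weights. The sketch below is illustrative only: `gap_to_leader` is a hypothetical helper, and the inputs are the per-dimension scores reported above.

```python
from decimal import Decimal

# Weights taken from the formula quoted in the article.
EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def gap_to_leader(leader, model):
    """Return (execution part, constraint part) of the weighted score gap.

    Each argument is an (execution, constraint) pair of score strings.
    """
    exec_part = EXECUTION_WEIGHT * (Decimal(leader[0]) - Decimal(model[0]))
    cons_part = CONSTRAINT_WEIGHT * (Decimal(leader[1]) - Decimal(model[1]))
    return exec_part, cons_part


opus = ("100", "93.6")      # Claude Opus 4.7
mid_tier = ("75", "93.6")   # 豆包 Pro, Gemini 3.1 Pro, GPT-5.5
deepseek = ("75", "90.9")   # DeepSeek V4 Pro

print(gap_to_leader(opus, mid_tier))  # execution part 13.75, constraint part 0
print(gap_to_leader(opus, deepseek))  # execution part 13.75, constraint part 1.215
```

For the three tied models the whole deficit comes from execution. For DeepSeek V4 Pro the constraint shortfall adds a further 1.215 points before rounding.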
Ranking Changes Due to Declining Execution Scores
Compared to yesterday, 文心一言 4.5 (ERNIE Bot 4.5) dropped 23.8 points on the main leaderboard, with execution down 37.5 points and constraint down 7 points. Gemini 2.5 Pro dropped 22.6 points on the main leaderboard, with execution also down 37.5 points. Qwen3 Max saw a 41.2-point decline in execution and a 22.6-point drop on the main leaderboard. DeepSeek V4 Pro's execution fell by 25 points, and its main leaderboard score dropped by 15.1 points. Grok 4's execution declined by 27.5 points, with a 15.1-point drop on the main leaderboard. Because execution carries a weight of 0.55, these execution declines account for most of each model's main leaderboard drop and are the main reason the models slipped in today's ranking.
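The day-over-day drops can be sanity-checked against the weights. The following is a rough sketch that assumes the same 0.55/0.45 weighting applied yesterday and that the reported per-dimension changes are rounded to one decimal; `weighted_drop` is a hypothetical helper name.

```python
from decimal import Decimal

# Weights taken from the formula quoted in the article;
# assumed unchanged between yesterday's and today's runs.
EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def weighted_drop(execution_drop: str, constraint_drop: str = "0") -> Decimal:
    """Main-score drop implied by the per-dimension drops."""
    return (EXECUTION_WEIGHT * Decimal(execution_drop)
            + CONSTRAINT_WEIGHT * Decimal(constraint_drop))


# 文心一言 4.5: both dimension changes are reported.
print(weighted_drop("37.5", "7"))  # 23.775, reported as 23.8

# Grok 4: only the execution change is reported, so this is the execution share alone.
print(weighted_drop("27.5"))       # 15.125, reported as 15.1
```

For 文心一言 4.5 the two reported changes reproduce the 23.8-point drop. For Grok 4 the execution share alone already matches the reported 15.1 points, which suggests, though the report does not state it, that its constraint score was roughly unchanged.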
Constraint Dimension Relatively Stable
Today, all models passed the integrity rating for material constraint. Qwen3 Max scored 95.9 points in constraint, the only model exceeding 95 points, but its execution was only 58.8 points, resulting in a main leaderboard score of 75.5 points. Gemini 2.5 Pro scored 91.4 points in constraint, and 文心一言 4.5 scored 90.2 points, both in the lower-middle range.
The Smoke evaluation covers only 10 quick-test questions per day, yet the balance between execution and constraint clearly separated the models' scoring profiles. The Claude series holds a clear lead in execution, and execution is also where the remaining models have the most room to improve.
Data source: YZ Index (Winzheng) | Run #200
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article.