The Smoke lightweight benchmark for 2026-06-18 shows that Claude Opus 4.7, DeepSeek V4 Pro, 豆包 Pro, and GPT-o3 each scored 100 on both the code execution and material constraint dimensions, for a main leaderboard score of 100.
Structural Characteristics of Perfect-Score Models
All four models score 100 on both dimensions, so the formula 0.55×execution + 0.45×constraint leaves them with no weak side. Claude Sonnet 4.6 follows closely at 98.83 on the main leaderboard, with 100 in execution and 97.4 in constraint. Its 1.17-point gap comes solely from the 2.6-point loss on the constraint side, scaled by the 0.45 weight.
Gemini 3.1 Pro and GPT-5.5 are tied at 97.53 on the main leaderboard, with 100 in execution and 94.5 in constraint. Their constraint score is 5.5 points below that of the perfect-score models, which lowers the main leaderboard score by 2.47 points.
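The weighted formula above can be checked with a few lines of Python. This is a minimal sketch: the names `main_score` and `constraint_penalty` and the dictionary layout are illustrative assumptions, not part of the YZ Index; only the weights and scores come from this report.

```python
# Weights quoted in the report; all names below are illustrative assumptions.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def main_score(execution: float, constraint: float) -> float:
    """Return the weighted main leaderboard score."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraint


def constraint_penalty(constraint: float) -> float:
    """Main leaderboard points lost to a constraint score below 100."""
    return CONSTRAINT_WEIGHT * (100.0 - constraint)


# (execution, constraint) pairs as reported for 2026-06-18.
reported = {
    "Claude Opus 4.7": (100.0, 100.0),
    "Claude Sonnet 4.6": (100.0, 97.4),
    "Gemini 3.1 Pro": (100.0, 94.5),
}

for model, (execution, constraint) in reported.items():
    print(model,
          round(main_score(execution, constraint), 3),
          round(constraint_penalty(constraint), 3))
```

With execution at 100, the penalty term is the entire gap to a perfect score, which is why the constraint column alone separates these models.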
Perfect Execution vs. Differentiated Constraint
Among today's 11 models, 10 achieved 100 in the execution dimension, with only 文心一言4.5 scoring 97.4. The constraint dimension, however, ranges from 100 down to 71.1, a maximum gap of 28.9 points. 文心一言4.5 has identical execution and constraint scores of 97.4, resulting in a main leaderboard score of 97.4. Its profile is the most balanced outside the perfect-score group, but its main leaderboard score still trails Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.5.
Gemini 2.5 Pro scores 100 in execution and 91.8 in constraint, with a main leaderboard score of 96.31. Grok 4 and Qwen3 Max both score 100 in execution and 71.1 in constraint, with a main leaderboard score of 87, making them the lowest today.
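For readers who want to reproduce the ranking, the sketch below rebuilds today's leaderboard from the per-dimension scores quoted in this report. The variable names and data layout are illustrative assumptions; the benchmark's own tooling is not shown in the source.

```python
# (execution, constraint) per model, as quoted in this report for 2026-06-18.
today = {
    "Claude Opus 4.7": (100.0, 100.0),
    "DeepSeek V4 Pro": (100.0, 100.0),
    "豆包 Pro": (100.0, 100.0),
    "GPT-o3": (100.0, 100.0),
    "Claude Sonnet 4.6": (100.0, 97.4),
    "Gemini 3.1 Pro": (100.0, 94.5),
    "GPT-5.5": (100.0, 94.5),
    "文心一言4.5": (97.4, 97.4),
    "Gemini 2.5 Pro": (100.0, 91.8),
    "Grok 4": (100.0, 71.1),
    "Qwen3 Max": (100.0, 71.1),
}


def main_score(execution: float, constraint: float) -> float:
    """Weighted main leaderboard score (weights from the report)."""
    return 0.55 * execution + 0.45 * constraint


# Rank models from highest to lowest main score; ties keep insertion order.
ranked = sorted(today, key=lambda m: main_score(*today[m]), reverse=True)

constraints = [constraint for _, constraint in today.values()]
constraint_spread = max(constraints) - min(constraints)
perfect_execution = sum(1 for execution, _ in today.values() if execution == 100.0)

for position, model in enumerate(ranked, start=1):
    print(position, model, round(main_score(*today[model]), 2))
```

Here `perfect_execution` counts the models at 100 in execution and `constraint_spread` measures the range of the constraint column, the two quantities this section contrasts.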
Comparison with Yesterday and Anomaly Signals
Compared to yesterday, 文心一言4.5's main leaderboard score increased by 26.1 points, mainly driven by a 47.4-point improvement in execution. Qwen3 Max's main leaderboard score rose by 13.8 points, with execution improving by 25 points. Grok 4's main leaderboard score increased by 6.8 points, but its constraint dimension plunged by 25.6 points, partly offsetting the 33.3-point gain in execution.
After that drop, Grok 4's material constraint stands at only 71.1. Under the 0.45 weight, the 25.6-point drop alone costs about 11.5 points on the main leaderboard, and the full 28.9-point shortfall from 100 leaves a 13-point gap to the perfect-score models. 豆包 Pro's main leaderboard score increased by 4.6 points, with execution improving by 8.3 points. DeepSeek V4 Pro's main leaderboard score increased by 2.7 points, with constraint improving by 6 points.
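Because the main score is a linear combination, a day-over-day change splits cleanly into a contribution from each dimension. The sketch below illustrates this with Grok 4's reported deltas; the function name `delta_breakdown` is a hypothetical helper, and only the weights and deltas come from this report.

```python
# Weights from the report; the helper below is a hypothetical illustration.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def delta_breakdown(execution_delta: float, constraint_delta: float):
    """Split a main-score change into per-dimension contributions.

    Returns (from_execution, from_constraint, net_change).
    """
    from_execution = EXECUTION_WEIGHT * execution_delta
    from_constraint = CONSTRAINT_WEIGHT * constraint_delta
    return from_execution, from_constraint, from_execution + from_constraint


# Grok 4: execution +33.3, constraint -25.6 (figures quoted above).
exec_part, cons_part, net = delta_breakdown(33.3, -25.6)
print(f"execution: {exec_part:+.1f}, constraint: {cons_part:+.1f}, net: {net:+.1f}")
```

The constraint term is negative and cancels a large share of the execution gain, which matches the reported net increase of about 6.8 points.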
Structural Insights
As the execution dimension approaches saturation, material constraint has become the dividing line on the main leaderboard. The four perfect-score models gave up nothing in either dimension, while Claude Sonnet 4.6 and Gemini 3.1 Pro slipped in rank solely because of small constraint losses. The perfect 100 in execution for Grok 4 and Qwen3 Max could not compensate for their 71.1 in constraint: their code execution is outstanding, but their material constraint capability lags significantly.
Today, 文心一言4.5 improved both execution and constraint simultaneously, showing the most significant structural improvement. Grok 4's constraint plummeted in a single day, revealing clear instability in material constraint tasks.
When the execution dimension is universally at perfect scores, even a small gap in material constraint determines the divide between the top four and the following seven on the main leaderboard.
Data source: YZ Index | Run #186
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.