2026-06-28 Smoke lightweight benchmark results show Doubao Pro topped the main leaderboard with 98.61 points (Execution 100 points, Material Constraint 96.9 points), with its perfect Execution score serving as the core advantage.
Score Structure Comparison
Doubao Pro pairs an Execution score of 100 with a Constraint score of 96.9, and the weighted result, 0.55×100 + 0.45×96.9 ≈ 98.61, widens its lead over the other models. Gemini 3.1 Pro scored 91.21 points on the main leaderboard (Execution 91.7, Constraint 90.6); with only 1.1 points between the two dimensions, it has the most balanced profile. DeepSeek V4 Pro scored 87.35 points on the main leaderboard (Execution 83.3, Constraint 92.3), with Constraint stronger than Execution.
GPT-5.5 scored 84.18 points on the main leaderboard (Execution 75, Constraint 95.4). Grok 4 and GPT-o3 also show Constraint scores of 95.4 but Execution scores between 72 and 75. Claude Opus 4.7 and Sonnet 4.6 have Constraint scores of 97.7 and 95.6 respectively, yet due to Execution scores of 50, their main leaderboard scores are only 71.47 and 70.52 points.
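The weighting can be checked against the published figures with a minimal sketch. It assumes the 0.55/0.45 weights quoted for Doubao Pro apply to every model; the function and variable names are illustrative and do not come from the YZ Index.

```python
# Weights taken from the article's formula 0.55×Execution + 0.45×Constraint.
# Assumption: the same weights are used for every model on the leaderboard.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def weighted_score(execution: float, constraint: float) -> float:
    """Combine the two dimension scores into a main-leaderboard score."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraint


# (Execution, Constraint, published main-leaderboard score) as quoted in the text.
PUBLISHED = {
    "Doubao Pro": (100.0, 96.9, 98.61),
    "Gemini 3.1 Pro": (91.7, 90.6, 91.21),
    "DeepSeek V4 Pro": (83.3, 92.3, 87.35),
    "GPT-5.5": (75.0, 95.4, 84.18),
    "Claude Opus 4.7": (50.0, 97.7, 71.47),
    "Sonnet 4.6": (50.0, 95.6, 70.52),
}

for model, (execution, constraint, published) in PUBLISHED.items():
    computed = weighted_score(execution, constraint)
    # Dimension scores are quoted to one decimal place, so allow a small rounding gap.
    assert abs(computed - published) < 0.01, model
    print(f"{model}: computed {computed:.3f}, published {published}")
```

Under that assumption, all six quoted scores are reproduced to within rounding, which suggests the main leaderboard is a fixed weighted sum of the two dimensions.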
Changes Since Yesterday
Claude Opus 4.7 dropped 25.7 points on the main leaderboard and Sonnet 4.6 dropped 25.9 points, both because Execution fell from 100 points yesterday to 50 points today while Material Constraint remained high. ERNIE Bot 4.5's Execution fell from 62.5 points yesterday to 35.6 points, causing a 13.5-point drop on the main leaderboard. Doubao Pro's Execution, by contrast, rose from 75 points yesterday to 100 points, driving a 15.2-point increase on the main leaderboard.
Fluctuations in the Execution dimension directly affect main leaderboard rankings, while the Constraint dimension generally remains above 90 points without significant decline.
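To see how much of each day-over-day change the Execution dimension explains, here is a rough sketch. It assumes the Execution weight of 0.55 from the score formula applied on both days. The remainder it prints is inferred, not reported by the source, and the names are illustrative.

```python
# Execution weight from the article's formula 0.55×Execution + 0.45×Constraint.
# Assumption: the same weight applied yesterday and today.
EXECUTION_WEIGHT = 0.55


def execution_contribution(previous: float, current: float) -> float:
    """Main-leaderboard change explained by the Execution change alone."""
    return EXECUTION_WEIGHT * (current - previous)


# (Execution yesterday, Execution today, reported main-leaderboard change)
CHANGES = {
    "Claude Opus 4.7": (100.0, 50.0, -25.7),
    "Sonnet 4.6": (100.0, 50.0, -25.9),
    "ERNIE Bot 4.5": (62.5, 35.6, -13.5),
    "Doubao Pro": (75.0, 100.0, 15.2),
}

for model, (previous, current, reported) in CHANGES.items():
    from_execution = execution_contribution(previous, current)
    # Whatever Execution does not explain is attributed here to the
    # Constraint dimension and rounding; the source does not report it.
    remainder = reported - from_execution
    print(
        f"{model}: Execution {from_execution:+.2f}, "
        f"remainder {remainder:+.2f}, reported {reported:+.1f}"
    )
```

For all four models the Execution term is far larger in magnitude than the remainder, consistent with the observation that Execution swings drive the ranking changes.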
Anomaly Signal Interpretation
The halving of Execution scores for both Claude models may reflect reduced consistency on the code execution tasks in that day's 10-question quick test. ERNIE Bot 4.5's Execution score of 35.6 likewise points to high volatility on the execution side. Doubao Pro's perfect Execution score may stem from stable output on similar tasks.
The Material Constraint dimension remains high overall, with Claude Opus 4.7's 97.7 points the highest of the day, indicating that this dimension continues to support the overall scores of most models.
The gap between Execution scores of 50 and 100 has become the most direct driver of today's Smoke benchmark rankings.
Data source: YZ Index | Run #201
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original article when reprinting