Doubao Pro tops Smoke benchmark with 98.61 points, Claude's Execution plummets to 50 points

Jun 28, 2026 - Winzheng Index

Doubao Pro | Claude Opus | Execution Dimension | Material Constraints | Smoke Test

The Smoke lightweight benchmark results for 2026-06-28 show Doubao Pro topping the main leaderboard with 98.61 points (Execution 100 points, Material Constraint 96.9 points). Its perfect Execution score is the core advantage.

Score Structure Comparison

Doubao Pro pairs an Execution score of 100 with a Constraint score of 96.9, and the weighted result of 0.55×100+0.45×96.9 ≈ 98.61 puts it well ahead of the other models. Gemini 3.1 Pro scored 91.21 points on the main leaderboard (Execution 91.7, Constraint 90.6); with only 1.1 points between the two dimensions, it has the most balanced structure. DeepSeek V4 Pro scored 87.35 points on the main leaderboard (Execution 83.3, Constraint 92.3), with Constraint stronger than Execution.

GPT-5.5 scored 84.18 points on the main leaderboard (Execution 75, Constraint 95.4). Grok 4 and GPT-o3 also posted Constraint scores of 95.4, with Execution scores between 72 and 75. Claude Opus 4.7 and Claude Sonnet 4.6 have Constraint scores of 97.7 and 95.6 respectively, yet because both scored 50 on Execution, their main leaderboard scores are only 71.47 and 70.52 points.
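The weighting quoted for Doubao Pro (0.55 for Execution, 0.45 for Constraint) can be checked against the other reported scores. The following is a minimal sketch, not the index's own scoring code: the function and variable names are hypothetical, and half-up rounding to two decimals is an assumption that happens to reproduce the published figures. Only models with both dimension scores stated in the article are included.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the article's formula 0.55×Execution + 0.45×Constraint.
W_EXECUTION = Decimal("0.55")
W_CONSTRAINT = Decimal("0.45")


def main_score(execution: str, constraint: str) -> Decimal:
    """Weighted main-leaderboard score, rounded to two decimals.

    Half-up rounding is an assumption; the article does not state the rule.
    """
    raw = W_EXECUTION * Decimal(execution) + W_CONSTRAINT * Decimal(constraint)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (Execution, Constraint, reported main score) as quoted in the article.
REPORTED = {
    "Doubao Pro": ("100", "96.9", "98.61"),
    "Gemini 3.1 Pro": ("91.7", "90.6", "91.21"),
    "DeepSeek V4 Pro": ("83.3", "92.3", "87.35"),
    "GPT-5.5": ("75", "95.4", "84.18"),
    "Claude Opus 4.7": ("50", "97.7", "71.47"),
    "Claude Sonnet 4.6": ("50", "95.6", "70.52"),
}


def check_all() -> dict:
    """Map each model to whether the recomputed score equals the reported one."""
    return {
        name: main_score(execution, constraint) == Decimal(reported)
        for name, (execution, constraint, reported) in REPORTED.items()
    }


if __name__ == "__main__":
    for name, matches in check_all().items():
        print(f"{name}: {'matches' if matches else 'differs'}")
```

Decimal arithmetic is used here so that values such as 98.605 round predictably; plain floats can land on either side of the halfway point.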

Yesterday's Changes Analysis

Claude Opus 4.7 dropped 25.7 points on the main leaderboard and Sonnet 4.6 dropped 25.9 points. In both cases the cause was the Execution dimension falling from 100 points yesterday to 50 points, while Material Constraint remained high. ERNIE Bot 4.5's Execution fell from 62.5 points yesterday to 35.6 points, causing a 13.5-point drop on the main leaderboard. Doubao Pro's Execution, by contrast, rose from 75 points yesterday to 100 points, driving a 15.2-point increase on the main leaderboard.
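The day-over-day changes can be split into their two components. The sketch below assumes the same 0.55/0.45 weighting applied on both days (the article only states it for today) and backs out the Constraint change implied by each reported main-score change. The function name and the dictionary are hypothetical, and because the published deltas are rounded to one decimal, the results are approximate.

```python
# Weights from the article's formula; assumed unchanged between the two days.
W_EXECUTION = 0.55
W_CONSTRAINT = 0.45


def implied_constraint_delta(main_delta: float, execution_delta: float) -> float:
    """Solve main_delta = 0.55*execution_delta + 0.45*constraint_delta
    for constraint_delta."""
    return (main_delta - W_EXECUTION * execution_delta) / W_CONSTRAINT


# (main-score change, Execution change = today - yesterday) from the article.
CHANGES = {
    "Claude Opus 4.7": (-25.7, 50.0 - 100.0),
    "Claude Sonnet 4.6": (-25.9, 50.0 - 100.0),
    "ERNIE Bot 4.5": (-13.5, 35.6 - 62.5),
    "Doubao Pro": (15.2, 100.0 - 75.0),
}

if __name__ == "__main__":
    for name, (main_delta, execution_delta) in CHANGES.items():
        execution_part = W_EXECUTION * execution_delta
        constraint_delta = implied_constraint_delta(main_delta, execution_delta)
        print(
            f"{name}: Execution contributes {execution_part:+.2f}, "
            f"implied Constraint change {constraint_delta:+.2f}"
        )
```

Under these assumptions, a 50-point Execution drop alone accounts for 27.5 main-score points, so the slightly smaller reported drops for the two Claude models imply a modest Constraint gain (about 4 points for Claude Opus 4.7). That is consistent with the statement that Material Constraint remained high.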

Fluctuations in the Execution dimension directly affect main leaderboard rankings, while the Constraint dimension generally remains above 90 points without significant decline.

Anomaly Signal Interpretation

The halving of Execution scores for both Claude models may reflect a decrease in consistency on code execution tasks in that day's 10-question quick test. ERNIE Bot 4.5's Execution score of 35.6 also indicates large volatility on the execution side. Doubao Pro's perfect Execution score may stem from stable output on similar tasks.

The Material Constraint dimension remains high overall, and Claude Opus 4.7's 97.7 points is the highest of the day, indicating that this dimension still works in most models' favor.

The gap between Execution scores of 50 and 100 has become the most direct driver of today's Smoke benchmark rankings.
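A short calculation illustrates why the Execution gap dominates. This is a sketch under two assumptions: the 0.55/0.45 weighting quoted earlier, and a Constraint scale that tops out at 100 (the article does not state the maximum). The helper name is hypothetical.

```python
from decimal import Decimal

# Weights from the article's formula 0.55×Execution + 0.45×Constraint.
W_EXECUTION = Decimal("0.55")
W_CONSTRAINT = Decimal("0.45")
MAX_CONSTRAINT = Decimal("100")  # assumed upper bound of the Constraint scale


def main_score_ceiling(execution: str) -> Decimal:
    """Best possible main score for a given Execution score,
    i.e. the score if Constraint were perfect."""
    return W_EXECUTION * Decimal(execution) + W_CONSTRAINT * MAX_CONSTRAINT


# Main-score points separating Execution 100 from Execution 50.
execution_swing = W_EXECUTION * (Decimal("100") - Decimal("50"))

if __name__ == "__main__":
    print("Swing from the Execution gap:", execution_swing)
    print("Ceiling with Execution 50:", main_score_ceiling("50"))
```

With Execution at 50, the ceiling works out to 72.5 points, which is below GPT-5.5's reported 84.18. Under these assumptions, no Constraint score could lift either Claude model past the models that scored 75 or higher on Execution.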

Data source: YZ Index | Run #201

