GPT-o3 Main Leaderboard Drops 18 Points, Doubao Pro Surges 35.8 Overnight to Break into Top Five

GPT-o3 showed clear anomalies in today's Smoke evaluation: its main leaderboard score fell from around 76 points yesterday to 58.08, and its execution score fell from the 90-point range to 47.5. This is not a minor fluctuation but a near-halving of the execution score.

Execution dimension determines the day's ranking

In the core_overall formula, code execution carries a weight of 0.55 and material constraint a weight of 0.45. Today's top five all scored between 95 and 97.5 on execution, with GPT-5.5, the two Claude versions, and Doubao Pro all reaching 97.5. GPT-o3's 47.5 leaves its overall score about 32 points behind the leader (58.08 versus 90.3), which clearly locates the problem in code execution.
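The weighting described above can be sketched as a simple weighted sum. This is a minimal illustration that assumes core_overall is exactly the linear combination of the two dimension scores; the function and variable names are hypothetical, and only the weights and scores come from the article.

```python
# Weights as stated in the article; names are illustrative assumptions.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def core_overall(execution: float, constraint: float) -> float:
    """Weighted sum of the two dimension scores (each assumed to be on a 0-100 scale)."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraint


# GPT-5.5's reported scores today: execution 97.5, constraint 81.5.
gpt55_overall = core_overall(97.5, 81.5)

# Cost of a 50-point execution shortfall (97.5 vs 47.5), holding constraint equal.
execution_gap_cost = EXECUTION_WEIGHT * (97.5 - 47.5)

print(round(gpt55_overall, 2))
print(round(execution_gap_cost, 2))
```

Under this assumption, GPT-5.5's reported dimension scores reproduce its 90.3 on the main leaderboard, and a 50-point execution shortfall alone accounts for 27.5 overall points.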

Doubao Pro's execution score also rose 47.5 points today, to 97.5, and its constraint score jumped 21.5 points from yesterday's low. Weighted at 0.55 and 0.45, those gains add up to the 35.8-point overnight rise, for a final main leaderboard score of 89.85 and a place in the top four. This suggests a marked improvement on the single-day test, including in material constraint, rather than an overall model upgrade.
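Because the overall score is a linear combination, a day-over-day change splits cleanly into per-dimension contributions. The sketch below assumes the 47.5 and 21.5 figures are Doubao Pro's execution and constraint gains respectively; the helper name is hypothetical.

```python
def overall_delta(d_execution: float, d_constraint: float,
                  w_execution: float = 0.55, w_constraint: float = 0.45):
    """Split a change in the overall score into per-dimension contributions."""
    execution_part = w_execution * d_execution
    constraint_part = w_constraint * d_constraint
    return execution_part, constraint_part, execution_part + constraint_part


# Assumed gains for Doubao Pro: +47.5 on execution, +21.5 on constraint.
execution_part, constraint_part, total = overall_delta(47.5, 21.5)

print(f"execution contribution:  {execution_part:.3f}")
print(f"constraint contribution: {constraint_part:.3f}")
print(f"total change:            {total:.2f}")
```

On this reading, the total matches the 35.8-point overnight rise in the headline, with the execution gain contributing close to three quarters of it.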

What the anomaly signals

The sharp drop in GPT-o3 and the change in ERNIE Bot 4.5's credibility rating from fail to warn are today's two most noteworthy signals. GPT-o3's execution score collapsed; ERNIE Bot 4.5, though still mid-tier at 88.48 on the main leaderboard, now carries a yellow warning flag on credibility.

Industry expectations for GPT-o3 had centered on reasoning and tool invocation. The near-halving of its execution score may be tied to specific scenarios among the 10 code questions tested today. The Smoke evaluation runs at a fixed time, 3 a.m. daily, on a fixed sample set, so fluctuations are typically small. A single-day drop of 18 points is outside the normal range.

Top-tier landscape remains stable, new models catching up fast

GPT-5.5 continues to hold first place with a score of 90.3 (execution 97.5, constraint 81.5), showing no obvious weakness in either dimension. Claude Opus 4.7 and Sonnet 4.6 are tied for second with a main leaderboard score of 90.08, indicating that Anthropic still trails narrowly on material constraint while its execution capability has caught up with GPT-5.5.
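Where only the overall and execution scores are reported, the constraint score can be back-solved from the weights. This is a rough sketch that assumes the linear core_overall formula and the 97.5 execution score reported for both Claude versions; published scores are rounded, so the result is approximate, and the helper name is hypothetical.

```python
def implied_constraint(overall: float, execution: float,
                       w_execution: float = 0.55, w_constraint: float = 0.45) -> float:
    """Solve overall = w_execution * execution + w_constraint * constraint for constraint."""
    return (overall - w_execution * execution) / w_constraint


# Sanity check against GPT-5.5, whose constraint score (81.5) is reported directly.
gpt55_constraint = implied_constraint(90.3, 97.5)

# Claude Opus 4.7 / Sonnet 4.6: overall 90.08, execution 97.5 (both reported).
claude_constraint = implied_constraint(90.08, 97.5)

print(round(gpt55_constraint, 2))
print(round(claude_constraint, 2))
```

Under these assumptions the Claude models' implied constraint score comes out at roughly 81.0, about half a point below GPT-5.5's 81.5.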

Gemini 3.1 Pro and 2.5 Pro rose by 34.7 and 33.7 points respectively, with execution scores improving from the 50-point range to 95, which suggests Google has made targeted optimizations for code execution consistency. Qwen3 Max and DeepSeek V4 Pro remain in the lower tier, held back mainly by low constraint scores.

Execution capability has become the weakest link in the current model competition. The drastic single-day score swings expose how unstable some models are in real code scenarios.

Today's data confirm once again that a difference of just 3-4 points in material constraint scores can separate the top five from the mid-tier, while a collapse in the execution score takes a model out of contention altogether.


Data source: YZ Index | Run #129