Ernie Bot's Execution Score Plummets 50 Points as Smoke Light Test Shakes Up Today's Main Leaderboard

The most striking result in today's Smoke light evaluation is that Ernie Bot 4.5's execution score fell from 100 yesterday to 50, dragging its main leaderboard score down 11 points, from 73.96 to 62.96. This is not a minor fluctuation but a clear collapse in a core capability.

Behind the Halved Execution Score: Abnormal Signals from Ernie Bot

The execution dimension accounts for 55% of the main leaderboard weight. Ernie Bot's current score of 50 means that at least half of the 10 code execution questions did not pass. The execution dimension dropped 50 points in a single day while the constraint dimension rose, which indicates that the problem is concentrated in the code generation and verification stages. Possible causes include reduced compatibility with tool call formats after a model update, or tightened internal security policies that truncate code output. Either way, the drop exposes a lack of engineering consistency.
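
The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not the index's published formula: it assumes the main score is a linear weighted sum and that the remaining 45% of the weight belongs entirely to the constraint dimension. The function name `implied_constraint_change` is hypothetical.

```python
# Assumption: the article states a 55% weight for execution; assigning the
# remaining 45% to the constraint dimension is an assumption made here.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 1.0 - EXECUTION_WEIGHT


def implied_constraint_change(main_delta: float, execution_delta: float) -> float:
    """Constraint-score change implied by a main-score change and an execution change."""
    execution_contribution = EXECUTION_WEIGHT * execution_delta
    return (main_delta - execution_contribution) / CONSTRAINT_WEIGHT


# Figures reported for Ernie Bot 4.5.
main_delta = 62.96 - 73.96      # main leaderboard change, about -11
execution_delta = 50 - 100      # execution dimension change, -50

execution_contribution = EXECUTION_WEIGHT * execution_delta
constraint_change = implied_constraint_change(main_delta, execution_delta)

print(f"Execution contribution to main score: {execution_contribution:+.2f}")  # -27.50
print(f"Implied constraint change: {constraint_change:+.2f}")                  # +36.67
```

Under this assumed weighting, the execution drop alone would cost 27.5 main-leaderboard points, so the reported 11-point decline is only consistent with a constraint gain of roughly 36.7 points. If the index includes other dimensions or a non-linear aggregation, the implied figure would differ.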

GPT-o3 and GPT-5.5 Recover in Sync

GPT-o3's main leaderboard score rose 35.8 points in a single day, with execution up 50 points and constraint up 18.5, almost fully recovering from yesterday's low. GPT-5.5 also rose 13.4 points, with its constraint dimension improving by 29.8 points. The simultaneous recovery of both models suggests a recent unified optimization of the reasoning chain on OpenAI's side. Notably, their material constraint scores still trail Claude's by 2 to 3 points, which leaves room to improve at strictly following user-provided materials without fabricating content.

Claude Duo Continues to Dominate the Top Two

Claude Opus 4.7 scores 99.42 on the main leaderboard, with execution at 100 and constraint at 98.7, and has held first place for multiple days. Claude Sonnet 4.6 follows closely at 99.01. Both models have material constraint scores above 97, far ahead of the third tier, which is consistent with Anthropic's long-term investment in alignment and constraint following. Doubao Pro entered the top five with 98.43 points and a constraint score of 96.5, and its integrity status changed from warn to pass, showing that its ability to follow source materials in Chinese scenarios is approaching the international first tier.

Collective Bottleneck for Mid-Tier Models

Gemini 3.1 Pro and Qwen3 Max both score around 92 on the main leaderboard, with constraint scores stuck in the 82-83 range. Their gap from the top five comes mainly from material constraint rather than execution. DeepSeek V4 Pro, with a constraint score of 79.8, is stuck at the same bottleneck. A clear stratification is forming across the industry: the top five models have achieved near-perfect execution scores, and the next stage of competition will revolve entirely around material constraint.
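
To illustrate why constraint is the deciding factor once execution is maxed out, the sketch below recomputes main scores from the two dimensions. It assumes a linear 55/45 split between execution and constraint (only the 55% execution weight is stated above) and assumes an execution score of 100 for the mid-tier models, which the report does not give. The helper names `main_score` and `constraint_needed` are hypothetical.

```python
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45  # assumption: the rest of the weight is constraint


def main_score(execution: float, constraint: float) -> float:
    """Main leaderboard score as an assumed linear weighted sum."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraint


def constraint_needed(target_main: float, execution: float = 100.0) -> float:
    """Constraint score required to reach a target main score at a given execution score."""
    return (target_main - EXECUTION_WEIGHT * execution) / CONSTRAINT_WEIGHT


# Sanity check against a model whose two dimension scores are both reported.
opus = main_score(100, 98.7)           # reported main score: 99.42

# Mid-tier models, assuming perfect execution and constraint of 82 to 83.
mid_low = main_score(100, 82.0)
mid_high = main_score(100, 83.0)

# Constraint score a mid-tier model would need to match Doubao Pro's 98.43.
needed = constraint_needed(98.43)

print(f"Claude Opus 4.7 recomputed: {opus:.2f}")
print(f"Mid-tier range at execution 100: {mid_low:.2f} to {mid_high:.2f}")
print(f"Constraint needed for 98.43: {needed:.2f}")
```

With these assumptions the recomputed values line up with the reported ones: a constraint score in the low 80s caps the main score near 92 even at perfect execution, and reaching the top five requires a constraint score in the mid-90s.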

Execution scores can be quickly fixed, but constraint capabilities require long-term alignment investment.

Today's data once again confirms this assessment. If Ernie Bot wants to return to the first tier, it must solve the execution consistency problem in the next update; otherwise, it will continue to fall behind.


Data source: YZ Index | Run #138