ERNIE Bot 4.5 Code Execution Plummets from 100 to 50, Main Leaderboard Drops 11 Points in a Single Day

May 30, 2026 | Winzheng Index

ERNIE Bot 4.5 | Code Execution | Smoke Test | Baidu AI | Single-Day Fluctuation

In today's Smoke quick test, ERNIE Bot 4.5's main leaderboard score fell from 74 to 62.96, a decline of about 11 points. Code execution collapsed from 100 to 50, while material constraints edged up by only 4.5 points. This is not a minor fluctuation but a clear cliff in one of the core auditable dimensions.
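As a rough consistency check on these figures, the sketch below asks how much of the 11-point main leaderboard drop the two reported dimension changes would explain. It assumes the main score is the plain average of five equally weighted dimensions (10 questions at 2 per dimension); the source does not state the weighting, and the function name is hypothetical.

```python
def main_delta_from_dimensions(dimension_deltas, n_dimensions=5):
    """Change in the main score, ASSUMING it is the plain mean of
    n_dimensions equally weighted dimension scores (not confirmed by the source)."""
    return sum(dimension_deltas) / n_dimensions


# Figures reported in the article.
observed_delta = 62.96 - 74.0                 # main leaderboard change
explained_delta = main_delta_from_dimensions(
    [-50.0, 4.5]                              # code execution, material constraints
)
# Whatever is left over would have to come from the other main dimensions.
residual_delta = observed_delta - explained_delta

print(round(observed_delta, 2), round(explained_delta, 2), round(residual_delta, 2))
```

Under that assumption, code execution and material constraints together account for roughly 9 of the 11 points, leaving about 2 points to the remaining dimensions. If the previous score of 74 is itself rounded, the residual shifts accordingly.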

Sampling Fluctuation or True Degradation?

The Smoke evaluation uses only 10 questions per day (2 per main leaderboard dimension), so the sample is extremely small and the day-to-day standard deviation is naturally large. This time, the two code execution questions may have hit the model's weaker boundary cases, such as complex multi-file dependencies or specific library version conflicts, halving the score. The improvement in material constraints suggests the model has not systematically regressed on citation constraints.
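To put a rough number on how noisy a two-question dimension can be, here is a minimal sketch. It assumes each question is scored pass/fail, the two questions are independent, and both carry equal weight; the source confirms none of these, and the 90% pass rate is purely illustrative.

```python
from math import comb


def score_distribution(p_pass, n_questions=2):
    """Probability of each possible dimension score (0-100),
    ASSUMING independent pass/fail questions with equal weight."""
    dist = {}
    for passed in range(n_questions + 1):
        score = 100.0 * passed / n_questions
        dist[score] = (
            comb(n_questions, passed)
            * p_pass ** passed
            * (1 - p_pass) ** (n_questions - passed)
        )
    return dist


def score_std(p_pass, n_questions=2):
    """Standard deviation of the daily dimension score on a 0-100 scale."""
    return 100.0 * (p_pass * (1 - p_pass) / n_questions) ** 0.5


# Hypothetical model that passes 90% of code execution questions.
dist = score_distribution(0.9)
print({score: round(prob, 2) for score, prob in dist.items()})
print(round(score_std(0.9), 1))
```

Under these assumptions, a model with a steady 90% pass rate would still score 50 on roughly 18% of days, with a daily standard deviation of about 21 points. A single day at 50 is therefore weak evidence on its own, which is why the consecutive-day pattern matters more.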

However, this cannot be attributed entirely to luck. A drop from a perfect score to the passing line, a difference of 50 points, far exceeds the normal range of sampling fluctuation. If the score stays in the 50-60 range for two or three consecutive days, it is more likely that a recent update traded away performance on code paths.

Recent Industry Dynamics Comparison

At the end of March, Baidu reduced the inference cost of ERNIE Bot 4.5 by 30% and emphasized "more stable Chinese long texts." Cost optimization is often accompanied by adjustments in decoding strategy, in which some high-difficulty code scenarios are sacrificed for average response speed; this is consistent with the timing of the code execution collapse. Meanwhile, domestic competitors DeepSeek-V3 and Qwen2.5-72B have been intensifying their efforts on code benchmarks, and Baidu may have temporarily shifted resources toward Chinese-language scenarios rather than code capabilities.

The integrity rating has changed from fail to pass, indicating that the model did not produce hallucinations or out-of-bounds content in this quick test. This is a positive signal.

Should This Be a Major Concern?

Single-day data is insufficient to determine model degradation, but the code execution dimension directly affects real-world developer use cases. We recommend observing it for at least three consecutive days. If the dimension cannot return above 80 points in that window, ERNIE Bot 4.5 should be temporarily removed from the "all-round candidate" list, and alternative models with more stable code capabilities should be prioritized.
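The observation rule above can be written down as a small check. This is only a sketch of one reading of the recommendation: the function name, the default three-day window, and the choice to stay undecided when fewer days are available are all assumptions, not part of the index's methodology.

```python
def should_delist(daily_code_scores, threshold=80.0, window=3):
    """Return True when none of the last `window` daily code execution
    scores climbed back above `threshold`.

    Returns False (undecided) while fewer than `window` days are observed.
    """
    if len(daily_code_scores) < window:
        return False  # not enough consecutive days yet
    recent = daily_code_scores[-window:]
    return all(score <= threshold for score in recent)


# Hypothetical score sequences, oldest first.
print(should_delist([50.0, 55.0, 60.0]))   # stays low for three days
print(should_delist([50.0, 100.0, 50.0]))  # recovered above 80 on day two
print(should_delist([50.0]))               # only one day observed
```

Requiring every day in the window to stay at or below the threshold keeps a single noisy day from triggering removal, in line with the caution about single-day data.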

Of the two side leaderboard dimensions (engineering judgment and task expression), one fell and one rose this time, further indicating that the model's performance is diverging across task types rather than declining across the board.

A code execution score of 50 is not the end, but if it stays in this range for three consecutive days, ERNIE Bot 4.5 will truly cede the developer user base to others.

Data source: YZ Index | Run #138

