ERNIE 4.5 Code Execution Plummets from 95 to 50, Main Score Drops 27.2 Points in a Single Day

May 25, 2026 | Winzheng Index

ERNIE Bot 4.5, Code Execution, Smoke Test, Baidu Large Model, Single-Day Fluctuation

In today's Smoke evaluation, ERNIE 4.5 dropped from 88.48 to 61.25 on the main leaderboard, a single-day decline of 27.2 points. The main driver is the code execution dimension, which fell from 95.00 to 50.00, while material constraint declined slightly from 80.50 to 75.00.
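The day-over-day changes quoted above can be checked with a few lines of arithmetic. The sketch below uses only the scores reported in this article; the dictionary keys are illustrative labels, not identifiers from the leaderboard itself, and the weighting that turns dimension scores into the main score is not stated here, so it is not modeled.

```python
# Scores quoted in the article, as (yesterday, today) pairs.
scores = {
    "main": (88.48, 61.25),
    "code_execution": (95.00, 50.00),
    "material_constraint": (80.50, 75.00),
}

# Day-over-day change per metric, rounded to two decimals.
deltas = {
    name: round(today - yesterday, 2)
    for name, (yesterday, today) in scores.items()
}

# The dimension (main score excluded) with the largest drop.
worst = min((name for name in deltas if name != "main"), key=deltas.get)

for name, delta in deltas.items():
    print(f"{name:>20}: {delta:+.2f}")
print("largest dimension drop:", worst)
```

The 45-point drop in code execution is roughly eight times the 5.5-point drop in material constraint, which is why the analysis concentrates on the code questions.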

Sampling Fluctuation or Genuine Degradation

The Smoke evaluation uses only 10 questions per day, 2 per dimension, so the sample is extremely small and large single-day swings are normal. Even so, code execution lost 45 points in one run, far more than the 5.5-point drop in material constraint, which shows that the problem is concentrated in code-related tasks. Yesterday's 95 means the model scored near-perfectly on both code questions. Today's 50 is consistent with either an outright failure on one question or serious errors or refusals on both.
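To see how coarse a two-question dimension is, consider a simplified model in which each question is graded pass or fail and the dimension score is the percentage passed. This is an assumption made for illustration only: the reported 95 suggests the real evaluation awards partial credit, and the 90% pass rate below is a hypothetical value, not a measured one.

```python
from math import comb

def score_distribution(p_pass: float, n_questions: int = 2) -> dict[float, float]:
    """Map each possible dimension score to its probability.

    Assumes independent pass/fail questions with the same pass probability.
    """
    dist = {}
    for passed in range(n_questions + 1):
        score = 100.0 * passed / n_questions
        prob = (
            comb(n_questions, passed)
            * p_pass**passed
            * (1.0 - p_pass) ** (n_questions - passed)
        )
        dist[score] = prob
    return dist

# Hypothetical model that passes 90% of code questions on average.
dist = score_distribution(0.9)
p_at_most_50 = sum(prob for score, prob in dist.items() if score <= 50.0)

print(sorted(dist))                       # possible scores: 0, 50, 100
print(f"P(score <= 50) = {p_at_most_50:.2f}")
```

Under these assumptions the only possible scores are 0, 50, and 100, and a model that is stable at a 90% pass rate would still post 50 or lower on roughly one day in five.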

If the daily draw happens to land on high-difficulty code questions, the score falls naturally without any change in the model; if the model itself generates code inconsistently, a longer observation window is needed to show it. Single-day data alone cannot establish capability degradation.

Recent Industry Dynamics Impact

Baidu recently shifted its update focus for ERNIE 4.5 toward multimodal understanding and long-text summarization, reducing the resources devoted to code-specific optimization. Meanwhile, other domestic models are iterating faster on code benchmarks, which objectively raises the difficulty baseline for comparable questions. The two side dimensions, engineering judgment and task expression, each rose by 20 points today, which suggests that the model's response strategy on non-code tasks may have been adjusted.

The integrity rating changed from warn to pass, indicating that this run's responses showed no significant hallucinations or compliance violations and that basic compliance actually improved.

Should We Pay Close Attention

For now, this is most likely a sampling fluctuation, but the model should be observed for 3 to 5 consecutive evaluation days. If the code execution dimension stays below 70 points throughout, a dedicated retest should be run to check whether training data or alignment strategies have changed.
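The monitoring rule described here can be written down as a small check. This is a minimal sketch, not the index's own tooling: the function name and the default three-day window are assumptions (the article gives a range of 3 to 5 days), and every score after the first two in the examples is made up for illustration.

```python
def needs_retest(daily_scores: list[float], threshold: float = 70.0, window: int = 3) -> bool:
    """Return True when the last `window` daily scores are all below `threshold`.

    With fewer than `window` days of data the rule stays silent.
    """
    if len(daily_scores) < window:
        return False
    return all(score < threshold for score in daily_scores[-window:])

# The two code execution scores reported so far: not enough data yet.
print(needs_retest([95.0, 50.0]))

# Hypothetical continuation where the score stays low: retest is triggered.
print(needs_retest([95.0, 50.0, 55.0, 60.0]))

# Hypothetical continuation where the score rebounds: no retest.
print(needs_retest([95.0, 50.0, 90.0, 50.0]))
```

Requiring every day in the window to fall below the threshold keeps a single bad draw from triggering a retest, which matches the article's reading of today's result as a probable sampling effect.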

The single-day 27.2-point fluctuation itself does not constitute evidence of model capability collapse, but it exposes the sensitivity of Smoke evaluation under small samples. Follow-up conclusions should be drawn based on weekly leaderboard data with larger samples.
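The benefit of a larger sample can be estimated with the standard error of a pass rate. The sketch below again assumes independent pass/fail questions; the 90% pass rate and the 14-question weekly pool (seven daily draws of two questions) are hypothetical, since the article does not state the weekly leaderboard's sample size.

```python
from math import sqrt

def score_standard_error(p_pass: float, n_questions: int) -> float:
    """Standard error, in score points, of a 0-100 score averaged over
    n independent pass/fail questions with pass probability p_pass."""
    return 100.0 * sqrt(p_pass * (1.0 - p_pass) / n_questions)

daily = score_standard_error(0.9, 2)    # two questions per dimension per day
weekly = score_standard_error(0.9, 14)  # hypothetical pool of seven daily draws

print(f"daily standard error:  {daily:.1f} points")
print(f"weekly standard error: {weekly:.1f} points")
print(f"ratio: {daily / weekly:.2f}")
```

Under these assumptions the daily score carries a standard error of about 21 points against about 8 for the weekly pool, a reduction by a factor of the square root of 7.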

A halving of code execution may just be the price of sampling, but continuous monitoring is the only standard for judging the true state of a model.

Data source: YZ Index | Run #130
