In today's Smoke evaluation, ERNIE 4.5 dropped from 88.48 to 61.25 on the main leaderboard, a single-day decline of 27.2 points. The main driver is the code execution dimension, which fell from 95.00 to 50.00; material constraint declined only slightly, from 80.50 to 75.00.
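The day-over-day changes quoted above can be recomputed with a few lines of Python. This is only a sketch for checking the arithmetic: the dictionary keys and the `deltas` helper are illustrative names, not part of any YZ Index tooling, and the numbers are the ones stated in this article.

```python
# Scores quoted in the article, stored as (yesterday, today).
# Key names are illustrative, not official identifiers.
scores = {
    "main": (88.48, 61.25),
    "code_execution": (95.00, 50.00),
    "material_constraint": (80.50, 75.00),
}


def deltas(table):
    """Return today's score minus yesterday's, rounded to 2 decimals."""
    return {name: round(today - prev, 2) for name, (prev, today) in table.items()}


changes = deltas(scores)
for name, change in changes.items():
    print(f"{name}: {change:+.2f}")
```

The code execution change is roughly eight times the size of the material constraint change, which is why the analysis focuses on code-related tasks.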
Sampling Fluctuation or Genuine Degradation?
The Smoke evaluation uses only 10 questions per day, 2 per dimension, so the sample is extremely small and large single-day swings are to be expected. Even so, the code execution dimension lost 45 points in one day, far more than the 5.5-point drop in material constraint, which indicates that the problem is concentrated in code-related tasks. Yesterday's 95 meant the model scored near-perfectly on both code questions, while today's 50 likely indicates severe errors or refusals on both.
Drawing high-difficulty code questions, even on two consecutive days, would naturally pull the score down without any change in the model. A genuine consistency problem in code generation, by contrast, only shows up over a longer observation window. A single day of data is therefore not enough to conclude that capability has degraded.
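To make the small-sample point concrete, the sketch below lists how likely each dimension score is when only two questions are drawn. It rests on a simplifying assumption that is not how the Smoke evaluation is described here: each question is graded pass/fail (100 or 0), independently, with a fixed pass probability. Real scoring evidently allows partial credit (yesterday's 95), so treat this as an illustration of the variance, not a model of the benchmark. The function name and the 0.8 pass rate are hypothetical.

```python
from math import comb


def score_distribution(p_pass, n_questions=2):
    """Probability of each possible dimension score.

    Assumes pass/fail grading (100 or 0 per question) and independent
    questions, which is a simplification of the real evaluation.
    """
    dist = {}
    for k in range(n_questions + 1):
        score = 100.0 * k / n_questions
        prob = comb(n_questions, k) * p_pass**k * (1 - p_pass) ** (n_questions - k)
        dist[score] = prob
    return dist


# Hypothetical model that solves 80% of code questions.
dist = score_distribution(0.8)
for score, prob in sorted(dist.items()):
    print(f"score {score:5.1f}: probability {prob:.2f}")
```

Under these assumptions, a model that solves 80% of code questions would still score 50 or lower on about 36% of days, so a single 50 says little on its own.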
Recent Industry Dynamics Impact
Baidu has recently shifted its update focus for ERNIE 4.5 toward multimodal understanding and long-text summarization, reducing the resources devoted to code capability optimization. Meanwhile, other domestic models are iterating faster on code benchmarks, which raises the difficulty baseline for comparable questions. The two side dimensions, engineering judgment and task expression, each rose by 20 points today, which suggests that the model's response strategy on non-code tasks may have been adjusted.
The integrity rating changed from warn to pass, indicating that the model's responses this time showed no significant hallucinations or compliance violations, with basic compliance actually improving.
Should We Pay Close Attention?
The current assessment is that this is most likely a sampling fluctuation, but the code execution dimension should be observed for 3–5 consecutive evaluation days. If it stays below 70 points throughout that window, a dedicated retest should be run to confirm whether training data or alignment strategies have been adjusted.
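The monitoring rule above can be written as a small check. This is a minimal sketch under stated assumptions: `needs_retest` is a hypothetical helper, the 70-point threshold and the 3-day minimum window come from the recommendation above, and every score in the example other than 95 and 50 is made up for illustration.

```python
def needs_retest(daily_scores, threshold=70.0, min_days=3):
    """Return True if the most recent `min_days` scores are all below `threshold`.

    `daily_scores` is ordered oldest to newest. With fewer than `min_days`
    observations the function returns False, since the window is incomplete.
    """
    if len(daily_scores) < min_days:
        return False
    recent = daily_scores[-min_days:]
    return all(score < threshold for score in recent)


# Only 95 and 50 are real data points; the rest are hypothetical.
print(needs_retest([95.0, 50.0]))              # window incomplete
print(needs_retest([95.0, 50.0, 60.0, 65.0]))  # three consecutive days below 70
print(needs_retest([50.0, 85.0, 60.0]))        # recovery on the second day
```

Requiring every day in the window to fall below the threshold keeps a single bad draw from triggering a retest, which matches the "consistently stays below" wording.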
The single-day 27.2-point fluctuation itself does not constitute evidence of model capability collapse, but it exposes the sensitivity of Smoke evaluation under small samples. Follow-up conclusions should be drawn based on weekly leaderboard data with larger samples.
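The sensitivity to sample size follows from the standard error of a mean, which shrinks with the square root of the number of questions. The sketch below assumes independent questions and a per-question score standard deviation of 40 points; that figure is purely hypothetical, as are the larger sample sizes, since the article does not state how many questions the weekly leaderboard uses.

```python
from math import sqrt


def standard_error(per_question_std, n_questions):
    """Standard error of a mean score over `n_questions` independent questions."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return per_question_std / sqrt(n_questions)


ASSUMED_STD = 40.0  # hypothetical per-question spread, in points

# 2 matches the daily per-dimension sample; 14 and 50 are illustrative only.
for n in (2, 14, 50):
    print(f"n = {n:2d}: standard error = {standard_error(ASSUMED_STD, n):.1f} points")
```

Whatever the true per-question spread is, moving from 2 questions to a few dozen cuts the noise several-fold, which is the reason for deferring conclusions to larger samples.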
A halving of code execution may just be the price of sampling, but continuous monitoring is the only standard for judging the true state of a model.
Data source: YZ Index | Run #130
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.