In today's Smoke quick test, 文心一言4.5's main leaderboard score fell from 74 to 62.96, a drop of about 11 points. Code execution collapsed from 100 to 50, while material constraints edged up by only 4.5 points. This is not a minor fluctuation but a clear cliff in one of the core auditable dimensions.
Sampling Fluctuation or True Degradation?
The Smoke evaluation uses only 10 questions per day (2 per main leaderboard dimension), so the sample is extremely small and day-to-day variance is naturally large. This time, the two code execution questions may have hit the model's weaker boundary cases, such as complex multi-file dependencies or specific library version conflicts, which would be enough to halve the score. The improvement in material constraints suggests the model has not regressed systematically in following citation constraints.
However, this cannot be attributed entirely to luck. The drop from a perfect score to the passing line, a difference of 50 points, far exceeds the normal range of sampling fluctuation. If the score stays in the 50-60 range for two or three consecutive days, it is more likely that a recent update traded away some capability on code paths.
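How far chance alone can move a two-question score depends on how the questions are graded, which is not specified here. The following is a rough sketch under stated assumptions: each question is graded pass/fail, questions are independent, and the model has a fixed "true" pass rate on code execution. The function name `prob_score_at_most` and the pass rates tried are hypothetical, chosen only for illustration.

```python
from math import comb


def prob_score_at_most(threshold, n_questions, pass_rate):
    """Probability that a day's score is <= threshold.

    Assumes (not stated in the source) pass/fail grading, independent
    questions, and score = 100 * passes / n_questions.
    """
    total = 0.0
    for passes in range(n_questions + 1):
        score = 100 * passes / n_questions
        if score <= threshold:
            # Binomial probability of exactly `passes` successes.
            total += (
                comb(n_questions, passes)
                * pass_rate ** passes
                * (1 - pass_rate) ** (n_questions - passes)
            )
    return total


# Hypothetical true pass rates for the code execution dimension.
for pass_rate in (0.95, 0.90, 0.80):
    one_day = prob_score_at_most(50, 2, pass_rate)
    three_days = one_day ** 3  # three independent days in a row
    print(f"pass rate {pass_rate:.2f}: one day {one_day:.4f}, three days {three_days:.6f}")
```

Whatever the single-day probability turns out to be, the three-day figure is its cube under these assumptions, which is why a low score that persists for several days is much harder to explain by sampling alone than a single bad day.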
Recent Industry Dynamics Comparison
At the end of March, Baidu reduced the inference cost of 文心一言4.5 by 30% and emphasized "more stable Chinese long texts." Cost optimization is often accompanied by changes in decoding strategy, where some high-difficulty code scenarios are sacrificed for average response speed. The timing is consistent with the code execution collapse. Meanwhile, domestic competitors DeepSeek-V3 and Qwen2.5-72B have been stepping up their efforts on code benchmarks, and Baidu may have temporarily shifted resources toward Chinese-language scenarios rather than code capabilities.
The integrity rating has changed from fail to pass, indicating that the model did not produce hallucinations or out-of-bounds content in this quick test. This is a positive signal.
Should This Be a Major Concern?
Single-day data is not enough to conclude that the model has degraded, but the code execution dimension directly affects real-world developer use cases. We recommend observing for at least three consecutive days. If this dimension cannot return above 80 points, 文心一言4.5 should be temporarily removed from the "all-round candidate" list, and alternative models with more stable code capabilities should be prioritized.
Of the two side leaderboard dimensions, engineering judgment and task expression, one fell and one rose this time, which further suggests that the model's performance is diverging across task types rather than declining overall.
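The watch rule above can be written down as a small check. This is only a sketch: the function name `should_delist`, the list-of-daily-scores input, and the reading of "cannot return above 80" as "no score above 80 in the last three days" are all assumptions made for illustration, not part of the YZ Index methodology.

```python
def should_delist(daily_scores, threshold=80, window=3):
    """Return True if no score in the last `window` days exceeded `threshold`.

    `daily_scores` is a hypothetical list of code execution scores,
    oldest first. With fewer than `window` days of data the rule
    makes no call and returns False.
    """
    if len(daily_scores) < window:
        return False
    recent = daily_scores[-window:]
    # A single day above the threshold counts as a recovery.
    return all(score <= threshold for score in recent)


# Illustrative score histories (made up, not measured data).
print(should_delist([100, 50]))          # too little data to decide
print(should_delist([100, 50, 55, 60]))  # three low days in a row
print(should_delist([50, 55, 85]))       # recovered on the last day
```

Requiring the full window before making a call keeps a single noisy day from triggering a removal.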
A code execution score of 50 is not the end, but if it stays in this range for three consecutive days, 文心一言4.5 will truly cede the developer user base to others.
Data source: YZ Index | Run #138
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original article when reposting.