文心一言4.5 Main Leaderboard Plunges 10.4 Points, Task Expression Dimension Halved from 90 to 46.3

In the June 2026 YZ Index test of 11 models, 文心一言4.5's Smoke Evaluation main leaderboard score dropped from 81.69 to 71.33 points today, a single-day decline of 10.4 points.

Dimension Breakdown: Two Main Leaderboard Indicators Decline Simultaneously

Code Execution dropped from 66.70 to 50.00 points, a decline of 16.7 points; Material Constraint dropped from 100.00 to 97.40 points, a decline of 2.6 points. Together, these two main leaderboard dimensions pulled the overall main leaderboard score down. Of the remaining dimensions, Engineering Judgment rose from 44.70 to 72.20 points, an increase of 27.5 points, while Task Expression fell from 90.00 to 46.30 points, a decline of 43.7 points.
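The changes quoted above can be rechecked with a few lines of Python. This is only a sketch: the scores are copied from the text, while the dictionary layout and the `deltas` helper are illustrative names, not part of YZ Index.

```python
# Scores as reported in the text: (previous, current).
scores = {
    "Main leaderboard": (81.69, 71.33),
    "Code Execution": (66.70, 50.00),
    "Material Constraint": (100.00, 97.40),
    "Engineering Judgment": (44.70, 72.20),
    "Task Expression": (90.00, 46.30),
}

def deltas(table):
    """Return the change (current minus previous) per entry, rounded to one decimal."""
    return {name: round(cur - prev, 1) for name, (prev, cur) in table.items()}

changes = deltas(scores)
for name, change in changes.items():
    print(f"{name}: {change:+.1f}")
```

A negative value is a decline and a positive value an increase, so the sign makes it easy to see that three entries fell while Engineering Judgment rose.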

Volatility Source Analysis

Smoke Evaluation has only 10 questions per day, 2 per dimension, so the random draw of questions on a given day has a large effect on scores. The simultaneous sharp declines in Code Execution and Task Expression are therefore more likely random fluctuation from question sampling than systematic degradation of model capability. Material Constraint holding at 97.40 points also supports this reading.
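To get a rough feel for how much a 2-question sample can swing, here is a minimal sketch under simplifying assumptions that are not taken from YZ Index: each question is treated as pass/fail, questions are independent, and the model passes each with a fixed probability `p`. The reported scores (such as 66.70) suggest the real scoring is graded, so this illustrates the sampling effect and does not model the benchmark.

```python
import math

def score_std(p, n_questions):
    """Std dev, in points on a 0-100 scale, of the mean of n independent pass/fail questions.

    ASSUMPTION: pass/fail scoring with a fixed pass probability p (hypothetical model).
    """
    return 100 * math.sqrt(p * (1 - p) / n_questions)

two_questions = score_std(0.7, 2)    # one dimension, 2 questions per day
ten_questions = score_std(0.7, 10)   # the whole 10-question daily set
print(f"2 questions: about {two_questions:.1f} points")
print(f"10 questions: about {ten_questions:.1f} points")
```

With an assumed pass probability of 0.7, the spread is around 32 points for a 2-question dimension and around 14 points for the full 10-question set. Under these assumptions, day-to-day swings of the size reported here are not surprising.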

Engineering Judgment rebounded sharply, and the integrity rating changed from warn to pass, which indicates that output stability and compliance on some side leaderboard dimensions did not deteriorate at the same time. Real degradation would usually show up as several dimensions weakening together, not as this mix of gains and losses.

Need for Continued Attention

Because a single-day, 10-question quick test is inherently volatile, a one-time drop of 10.4 points is not evidence of a cliff-like decline in model capability. It is recommended to watch whether Code Execution and Task Expression stay below 60 points in the Smoke data over the next 3-5 days. Only if they stay low for several consecutive days can we judge, together with formal evaluation data, whether there is real degradation.
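The watch rule above can be sketched as a small check. The function name `sustained_low`, the default of 3 consecutive days, and the sample series are illustrative assumptions; only the 60-point threshold and today's Code Execution score of 50.00 come from the text.

```python
def sustained_low(daily_scores, threshold=60.0, min_days=3):
    """Return True if the most recent `min_days` scores are all below `threshold`."""
    if len(daily_scores) < min_days:
        return False  # not enough history to call it a trend
    return all(score < threshold for score in daily_scores[-min_days:])

# Only the first value (50.00, today's Code Execution score) comes from the report;
# the later values are made-up placeholders for illustration.
rebound = [50.00, 70.0, 65.0]
stays_low = [50.00, 55.0, 48.0]

print(sustained_low([50.00]))    # False: a single day is not a trend
print(sustained_low(rebound))    # False: the score recovered
print(sustained_low(stays_low))  # True: three consecutive days below 60
```

Even a True result here is only a prompt to consult the formal evaluation data and is not a verdict on its own.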

For now, 文心一言4.5 is still within its normal fluctuation range, and there is no need to lower long-term expectations of its capability yet.

One unlucky draw does not equal model degradation; three consecutive days of low scores are the real signal.

Data source: YZ Index | Run #184