文心一言4.5 Smoke Main Ranking Plunges 22.2 Points, Code Execution Halved to 50 Points

In the June 2026 YZ Index testing of 11 models, 文心一言4.5 Smoke's main ranking score dropped from 93.25 to 71.02, a single-day decline of 22.2 points.

Core Data Breakdown

The code execution dimension fell from 94.10 to 50.00 points, a decline of 44.1 points. The material constraint dimension rose from 92.20 to 96.70 points, an increase of 4.5 points. The engineering judgment dimension dropped from 79.20 to 58.30 points, a decline of 20.9 points. The task expression dimension fell from 94.50 to 86.30 points, a decline of 8.2 points. The integrity rating remained pass.
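The per-dimension changes above are plain subtractions, and a short script makes them easy to recheck. The sketch below uses only the before/after figures reported in this article; the dictionary layout and the `deltas` helper are illustrative names, not part of the YZ Index tooling. It does not attempt to reproduce the main ranking score, because the weighting behind that score is not stated here.

```python
# Before/after scores exactly as reported in the article.
# Layout: name -> (previous score, current score).
scores = {
    "main ranking": (93.25, 71.02),
    "code execution": (94.10, 50.00),
    "material constraint": (92.20, 96.70),
    "engineering judgment": (79.20, 58.30),
    "task expression": (94.50, 86.30),
}


def deltas(table):
    """Return the change (current minus previous) for each entry, rounded to 2 decimals."""
    return {name: round(after - before, 2) for name, (before, after) in table.items()}


changes = deltas(scores)

# Print the largest declines first.
for name, change in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {change:+.2f}")
```

Sorting by change puts code execution and engineering judgment at the top of the list, matching the article's reading that these two dimensions drive the decline.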

Assessment of Fluctuation Sources

The Smoke evaluation uses only 10 questions per day, with 2 questions per dimension, so some single-day fluctuation from the question draw is normal. The 44.1-point decline in code execution, however, far exceeds the 4.5-point increase in material constraint, indicating that the anomaly is concentrated in code-related tasks. The concurrent 20.9-point decline in engineering judgment further points to instability in the model's performance on structured output and logical reasoning tasks.

Draw fluctuations and real model degradation must be distinguished. Single-day data cannot prove a permanent decline in model capability, but the 44.1-point drop in code execution is larger than the fluctuation normally attributed to the question draw, so it warrants continued observation over the following days.
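To get a rough feel for how noisy a two-question dimension can be, consider a deliberately simplified model. It assumes each question is scored as either 0 or 100, that questions are independent, and that the dimension score is their mean. None of this is confirmed by the YZ Index methodology, and the `draw_spread` helper and the 0.9 pass rate are hypothetical, so treat the numbers as an illustration only.

```python
import math


def draw_spread(pass_rate: float, n_questions: int = 2) -> dict:
    """Spread of a dimension score under an ASSUMED pass/fail model.

    Assumption (not from the source): each question scores 0 or 100
    independently, and the dimension score is the mean of the questions.
    """
    # Standard deviation of the mean of n Bernoulli trials, scaled to 0-100 points.
    std_points = 100 * math.sqrt(pass_rate * (1 - pass_rate) / n_questions)
    # Probability that at least one of the n questions is failed.
    p_any_fail = 1 - pass_rate ** n_questions
    return {
        "std_points": round(std_points, 1),
        "p_at_least_one_fail": round(p_any_fail, 3),
    }


# Hypothetical model that passes 90% of code questions on average.
spread = draw_spread(0.9)
print(spread)
```

Under these assumptions, a single failed question out of two already moves the dimension score by 50 points, which is why one day of Smoke data cannot settle the question on its own.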

Is This Cause for Concern?

文心一言4.5 Smoke's main ranking score of 71.02 is still higher than that of some competing models, but its code execution score of 50.00 is low. If this dimension does not recover above 80 points within the next three days, it will be worth examining whether the model has a systemic problem with code generation tasks. On single-day data alone, a draw fluctuation remains the more likely explanation, but continued monitoring is still recommended.
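The three-day rule described above can be written as a small check. This is a minimal sketch: the function name `needs_review`, its parameters, and the example scores are hypothetical placeholders, not data from the YZ Index.

```python
def needs_review(daily_scores, threshold=80.0, window=3):
    """Return True if the score never rises above `threshold` in the first `window` days.

    `daily_scores` lists the code execution score for each day after the drop,
    oldest first. Raises ValueError if fewer than `window` days are available,
    because the rule cannot be evaluated yet.
    """
    observed = list(daily_scores)[:window]
    if len(observed) < window:
        raise ValueError(f"need {window} days of data, got {len(observed)}")
    return all(score <= threshold for score in observed)


# Placeholder values for illustration only, not real measurements.
stays_low = needs_review([50.0, 62.5, 75.0])   # never above 80
recovers = needs_review([50.0, 85.0, 70.0])    # above 80 on the second day
print(stays_low, recovers)
```

Raising an error on incomplete data, rather than returning a verdict early, matches the article's point that the model needs the full three days before any conclusion is drawn.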

The material constraint dimension remains high at 96.70 points, indicating no degradation in the model's handling of citations and factual constraints. The overall decline in the main ranking is driven primarily by the code execution and engineering judgment dimensions.

The main ranking fell 22 points in a single day, with code execution nearly halved. 文心一言4.5 needs three consecutive days of data to prove itself.

Data source: YZ Index | Run #188