文心一言4.5 Fails Integrity Rating: Code Execution Surges 42.5 Points but Side Metrics Collapse

文心一言4.5 turned in a sharply split report card in today's Smoke quick test: the main score rose modestly, but the integrity rating flipped from pass to fail. The flip is not an isolated event; it reflects large swings across several dimensions at once.

Score Breakdown: Highlights and Collapses Coexist

The code execution dimension jumped from 50.00 to 92.50, a gain of 42.5 points, while material constraint dropped from 88.80 to 78.50, a loss of 10.3 points. On balance, the main score rose by a modest 6.54 points to 74.00. The side metrics, however, fell off a cliff: engineering judgment dropped from 66.70 to 30.00, and task expression from 50.00 to 10.00. The integrity rating shifted from pass to fail, meaning the model triggered at least one red-line violation in this 10-question test.
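The arithmetic behind these figures can be checked with a short sketch. The numbers are the ones quoted above; the dictionary layout and the `deltas` helper are illustrative names, not part of the Smoke tooling, and the previous main score is only inferred from "rose by 6.54 points to 74.00".

```python
# Scores quoted in the article, stored as (previous, current).
scores = {
    "code execution": (50.00, 92.50),
    "material constraint": (88.80, 78.50),
    "engineering judgment": (66.70, 30.00),
    "task expression": (50.00, 10.00),
}


def deltas(table):
    """Return the point change per dimension, rounded to 2 decimals."""
    return {name: round(cur - prev, 2) for name, (prev, cur) in table.items()}


changes = deltas(scores)
for name, change in changes.items():
    print(f"{name}: {change:+.2f}")

# Previous main score implied by the article (an inference, not a reported figure).
previous_main = round(74.00 - 6.54, 2)
print(f"implied previous main score: {previous_main:.2f}")
```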

Source of Volatility: Sampling or Degradation

The Smoke evaluation uses only 2 questions per dimension per day, so the sample is small and some daily fluctuation is normal. The size of this change, however, exceeds the usual random range. The sharp improvement in code execution may come from drawing relatively easy algorithmic questions, whereas the collapse in engineering judgment and task expression more likely reflects genuine instability in the model's capability. In particular, a move from pass to fail on the integrity rating typically indicates that the model refused to answer, hallucinated, or violated given constraints, none of which can be explained by question difficulty alone.
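One rough way to reason about "usual random range" is to compare an observed swing with the spread expected from sampling alone. The sketch below assumes that question scores are independent draws with a known per-question standard deviation. That value is not reported in the Smoke data, so `per_question_sd` is a hypothetical input, and the function names are illustrative.

```python
import math


def swing_std(per_question_sd, questions_per_day=2):
    """Std dev of the difference between two independent daily mean scores.

    Each daily mean has variance sd^2 / n, so the day-to-day difference
    has variance 2 * sd^2 / n.
    """
    return per_question_sd * math.sqrt(2 / questions_per_day)


def swing_in_sds(observed_swing, per_question_sd, questions_per_day=2):
    """Express an observed day-to-day swing in units of the expected spread."""
    return observed_swing / swing_std(per_question_sd, questions_per_day)


# Hypothetical example: a 40-point drop under an assumed per-question sd of 20.
print(swing_in_sds(40.0, per_question_sd=20.0))
```

Because the per-question spread is an assumption, the result only shows how the conclusion depends on it: the noisier individual questions are, the less a single large swing says on its own.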

Signals in the Industry Context

Baidu has recently been pushing to integrate search with AI, and 文心一言4.5 has just completed a round of tuning benchmarked against GPT-4o. In real-world deployments, however, users still report gaps in instruction following and multi-turn conversation stability. The side metrics collapse in this Smoke evaluation fits the industry observation that "larger models are more prone to consistency issues." By comparison, no other domestic model in similar quick tests over the same period has yet had its integrity rating drop to fail.

Should We Pay Close Attention?

Yes. The integrity rating is a threshold gate: a fail means the model carries safety and compliance risks in production environments. The modest gain in the main score cannot mask the steep drop in side metrics, which over time will erode developers' confidence in deploying the model. We recommend watching the Smoke data for another 3-5 days. If the integrity rating stays at fail or the side metrics stay low, the likely conclusion is genuine degradation rather than bad sampling luck.
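The observation rule above could be written down as a simple check. This is a minimal sketch under stated assumptions: the `DailyRun` record, the `looks_degraded` function, and the 40-point `low_threshold` are all hypothetical, since the article does not define what counts as "low", and "stay low" is read here as both side metrics sitting below the threshold on every observed day.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DailyRun:
    """One day's Smoke result (hypothetical record layout)."""
    integrity_pass: bool
    engineering_judgment: float
    task_expression: float


def looks_degraded(
    runs: Sequence[DailyRun],
    low_threshold: float = 40.0,  # assumed cutoff for a "low" side metric
    min_days: int = 3,            # lower bound of the 3-5 day window
) -> Optional[bool]:
    """Return None if there are too few days, else whether the pattern persists."""
    if len(runs) < min_days:
        return None
    # Integrity failed on every observed day.
    integrity_stuck = all(not r.integrity_pass for r in runs)
    # Both side metrics stayed below the threshold on every observed day.
    side_stuck = all(
        r.engineering_judgment < low_threshold and r.task_expression < low_threshold
        for r in runs
    )
    return integrity_stuck or side_stuck


# Example with today's figures repeated for three days (illustrative data only).
history = [DailyRun(False, 30.0, 10.0) for _ in range(3)]
print(looks_degraded(history))
```

Returning `None` for short histories keeps "not enough data yet" distinct from "no degradation seen".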

When a model trades 42.5 points for a Fail, what it truly loses is not the score, but the qualification to be trusted.

Data source: YZ Index | Run #124