文心一言4.5 Engineering Judgment Plunges from 50 to 10, Yet Main Rank Surges 14.5 Points

文心一言4.5 showed sharply divergent results in today's Smoke quick test: its engineering judgment score fell from 50 to 10 and task expression fell from 30 to 10, while material constraint rose from 55.8 to 80.5, lifting the main rank score from 74 to 88.48.

Sampling Fluctuation or Genuine Degradation

The Smoke evaluation samples only 10 questions per day, with just 2 questions per dimension, so the sample is very small and daily scores carry a large standard deviation. If the two questions drawn for engineering judgment or task expression happen to involve multi-step reasoning or strict output formats, a single misstep can cost the model most of its score on that dimension. Under those conditions a drop of 40 or 20 points is within the range that sampling noise alone can produce, and it cannot be read directly as model degradation.
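The effect of a 2-question sample can be sketched numerically. The snippet below is a minimal illustration, not part of the YZ Index methodology: it assumes each dimension score is the mean of independently drawn questions, and the per-question standard deviation of 30 points is a hypothetical value chosen only to show the scale of the effect.

```python
import math


def daily_mean_se(per_question_sd: float, n_questions: int) -> float:
    """Standard error of a daily score that averages n independent questions."""
    return per_question_sd / math.sqrt(n_questions)


def day_to_day_diff_sd(per_question_sd: float, n_questions: int) -> float:
    """Standard deviation of the difference between two independent daily scores."""
    return math.sqrt(2) * daily_mean_se(per_question_sd, n_questions)


# Hypothetical spread of per-question scores (not taken from the source).
ASSUMED_PER_QUESTION_SD = 30.0
QUESTIONS_PER_DIMENSION = 2  # stated in the article

diff_sd = day_to_day_diff_sd(ASSUMED_PER_QUESTION_SD, QUESTIONS_PER_DIMENSION)
for drop in (40, 20):
    print(f"{drop}-point drop = {drop / diff_sd:.2f} SDs of day-to-day noise")
```

Under these assumed numbers, both drops sit within roughly one and a half standard deviations of ordinary day-to-day variation. A different per-question spread would shift the figures, so the output is only a sense of scale.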

The large gain in material constraint, by contrast, probably reflects that today's sampled questions set clearer requirements for source citation and formatting, and that 文心一言4.5 did better at linking citations and matching numerical values. Because the main rank counts only the two auditable dimensions, code execution and material constraint, the 24.7-point rise in material constraint more than offset the modest 5-point drop in execution.
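The reported deltas can be checked for internal consistency. The sketch below assumes, purely for illustration, that the main rank is a fixed-weight linear blend of the two auditable dimensions; the article does not state the actual formula, and the 5-point execution drop may be a rounded figure, so the implied weight is only indicative.

```python
def implied_weight(delta_main: float, delta_material: float, delta_exec: float) -> float:
    """Solve delta_main = w * delta_material + (1 - w) * delta_exec for w.

    Assumes a linear two-component blend (an assumption, not the index's
    documented formula).
    """
    return (delta_main - delta_exec) / (delta_material - delta_exec)


# Deltas taken from the figures quoted in the article.
delta_material = round(80.5 - 55.8, 2)   # material constraint change
delta_main = round(88.48 - 74.0, 2)      # main rank change
delta_exec = -5.0                        # "modest 5-point decrease", possibly rounded

w = implied_weight(delta_main, delta_material, delta_exec)
print(f"material delta = {delta_material}, main delta = {delta_main}")
print(f"implied material-constraint weight under a linear blend: {w:.3f}")
```

If the blend were linear, the quoted numbers would imply that material constraint carries roughly two thirds of the main rank, which is at least consistent with the claim that its gain dominated the result.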

Impact of Recent Industry Developments

Over the past two weeks, Baidu has been concentrating resources on refining 文心一言's grounding capability in search scenarios, with internal beta versions specifically optimized for citation accuracy. This is consistent with today's higher material constraint score and suggests that the model is still being iterated on the auditable constraint dimensions.

Engineering judgment and task expression belong to the AI-assisted side rank, and Baidu has not publicly disclosed any specialized training logs for these two items. Since side rank questions are inherently more subjective, today's low scores are more likely the result of question sampling bias than of a change in the model's overall strategy.

Whether Immediate Attention Is Needed

There is no cause for immediate alarm. The integrity rating moving up from fail to warn is itself a positive signal: the model's baseline behavior in rejecting harmful requests and avoiding hallucinations has at least not deteriorated. A deep evaluation would be warranted only after three consecutive days of low scores on the same side rank dimension. For now, the single-day data remains within the range of sampling noise.

A better approach is to extend the observation window to at least 5 days of accumulated Smoke results, then combine them with weekly rank data to judge the real trend.
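The monitoring rule described above (escalate after three consecutive low days, and judge the trend on at least 5 days of data) could be expressed roughly as follows. This is a hypothetical sketch: the function names and the low-score threshold of 20 are assumptions, since the article does not define what counts as a low score.

```python
from statistics import mean


def needs_deep_eval(daily_scores, low_threshold, run_length=3):
    """Return True if any `run_length` consecutive days score below the threshold."""
    streak = 0
    for score in daily_scores:
        streak = streak + 1 if score < low_threshold else 0
        if streak >= run_length:
            return True
    return False


def window_mean(daily_scores, window=5):
    """Mean of the latest `window` days, or None if too few days have accumulated."""
    if len(daily_scores) < window:
        return None
    return mean(daily_scores[-window:])


# Engineering judgment so far: 50 previously, 10 today (figures from the article).
engineering_judgment = [50, 10]
LOW_THRESHOLD = 20  # hypothetical cutoff, not defined by the source

print(needs_deep_eval(engineering_judgment, LOW_THRESHOLD))
print(window_mean(engineering_judgment))
```

With only two data points, neither condition is met, which matches the recommendation to keep observing rather than escalate.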

A single-day collapse in the side rank does not amount to model degradation; the genuine improvement in material constraint is the most reliable signal for 文心一言4.5 at this moment.

Data source: YZ Index | Run #129