Claude Sonnet 4.6: Material Constraint Drops 22.6 Points While Code Execution Doubles

Claude Sonnet 4.6 diverged sharply in today's Smoke evaluation: the material constraint dimension fell from 81.00 to 58.40, a decline of 22.6 points, while the code execution dimension doubled from 50 to 100. The net effect lifted the main leaderboard score by about 17.3 points, from 63.95 to 81.28.

Source of Fluctuation: Sampling or Degradation?

The Smoke evaluation uses only 10 questions per day, 2 per dimension, so the sample is very small and single-day scores naturally have a large standard deviation. The drop in material constraint most likely reflects that today's questions were stricter on fact-checking and citation boundaries. The surge in code execution likewise points to a shift in the question difficulty distribution rather than the model suddenly "finding its groove." Only if material constraint stays below 60 for three consecutive days would there be reason to suspect that Anthropic's recent internal iterations have hurt long-context factual consistency.
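To put the sample-size argument in numbers, here is a minimal sketch of how much a 2-question daily mean can move by chance. The per-question standard deviation of 30 points is a hypothetical value chosen for illustration, not a figure published by the evaluation, and the sketch assumes questions and days are independent.

```python
import math

def standard_error(per_question_sd: float, n_questions: int) -> float:
    """Standard error of a daily mean computed from n independent questions."""
    return per_question_sd / math.sqrt(n_questions)

def day_to_day_difference_se(per_question_sd: float, n_questions: int) -> float:
    """Standard error of the difference between two independent daily means."""
    return math.sqrt(2) * standard_error(per_question_sd, n_questions)

# ASSUMPTION: 30 points of per-question spread on a 0-100 scale (hypothetical).
ASSUMED_PER_QUESTION_SD = 30.0
QUESTIONS_PER_DIMENSION = 2  # from the article: 10 questions, 2 per dimension

se = standard_error(ASSUMED_PER_QUESTION_SD, QUESTIONS_PER_DIMENSION)
diff_se = day_to_day_difference_se(ASSUMED_PER_QUESTION_SD, QUESTIONS_PER_DIMENSION)
observed_drop = 81.00 - 58.40  # material constraint, yesterday minus today

print(f"SE of one day's mean:       {se:.1f}")
print(f"SE of day-to-day change:    {diff_se:.1f}")
print(f"Observed drop:              {observed_drop:.1f}")
print(f"Drop within one SE of change: {observed_drop < diff_se}")
```

Under that assumed spread, a day-to-day change has a standard error of about 30 points, so a 22.6-point drop sits inside one standard error. A smaller true spread would make the same drop more notable, which is why the multi-day check matters.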

Recent Industry Dynamics Comparison

Anthropic has not released a new version in the Claude 4 series in the past two weeks, but according to reliable sources, the company is currently conducting safety alignment reinforcement training internally. Such training often trades some open-ended material citation capability for a lower hallucination rate. Today's task expression score (side leaderboard, AI-assisted evaluation) fell from 50 to 30, moving in the same direction as material constraint, which is consistent with the model becoming more conservative about output boundaries.

The move from 63.95 to 81.28 on the main leaderboard masks real risks.

Engineering judgment (side leaderboard, AI-assisted evaluation) held steady at 50, indicating that the model's decision logic in engineering scenarios was not significantly affected. The integrity rating remains "pass," ruling out cheating or data contamination.

Should We Pay Extra Attention?

A single-day drop of 22.6 points in material constraint is still within the historical range of fluctuation for the Smoke evaluation. We recommend watching the next 72 hours of data: if material constraint rebounds above 70 within the next two days, the drop can be treated as pure sampling noise; if it stays below 65 throughout, its weighting on the main leaderboard should be reduced in the weekly report. For now, there is no need to make drastic changes to how Claude Sonnet 4.6 is used.
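The follow-up rule above can be written as a small check. This is a sketch under stated assumptions: the function name and return labels are hypothetical, "rebounds above 70" is read as any follow-up day exceeding 70, "persistently below 65" is read as every follow-up day falling under 65, and the in-between case (not covered in the text) is treated as "keep observing."

```python
from typing import Sequence

def classify_material_constraint(
    follow_up_scores: Sequence[float],
    rebound_threshold: float = 70.0,   # from the text: rebound above 70
    concern_threshold: float = 65.0,   # from the text: persistently below 65
) -> str:
    """Classify a drop using the daily scores observed after it."""
    if not follow_up_scores:
        return "insufficient data"
    # Any day above the rebound threshold is read as a rebound.
    if any(score > rebound_threshold for score in follow_up_scores):
        return "sampling noise"
    # Every day below the concern threshold is read as a persistent drop.
    if all(score < concern_threshold for score in follow_up_scores):
        return "reduce weighting in weekly report"
    # ASSUMPTION: the article does not cover this middle band.
    return "keep observing"

# Hypothetical follow-up scores, made up purely to exercise each branch.
print(classify_material_constraint([72.0, 60.0]))
print(classify_material_constraint([58.4, 61.0]))
print(classify_material_constraint([66.0, 68.0]))
```

Keeping the thresholds as parameters makes it easy to tighten or loosen the rule once more history is available.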

Model capability has never been a straight line; it is a random walk with noise. Treating a single-day plunge as an alert rather than a conclusion is the right approach.


Data source: YZ Index | Run #128