Claude Sonnet 4.6 Material Constraint Plummets 22.6 Points While Code Execution Doubles

May 23, 2026 | Winzheng Index

Claude Sonnet 4.6 diverged sharply in today's Smoke evaluation: the material constraint dimension fell from 81.00 to 58.40, a decline of 22.6 points, while the code execution dimension doubled from 50 to 100. The net effect lifted the main leaderboard score by 17.3 points to 81.28.
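As a quick sanity check on the arithmetic, the reported before/after scores can be tabulated and differenced. This is only a sketch: the dictionary layout and the `delta` helper are illustrative, and the figures are the ones quoted in this article (including the previous main leaderboard score of 63.95 and the side leaderboard values reported further down).

```python
# Before/after scores as quoted in the article (previous run, today's run).
scores = {
    "material constraint": (81.00, 58.40),
    "code execution": (50.0, 100.0),
    "task expression (side)": (50.0, 30.0),
    "engineering judgment (side)": (50.0, 50.0),
    "main leaderboard": (63.95, 81.28),
}


def delta(before: float, after: float) -> float:
    """Day-over-day change, rounded to two decimals to hide float noise."""
    return round(after - before, 2)


deltas = {name: delta(before, after) for name, (before, after) in scores.items()}

for name, change in deltas.items():
    print(f"{name}: {change:+.2f}")
```

The main leaderboard change works out to 17.33, which the article rounds to 17.3.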

Source of Fluctuation: Sampling or Degradation?

The Smoke evaluation draws only 10 questions per day, 2 per dimension, so the sample is extremely small and single-day scores naturally carry a large standard deviation. The drop in material constraint is most likely because today's questions happened to be stricter on fact-checking and citation boundaries. The surge in code execution likewise points to a shift in question difficulty rather than the model suddenly "finding its groove." Only if material constraint stays below 60 for three consecutive days would there be reason to suspect that Anthropic's recent internal iterations have hurt long-context factual consistency.
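To make the small-sample argument concrete, here is a minimal sketch of how noisy a two-question daily mean can be. It assumes each question is scored independently on a 0 to 100 scale and that the per-question standard deviation is 25 points; that value is a placeholder chosen for illustration, not a figure published by the index.

```python
import math

# ASSUMPTION: per-question score standard deviation, in points.
# The article does not publish this; 25 is an illustrative placeholder.
ASSUMED_PER_QUESTION_SD = 25.0
QUESTIONS_PER_DIMENSION = 2  # stated in the article


def daily_mean_se(per_question_sd: float, n_questions: int) -> float:
    """Standard error of a daily dimension score (mean of n questions)."""
    return per_question_sd / math.sqrt(n_questions)


def day_over_day_sd(per_question_sd: float, n_questions: int) -> float:
    """SD of the difference between two independent daily means."""
    return math.sqrt(2) * daily_mean_se(per_question_sd, n_questions)


se = daily_mean_se(ASSUMED_PER_QUESTION_SD, QUESTIONS_PER_DIMENSION)
diff_sd = day_over_day_sd(ASSUMED_PER_QUESTION_SD, QUESTIONS_PER_DIMENSION)

# How many "typical swings" the observed 22.6-point drop represents.
z = 22.6 / diff_sd

print(f"SE of one day's score: {se:.1f} points")
print(f"SD of day-over-day change: {diff_sd:.1f} points")
print(f"Observed drop in SD units: {z:.2f}")
```

Under these assumptions a 22.6-point move is less than one standard deviation of the day-over-day change, which is why a single day's result says little on its own. A different assumed per-question spread would change the numbers but not the general point.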

Recent Industry Dynamics Comparison

Anthropic has not released a new version of the Claude 4 series in the past two weeks, but according to reliable sources it is currently running safety alignment reinforcement training internally. Such training often trades some open-ended material citation capability for lower hallucination rates. Today's task expression score (side leaderboard, AI-assisted evaluation) fell from 50 to 30, moving in the same direction as material constraint, which is consistent with the model becoming more conservative about its output boundaries.

The move from 63.95 to 81.28 on the main leaderboard masks real risks.

Engineering judgment (side leaderboard, AI-assisted evaluation) remained unchanged at 50, indicating that the model's decision logic in engineering scenarios was not significantly perturbed. The integrity rating remains pass, ruling out the possibility of cheating or data contamination.

Should We Pay Extra Attention?

A single-day drop of 22.6 points in material constraint is still within the historical fluctuation range of the Smoke evaluation. The recommendation is to watch the next 72 hours of data: if material constraint rebounds above 70 within the next two days, the drop can be treated as pure sampling noise; if it stays below 65, its weighting on the main leaderboard should be reduced in the weekly report. For now, there is no need for drastic changes to how Claude Sonnet 4.6 is used.
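The watch rule above can be written down as a small function. This is a sketch of one possible reading, not the index's own procedure: `classify_followup` and its parameter names are hypothetical, "rebounds" is interpreted as any follow-up day scoring above 70, and "persistently" as every follow-up day in the two-day window scoring below 65.

```python
from typing import Sequence


def classify_followup(
    followup_scores: Sequence[float],
    rebound_above: float = 70.0,  # threshold from the article
    stay_below: float = 65.0,     # threshold from the article
    window: int = 2,              # follow-up days in the 72-hour watch
) -> str:
    """Classify material-constraint scores from the days after the drop."""
    recent = list(followup_scores)[:window]

    # Any rebound above the upper threshold is treated as sampling noise.
    if any(score > rebound_above for score in recent):
        return "sampling noise"

    # Without a full window there is not enough data to call it persistent.
    if len(recent) < window:
        return "keep observing"

    # Every day in the window below the lower threshold: act on it.
    if all(score < stay_below for score in recent):
        return "reduce weighting"

    # Scores stuck between the two thresholds remain ambiguous.
    return "keep observing"


print(classify_followup([72.0]))        # sampling noise
print(classify_followup([60.0, 62.0]))  # reduce weighting
print(classify_followup([66.0, 68.0]))  # keep observing
```

Scores that land between 65 and 70 are not covered by the rule as stated, so the sketch leaves them as "keep observing" rather than forcing a verdict.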

Model capability has never been a straight line, but a random walk with noise. Treating a single-day plunge as an alert, rather than a conclusion, is the correct approach.

Data source: YZ Index | Run #128
