In today's Smoke evaluation, Claude Sonnet 4.6's Material Constraint score fell from 96.50 to 74.50, a single-day decline of 22 points.
Data Breakdown: A Slight Overall Decline Masks a Local Collapse
The overall score slipped only from 90.56 to 88.53, a seemingly mild drop of about 2 points. Breaking down the two core dimensions, however, reveals a sharp divergence: Code Execution jumped from 85.70 to a perfect 100, while Material Constraint plummeted. Engineering Judgment rose by 8.3 points, Task Expression was unchanged at 30 points, and the Integrity rating remains a pass.
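The day-over-day changes quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures given in the text; the dictionary layout and the `delta` helper are illustrative names, not part of the YZ Index tooling.

```python
# Scores quoted in the text, as (previous run, today's run).
scores = {
    "Overall": (90.56, 88.53),
    "Code Execution": (85.70, 100.00),
    "Material Constraint": (96.50, 74.50),
}


def delta(previous: float, current: float) -> float:
    """Day-over-day change, rounded to two decimals."""
    return round(current - previous, 2)


# Hypothetical helper output: one signed change per dimension.
changes = {name: delta(prev, cur) for name, (prev, cur) in scores.items()}

for name, change in changes.items():
    print(f"{name:<20} {change:+.2f}")
```

The 22-point fall in one dimension sits next to a 14.3-point gain in another, which is why the overall score moves by only about 2 points.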
The Smoke evaluation has only 10 questions per day, 2 per dimension, so the sample is very small. A 22-point single-day swing is therefore not unusual in itself; the question is whether the decline exceeds the normal range of sampling variation.
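To see how coarse a two-question dimension is, consider the sketch below. It assumes the dimension score is the unweighted mean of its per-question scores on a 0-100 scale; the article does not state the actual aggregation formula, and the per-question values used here are hypothetical, chosen only to reproduce the published dimension scores.

```python
QUESTIONS_PER_DIMENSION = 2  # from the text: 10 questions per day, 2 per dimension


def dimension_score(question_scores: list[float]) -> float:
    """Assumed aggregation: unweighted mean of per-question scores (0-100)."""
    return sum(question_scores) / len(question_scores)


def single_question_loss(dimension_drop: float,
                         n_questions: int = QUESTIONS_PER_DIMENSION) -> float:
    """Points one question alone must lose to cause the given dimension drop."""
    return dimension_drop * n_questions


loss = single_question_loss(22.0)

# Hypothetical per-question scores, not taken from the raw data.
before = [96.5, 96.5]
after = [96.5, 96.5 - loss]

print(loss, dimension_score(before), dimension_score(after))
```

Under this assumption, a single weak answer is enough to produce the whole 22-point move, which is why one day's result cannot separate noise from degradation.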
Fluctuation or Degradation: Probabilistic Assessment of Two Explanations
The first possibility is question sampling bias. If the day's Material Constraint questions both happen to involve scenarios that require strict citation of source documents and refusal to over-generate, and the model hallucinates or over-polishes on one of them, the dimension's score drops immediately. Historical data from multiple periods shows that Claude is typically stable at 90+ on Material Constraint, so this 74.5 is close to a historical low.
The second possibility is a genuine capability change. Anthropic has recently conducted multiple rounds of safety and alignment fine-tuning on the Claude 4 series, focusing on strengthening "rejection of unreasonable requests" and "avoiding overconfidence." Such adjustments can sometimes make the model conservative or evasive on tasks requiring precise citation and strict boundary judgment, leading to a drop in Material Constraint scores.
Given industry developments over the past two weeks, the second explanation carries more weight. Since Claude Sonnet 4.6 was released, users have reported occasional "over-cautiousness" in long-context citation tasks, which is consistent with the direction of this Material Constraint drop.
Should It Be a Major Concern?
For now this is still a single-day signal, insufficient to establish systematic degradation of the model. However, if Material Constraint remains below 85 points over the next three daily evaluation runs, continuous tracking should be initiated. Code Execution hitting a perfect score indicates that the model's underlying reasoning ability is not impaired; the problem is concentrated in the specific constraint of "material usage discipline."
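The escalation rule described above (Material Constraint below 85 for three runs in a row) can be expressed as a small check. This is a sketch only: the function name and the example score sequences are hypothetical, and only the threshold of 85 and the window of three come from the text.

```python
THRESHOLD = 85.0  # from the text: "below 85 points"
WINDOW = 3        # from the text: the next three runs


def should_start_tracking(recent_scores, threshold=THRESHOLD, window=WINDOW):
    """Return True if the last `window` scores are all below `threshold`."""
    if len(recent_scores) < window:
        return False  # not enough runs yet to apply the rule
    return all(score < threshold for score in recent_scores[-window:])


# Today's single data point cannot trigger the rule on its own.
print(should_start_tracking([74.5]))

# Hypothetical follow-up sequences, for illustration only.
print(should_start_tracking([74.5, 80.0, 83.0]))
print(should_start_tracking([74.5, 88.0, 80.0]))
```

Requiring every run in the window to fall below the threshold means a single rebound resets the count, which matches the article's framing of a one-day dip as possible noise.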
For teams relying on Claude for research reports, legal documents, or technical document generation, this signal is worth noting.
A 22-point Material Constraint plunge could be just sampling noise; if it recurs on consecutive days, it may be a real manifestation of alignment cost.
Data source: YZ Index | Run #132
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.