Claude Sonnet 4.6 Material Constraint Plunges 22 Points, Code Execution Hits 100

May 26, 2026 | Winzheng Index

Tags: Claude Sonnet 4.6, Material Constraints, Smoke Test, Single-Day Fluctuation, Model Degradation

In today's Smoke evaluation, Claude Sonnet 4.6's Material Constraint score fell from 96.50 to 74.50, a single-day decline of 22 points.

Data Breakdown: Slight Overall Score Decline Masking Local Collapse

The overall score slipped only from 90.56 to 88.53, a seemingly mild drop of about 2 points. Breaking down the two core dimensions, however, reveals a sharp divergence: Code Execution jumped from 85.70 to a perfect 100, while Material Constraint plummeted from 96.50 to 74.50. Engineering Judgment rose by 8.3 points, and Task Expression was unchanged at 30 points. The Integrity rating remains a pass.
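The divergence is easier to see when the reported before and after scores are placed side by side. The snippet below is only an arithmetic sketch over the three score pairs quoted in this article; the dictionary layout and the `deltas` helper are illustrative names, not part of the index's tooling, and no assumption is made about how the overall score is weighted.

```python
# Scores quoted in the article: (previous day, today)
reported = {
    "Overall": (90.56, 88.53),
    "Material Constraint": (96.50, 74.50),
    "Code Execution": (85.70, 100.00),
}


def deltas(scores):
    """Return today's score minus the previous day's score, rounded to 2 decimals."""
    return {name: round(today - prev, 2) for name, (prev, today) in scores.items()}


changes = deltas(reported)
for name, change in changes.items():
    print(f"{name}: {change:+.2f}")
```

The two dimension moves have opposite signs, which is why the overall change looks so much smaller than either of them.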

The Smoke evaluation has only 10 questions per day, 2 per dimension, so the sample is very small. At that size, a 22-point single-day swing is not unusual in itself. The question is whether the decline exceeds the normal sampling range.
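To illustrate how much a two-question sample amplifies a single miss, here is a minimal sketch. It assumes a dimension score is the plain mean of its per-question scores on a 0-100 scale, which the article does not state. The per-question values and the 20-question comparison are hypothetical, chosen only so that the means match the reported 96.50 and 74.50.

```python
def dimension_score(question_scores):
    """Assumed aggregation: plain mean of per-question scores (0-100)."""
    return sum(question_scores) / len(question_scores)


def mean_shift(single_question_drop, n_questions):
    """How far the dimension mean moves when one question drops by the given amount."""
    return single_question_drop / n_questions


# Hypothetical per-question scores that reproduce the two reported dimension scores.
previous_day = [100, 93]   # mean 96.5
today = [100, 49]          # mean 74.5

print(dimension_score(previous_day), dimension_score(today))
print(mean_shift(44, 2))    # one question falling 44 points, 2 questions per dimension
print(mean_shift(44, 20))   # same miss with a hypothetical 20 questions per dimension
```

Under this assumption, one badly answered question is enough to produce the whole 22-point move, whereas the same miss spread over 20 questions would shift the dimension by only about 2 points.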

Fluctuation or Degradation: Probabilistic Assessment of Two Explanations

The first possibility is question sampling bias. If the day's Material Constraint questions both involve scenarios requiring strict citation of original documents and rejection of over-generation, and the model hallucinates or over-polishes on just one of them, the dimension's score drops directly. Historical data across multiple runs shows Claude typically holding at 90+ on Material Constraint, so this 74.5 is close to a historical low.

The second possibility is a genuine capability change. Anthropic has recently conducted multiple rounds of safety and alignment fine-tuning on the Claude 4 series, focusing on strengthening "rejection of unreasonable requests" and "avoiding overconfidence." Such adjustments can sometimes make the model more conservative or evasive on tasks requiring precise citation and strict boundary judgment, which would lower Material Constraint scores.

Considering industry developments in the past two weeks, the second explanation carries more weight. After Claude Sonnet 4.6 was released, users reported occasional "over-cautiousness" in long-context citation tasks, consistent with the direction of this Material Constraint collapse.

Should It Be a Major Concern?

For now this is a single-day signal, not enough to conclude that the model has degraded systematically. However, if Material Constraint stays below 85 points for the next three evaluation days, continuous tracking should begin. The perfect Code Execution score indicates that the model's underlying reasoning ability is not impaired; the problem is concentrated on the specific constraint of "material usage discipline."
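The escalation rule described here (three consecutive daily runs below 85) can be written as a small check. This is a sketch of one reading of that rule, not the index's actual alerting logic. The function name and the sample score histories are hypothetical; only the 85-point threshold, the three-day window, and the 74.5 reading come from the article.

```python
def should_start_tracking(daily_scores, threshold=85.0, consecutive_days=3):
    """Return True if the most recent `consecutive_days` scores are all below `threshold`."""
    if len(daily_scores) < consecutive_days:
        return False  # not enough history yet to apply the rule
    recent = daily_scores[-consecutive_days:]
    return all(score < threshold for score in recent)


# Hypothetical Material Constraint histories, oldest first.
single_day = [74.5]                 # only today's reading so far
sustained_low = [74.5, 80.0, 84.9]  # three low readings in a row
rebound = [74.5, 90.0, 80.0]        # a recovery in the middle breaks the streak

print(should_start_tracking(single_day))
print(should_start_tracking(sustained_low))
print(should_start_tracking(rebound))
```

Requiring a consecutive streak rather than a single low reading is what keeps a two-question sampling miss from triggering the alert on its own.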

For teams relying on Claude for research reports, legal documents, or technical document generation, this signal is worth noting.

A 22-point Material Constraint plunge could be just sampling noise; if it recurs on consecutive days, it may be a real manifestation of alignment cost.

Data source: YZ Index | Run #132
