Claude Sonnet 4.6 dropped 12.3 points on the main leaderboard as material constraint plummeted 27.3 points in a single day

Claude Sonnet 4.6 showed clear anomalies in today's Smoke rapid test, with the main leaderboard score dropping 12.3 points overall. The core reason was the material constraint dimension falling from 96.30 to 69.00, a drop of 27.3 points. The code execution dimension held its perfect score of 100, engineering judgment rebounded slightly by 8.4 points to 38.40, and task expression stayed flat at 30 points.

Why did the material constraint dimension fluctuate so drastically?

The Smoke evaluation includes only 10 questions per day, with just 2 drawn for the material constraint dimension. An error on a single question can therefore swing that dimension's score by more than 30 points, so the drop alone does not prove the model's capabilities have degraded. Still, 27.3 points exceeds the normal sampling range and is worth tracking across three consecutive days of data.
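To see why a 2-question sample is this volatile, consider a minimal sketch that assumes the dimension score is the mean of per-question scores on a 0–100 scale (the exact scoring rule is not public; the per-question splits below are hypothetical numbers chosen only to match the reported totals):

```python
def dimension_score(question_scores):
    """Mean of per-question scores (0-100 scale), rounded to two decimals."""
    return round(sum(question_scores) / len(question_scores), 2)

# Hypothetical splits: two strong answers yesterday vs. one largely
# failed answer today. A single bad question moves the mean ~27 points.
yesterday = dimension_score([96.3, 96.3])  # -> 96.3
today = dimension_score([96.3, 41.7])      # -> 69.0
print(yesterday, today, round(yesterday - today, 2))
```

With only two samples, each question carries a 50% weight, so the standard error of a daily dimension score is inherently large; this is why the article treats one day's drop as a signal to watch rather than a verdict.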

The material constraint dimension primarily examines the model's adherence to given documents and instruction boundaries. Yesterday's high score indicated the model could answer strictly on the basis of the materials, while today's low score may stem from complex nested constraints or counterfactual materials in the two sampled questions, causing the model to over-infer or miss key restrictions.

Recent industry dynamics and model update correlation

Anthropic performed small-scale alignment fine-tuning on the Claude series over the past two weeks, focusing on enhancing "helpfulness" and "concise responses." Some developers reported that the new version is more willing to supplement external knowledge in open-ended questions, which potentially conflicts with the material constraint requirement of staying "strictly limited to given materials."

If the weight adjustments in this fine-tuning affected the model's sensitivity to instruction boundaries, issues could surface in high-constraint evaluations like Smoke. The code execution dimension still holds a perfect score, indicating that basic reasoning ability is unaffected; the problem is concentrated in the boundary judgment of when to cite materials strictly and when to expand beyond them.

Should we continue to monitor?

This drop is a signal worth attention. Material constraint is one of the two auditable dimensions of the main leaderboard, and its stability directly affects the model's usability in high-constraint scenarios such as enterprise RAG and contract review. If this dimension remains below 80 points for the next three days, it can be judged as systematic degradation rather than sampling noise.
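The three-day rule above can be sketched as a simple check: treat a sustained run below 80 as systematic degradation, and anything shorter as possible sampling noise. The function name and defaults are assumptions for illustration, not part of the YZ Index tooling:

```python
def is_systematic_degradation(daily_scores, threshold=80.0, run_length=3):
    """True only if the most recent `run_length` daily scores
    are all below `threshold`; shorter runs stay inconclusive."""
    if len(daily_scores) < run_length:
        return False
    return all(score < threshold for score in daily_scores[-run_length:])

print(is_systematic_degradation([96.3, 69.0]))        # too early to call
print(is_systematic_degradation([69.0, 74.5, 78.0]))  # three days below 80
```

The design choice here is deliberate asymmetry: a single low day never triggers the flag, which matches the article's point that a 2-question sample cannot distinguish degradation from noise on its own.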

Currently, the integrity rating remains pass, indicating that the model has not exhibited severe issues like refusing to answer or fabricating facts, only a decline in constraint adherence. It is recommended to increase the number of material constraint questions in the next full evaluation to reduce the impact of single-day fluctuations.

The slight improvement in engineering judgment is also consistent with the model leaning toward "actively supplementing information," which can be an advantage in creative tasks but becomes a deduction in strictly material-driven tasks.

When models start to waver between "obedience" and "intelligence," material constraint scores are often the first to sound the alarm.

Data source: YZ Index | Run #119