Claude Opus 4.7 Material Constraint Drops 15 Points in a Single Day: Smoke Test Fluctuation or True Degradation?

In today's Smoke evaluation, which uses only 10 questions, Claude Opus 4.7's material constraint score fell 15 points, from 74.50 to 59.50, and its main leaderboard score declined 6.8 points to 81.78. Code execution remains at a perfect 100, engineering judgment at 66.70, and task expression at 30.00, all unchanged from the previous run. The integrity rating remains "warn."

The Boundary Between Sampling Fluctuation and True Degradation

The Smoke evaluation uses only 2 questions per dimension per day, so the sample is extremely small and a 15-point swing in a single day is not unusual in itself. The key question is whether today's material constraint losses are concentrated on specific constraint types. Historical data shows that this model tends to lose points on questions that require strict adherence to multiple material boundaries and refusal of requests that implicitly cross them. If today's two questions happened to be of this high-difficulty type, sampling fluctuation alone could explain the entire 15-point drop.

However, if the score loss is evenly distributed and the error pattern is consistent with yesterday's, actual capability drift after alignment training becomes a possibility worth watching. One day of data is not enough to establish systematic degradation.
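The sampling argument above can be made concrete with a little arithmetic. The sketch below assumes the dimension score is the plain mean of per-question scores on a 0-100 scale; the article does not state how the YZ Index aggregates questions, so that assumption and the function name `per_question_swing` are purely illustrative.

```python
def per_question_swing(dimension_delta, n_questions=2):
    """Points one question must move to shift the dimension mean by dimension_delta.

    Assumes (hypothetically) that the dimension score is the unweighted mean
    of n_questions per-question scores, all others held fixed.
    """
    return dimension_delta * n_questions


# Material constraint fell from 74.50 to 59.50 with 2 questions per dimension.
delta = 74.50 - 59.50
print(per_question_swing(delta))  # 30.0
```

Under that assumption, a 30-point change on a single question is enough to produce the whole 15-point drop, which is why one hard question can dominate a day's result.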

Cross-Validation with Recent Industry Dynamics

Anthropic completed a round of alignment fine-tuning for the Claude 4 series in the past two weeks, with a focus on strengthening the "minimum necessary information" principle. Such adjustments often make the model more conservative on material constraint questions, occasionally leading to excessive refusal or partial answers being judged as incomplete by the system. The Opus 4.7 version number suggests it may already carry the latest fine-tuning weights, and today's performance aligns closely with the timeline of this adjustment.

Meanwhile, competitors Grok and Gemini have recently posted slight score increases on similar constraint tasks, which makes Claude's decline stand out in relative terms.

Whether to Pay Close Attention

A single day of data is not enough to trigger an alert. If the material constraint score stays below 65 for two consecutive days, a 3-day rolling observation window should be opened. If the score is still low on the third day and the errors are concentrated in the same constraint subclass, the drop can be provisionally attributed to a post-fine-tuning capability shift rather than random fluctuation.
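The escalation rule described above could be sketched as follows. This is a minimal illustration, not the YZ Index's actual alerting logic: the function name `alert_state`, the state labels, and the `same_subclass_errors` flag (standing in for a manual or automated check that errors cluster in one constraint subclass) are all assumptions.

```python
def alert_state(daily_scores, threshold=65.0, same_subclass_errors=False):
    """Classify the latest material constraint trend.

    daily_scores: daily dimension scores, oldest first, most recent last.
    Returns "normal", "watch" (3-day observation window open), or
    "probable_shift" (provisional post-fine-tuning capability shift).
    """
    last_three = daily_scores[-3:]
    last_two = daily_scores[-2:]

    # Three consecutive low days plus concentrated errors: provisional shift.
    if (len(last_three) == 3
            and all(s < threshold for s in last_three)
            and same_subclass_errors):
        return "probable_shift"

    # Two consecutive low days: open the rolling observation window.
    if len(last_two) == 2 and all(s < threshold for s in last_two):
        return "watch"

    # Anything else, including a single low day, is regular tracking.
    return "normal"


# Today's situation: one day below 65 after a day above it.
print(alert_state([74.50, 59.50]))  # normal
```

With only today's drop in the history, the rule returns "normal", which matches the recommendation to keep regular tracking for now.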

The most reasonable judgment at present is to maintain regular tracking and not immediately issue a model degradation warning.

A 15-point material constraint drop on a 10-question sample is more likely sampling noise than real change, but if it recurs on consecutive days, it is worth asking whether Anthropic's "minimum necessary" fine-tuning has made the model too sensitive around constraint boundaries.

Data Source: YZ Index | Run #134