Claude Opus 4.7 Drops 17.6 Points in Material Constraint but Gains 11.9 Points in Code Execution

In today's Smoke evaluation, Claude Opus 4.7's Material Constraint score fell 17.6 points, from 98.3 to 80.7, and its main leaderboard score slipped from 65.19 to 63.82. Meanwhile, the same model's Code Execution score rose from 38.1 to 50.0, and Task Expression jumped from 30.0 to 50.0. The contrast raises a question: is this the luck of the question draw, or has something changed in the model itself?

Question fluctuation or genuine degradation?

The Smoke evaluation runs only 10 questions per day, 2 per dimension, so the sample is extremely small and single-day scores naturally swing widely. The Material Constraint dimension primarily tests how strictly a model sticks to the materials it is given. If the day's questions happen to require quoting the source text verbatim or rejecting outside knowledge, the model can be heavily penalized for any added explanation or unrequested supplementation. Today's 80.7, against yesterday's 98.3, is more likely the result of drawing difficult constraint questions than of the model suddenly "forgetting" how to follow instructions.
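To make the sample-size point concrete, here is a minimal sketch of how much a two-question daily average can move by chance alone. Everything in it is an assumption for illustration: the function names are hypothetical, the per-question standard deviation of 15 points is invented, questions are treated as independent and normally distributed, and scores are not clipped to the 0-100 range. It is not how the YZ Index computes anything.

```python
import math
import random
import statistics


def standard_error(per_question_sd: float, n_questions: int) -> float:
    """Standard deviation of the mean of n independent question scores."""
    return per_question_sd / math.sqrt(n_questions)


def simulate_daily_scores(true_mean: float, per_question_sd: float,
                          n_questions: int, n_days: int, seed: int = 0) -> list:
    """Simulate daily dimension scores as the mean of n noisy question scores.

    Assumption: question scores are independent draws from a normal
    distribution; real scores are bounded, which this sketch ignores.
    """
    rng = random.Random(seed)
    days = []
    for _ in range(n_days):
        questions = [rng.gauss(true_mean, per_question_sd)
                     for _ in range(n_questions)]
        days.append(statistics.fmean(questions))
    return days


if __name__ == "__main__":
    # Hypothetical per-question spread of 15 points.
    for n in (2, 10, 50):
        print(n, "questions per dimension ->", round(standard_error(15.0, n), 2))
```

Under these assumed numbers, two questions give a daily standard deviation of about 10.6 points, and the gap between two independent days has a standard deviation of about 15, so a 17.6-point day-over-day change would not be unusual on its own.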

However, luck may not be the whole story. The concurrent 11.9-point rise in Code Execution suggests the model is, if anything, steadier on structured output and logical reasoning. That the two capabilities moved in opposite directions points to another plausible explanation: Anthropic may have recently applied a narrow round of preference alignment or safety reinforcement to Opus 4.7, shifting the trade-off the model strikes between "strictly adhering to materials" and "proactively completing information."

Corroboration from industry dynamics

This month, Anthropic applied safety fine-tuning to the Claude series, with a heightened focus on the ability to "refuse unsafe or boundary-crossing requests." Such adjustments often make a model more cautious in the Material Constraint dimension: faced with ambiguous instructions, it opts for conservative answers, which lowers its score. At the same time, Anthropic continues to optimize code-related capabilities; the Code Execution baseline of version 4.7 was already higher than its predecessor's, and today's 50.0 is likely closer to its true level.

The two secondary dimensions, Engineering Judgment and Task Expression, also moved in opposite directions on the same day, which is consistent with a targeted adjustment: the model has been recalibrated between being "obedient" and being "smart."

Does it warrant close attention?

A single-day drop of 17.6 points is anomalous for a quick test, but it is not yet conclusive evidence of model degradation. It is advisable to track the median of the same dimension over 3–5 consecutive days. Only if the Material Constraint score stays persistently below 85, and other dimensions decline at the same time, should this be judged a genuine capability regression. For now, it is more likely a side effect of Anthropic's safety iteration and remains within a manageable range.
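The monitoring rule above can be sketched as a small check. This is one possible reading, not the YZ Index's own method: the function name, the input layout, and the choice to read "persistently" as "every day in the window" and "synchronous declines" as "at least one other dimension's median is below its earlier baseline" are all assumptions. Any scores beyond those quoted in this article are made up for the example.

```python
from statistics import median


def looks_like_regression(recent, baseline, target="Material Constraint",
                          floor=85.0, min_other_declines=1):
    """Apply the 3-5 day rule to per-dimension daily scores.

    recent:   dict mapping dimension name -> list of 3 to 5 consecutive daily scores
    baseline: dict mapping dimension name -> earlier reference score
    Assumptions: "persistently below" means every day in the window is below
    the floor; a dimension "declines" when its window median is below baseline.
    """
    scores = recent[target]
    if not 3 <= len(scores) <= 5:
        raise ValueError("expected 3 to 5 consecutive daily scores")

    persistently_low = all(score < floor for score in scores)
    other_declines = sum(
        1 for dim, values in recent.items()
        if dim != target and median(values) < baseline[dim]
    )
    return persistently_low and other_declines >= min_other_declines


if __name__ == "__main__":
    # Only 80.7, 98.3, 50.0 and 38.1 come from the article; the rest is hypothetical.
    recent = {
        "Material Constraint": [80.7, 82.0, 84.0],
        "Code Execution": [50.0, 48.0, 52.0],
    }
    baseline = {"Material Constraint": 98.3, "Code Execution": 38.1}
    print(looks_like_regression(recent, baseline))
```

In this hypothetical window, Material Constraint stays below 85 but Code Execution sits above its baseline, so the check does not flag a regression.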

For applications that depend on Material Constraint, developers should add an explicit instruction such as "use only the given materials" to the prompt, narrowing the model's room for autonomous extrapolation.
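As a minimal sketch of that advice, the helper below wraps source materials and a question in a prompt that states the constraint explicitly. The function name, the wording of the instruction, and the `<materials>` delimiters are illustrative assumptions; the code only builds a string and calls no model API.

```python
def build_grounded_prompt(materials: str, question: str) -> str:
    """Build a prompt that restricts the answer to the supplied materials.

    Hypothetical template: the exact wording and delimiters are assumptions
    and may need adjusting for a given application.
    """
    return (
        "Use only the given materials to answer. "
        "Do not add outside knowledge. "
        "If the materials do not contain the answer, say so.\n\n"
        f"<materials>\n{materials.strip()}\n</materials>\n\n"
        f"Question: {question.strip()}"
    )


if __name__ == "__main__":
    print(build_grounded_prompt(
        "The warranty covers parts for 12 months.",
        "How long does the warranty cover parts?",
    ))
```

Placing the constraint before the materials and asking for an explicit "not in the materials" answer gives the model a permitted alternative to filling gaps on its own.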

A sharp drop in a quick test often reveals not that the model has collapsed, but that the training objective has quietly shifted direction.

Data source: YZ Index | Run #127