Claude Opus 4.7 scored 88.53 on the main list in today's Smoke review, down 8.2 points from yesterday, a decline in the abnormal range for the daily ten-question quick test. Most of the loss came from the material constraint dimension, which fell from 92.80 to 74.50, a single-day drop of 18.3 points.
Sampling Fluctuation or Real Degradation
The Smoke review tests only two material constraint questions per day, so the sample is very small; in theory, the single-day standard deviation can reach 12-15 points. Even so, the 18.3-point drop falls outside the historical 95% confidence interval. Yesterday both answers stayed strictly within the material boundaries, but today at least one answer clearly overstepped them or over-generated: despite the explicit instruction to "only use the given table data," the model still brought in outside common knowledge to supplement its answer.
The engineering judgment dimension actually rose from 58.40 to 66.70, indicating that the model has not weakened overall in scenarios requiring trade-offs. This is further evidence that the problem is concentrated in the single capability of "answering strictly according to the material," rather than reflecting a general degradation in reasoning.
Recent Industry Developments Align with the Timeline
Over the past two weeks, Anthropic has made minor iterations to the API security policy for the Claude series, focusing on strengthening "avoiding the generation of content that could be used to circumvent restrictions." This adjustment may have inadvertently heightened the model's sensitivity to "material constraint" instructions, leading it to be overly conservative or to misjudge where the boundary lies. This fits with today's integrity rating changing from pass to warn: the system detected that, on at least one question, the model gave an answer that was internally consistent but inconsistent with the supplied material, which triggered an integrity flag.
In similar daily quick tests for comparable models, only three cases have had a single-day fluctuation exceeding 15 points in the material constraint dimension, all accompanied by API-side policy updates. Claude Opus 4.7's performance this time is highly similar to those three.
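The 15-point criterion above can be applied to today's numbers with a short check. This is a minimal sketch, not the YZ Index's actual tooling: the dictionary layout and the function names `daily_deltas` and `flag_anomalies` are assumptions made for illustration, and only the four scores and the 15-point threshold come from this review.

```python
# Scores quoted in this review, as (yesterday, today) per dimension.
# The dictionary layout is an assumption for illustration only.
SCORES = {
    "material_constraint": (92.80, 74.50),
    "engineering_judgment": (58.40, 66.70),
}

# Taken from the "single-day fluctuation exceeding 15 points" criterion.
FLUCTUATION_THRESHOLD = 15.0


def daily_deltas(scores):
    """Return today's score minus yesterday's per dimension, rounded to 2 decimals."""
    return {name: round(today - prev, 2) for name, (prev, today) in scores.items()}


def flag_anomalies(deltas, threshold=FLUCTUATION_THRESHOLD):
    """Return the dimensions whose absolute single-day change exceeds the threshold."""
    return [name for name, delta in deltas.items() if abs(delta) > threshold]


deltas = daily_deltas(SCORES)
flagged = flag_anomalies(deltas)
print(deltas)   # per-dimension change since yesterday
print(flagged)  # dimensions over the 15-point threshold
```

With the figures above, only the material constraint dimension crosses the threshold; the engineering judgment change is positive and well under it.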
Need for Continued Attention
Continued attention is warranted. Material constraint is one of the two auditable dimensions of the YZ Index main list, and its weight directly affects the final ranking. If the model's material constraint score does not return above 85 points within the next three Smoke windows, expectations for its long-term stability will likely need to be downgraded. A single day's data is not enough to establish a permanent decline in model capability, but it is enough to put the model on the "watch list."
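The recovery rule described above can be written as a small decision function. This is a sketch under stated assumptions: the status labels ("recovered", "downgrade", "watch") and the function name `watch_status` are hypothetical, and the rule is read as "any score above 85 within the first three subsequent windows counts as recovery." Only the 85-point bar and the three-window limit come from the text.

```python
RECOVERY_THRESHOLD = 85.0  # from the text: the score must return above 85
WINDOW_LIMIT = 3           # from the text: within the next three Smoke windows


def watch_status(next_scores, threshold=RECOVERY_THRESHOLD, windows=WINDOW_LIMIT):
    """Classify a watch-listed model from its subsequent material constraint scores.

    Returns "recovered" if any of the first `windows` scores is above the threshold,
    "downgrade" if `windows` scores have been observed and none exceeded it,
    and "watch" while fewer than `windows` scores are available.
    The three labels are hypothetical names chosen for this sketch.
    """
    observed = list(next_scores)[:windows]
    if any(score > threshold for score in observed):
        return "recovered"
    if len(observed) >= windows:
        return "downgrade"
    return "watch"
```

For example, with made-up inputs, `watch_status([])` returns "watch", `watch_status([80.0, 86.5])` returns "recovered", and `watch_status([78.0, 82.0, 84.9])` returns "downgrade".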
The significance of daily quick tests is precisely to quickly capture such localized anomalies, rather than waiting for weekly or monthly rankings to discover problems.
An 18-point collapse in material constraint is a reminder for all models: the harder a model pursues safety, the more easily it stumbles in the scenarios that most require strict execution of instructions.
Data source: YZ Index | Run #132
© 2026 Winzheng.com 赢政天下 | When republishing, please credit the source and include a link to the original article