Claude Opus 4.7's Main Score Plunges 8.2 Points, Material Constraint Drops 18.3 in a Single Day

May 26, 2026 | Winzheng Index

Claude Opus 4.7 | Material Constraints | Smoke Test | Main Leaderboard Fluctuation | Integrity Rating

Claude Opus 4.7 scored 88.53 on the main leaderboard in today's Smoke review, down 8.2 points from yesterday, a decline that falls in the abnormal range for the daily ten-question quick test. Most of the loss came from the material constraint dimension, which fell from 92.80 to 74.50, a single-day drop of 18.3 points.
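The arithmetic behind these figures can be checked with a short sketch. The variable and function names are illustrative assumptions, not part of the index's tooling, and the previous day's main score is only inferred from the stated 8.2-point decline.

```python
# Scores quoted in the article. The previous main score is NOT reported;
# it is inferred here from "88.53, down 8.2 points" (an assumption).
material_yesterday = 92.80
material_today = 74.50
main_today = 88.53
main_drop = 8.2


def single_day_drop(previous: float, current: float) -> float:
    """Return how many points a score fell (positive means a decline)."""
    return round(previous - current, 2)


material_drop = single_day_drop(material_yesterday, material_today)
main_yesterday_estimate = round(main_today + main_drop, 2)

print(f"Material constraint drop: {material_drop}")
print(f"Implied main score yesterday: {main_yesterday_estimate}")
```

Rounding to two decimals matches the precision of the published scores and avoids floating-point noise in the subtraction.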

Sampling Fluctuation or Real Degradation

Smoke review tests only two material constraint questions per day, so the sample is very small; in theory, the single-day standard deviation can reach 12 to 15 points. Even so, the 18.3-point drop falls outside the historical 95% confidence interval. Yesterday both questions stayed strictly within the material boundaries, but today at least one answer clearly overstepped them or over-generated: the model still brought in outside common knowledge to supplement its answer despite the explicit instruction to "only use the given table data."
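As a rough sanity check, the drop can be expressed in standard-deviation units. This is a sketch under a normal approximation, which is an assumption and not necessarily how the index builds its historical interval; the function names are hypothetical.

```python
Z_95 = 1.96  # two-sided 95% cutoff for a normal distribution (assumed model)
DROP = 18.3  # single-day material constraint drop quoted in the article


def drop_in_sd_units(drop: float, sd: float) -> float:
    """Express a score drop as a multiple of the standard deviation."""
    return drop / sd


def max_sd_for_outlier(drop: float, z: float = Z_95) -> float:
    """Largest SD at which the drop still exceeds the z-cutoff."""
    return drop / z


# Theoretical upper-range SDs quoted in the article.
for sd in (12.0, 15.0):
    print(f"SD={sd}: drop is {drop_in_sd_units(DROP, sd):.2f} SDs")

print(f"Drop exceeds the 95% band only if SD < {max_sd_for_outlier(DROP):.2f}")
```

Under this approximation, a spread of 12 to 15 points would not by itself place an 18.3-point drop outside a 95% band, so the claim depends on the observed historical spread being narrower than that theoretical upper range.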

The engineering judgment dimension actually rose from 58.40 to 66.70, indicating that the model has not weakened across the board in scenarios requiring trade-offs. This suggests the problem is concentrated in the single capability of "answering strictly according to the material" rather than reflecting a general reasoning degradation.

Recent Industry Developments Align with the Timeline

Over the past two weeks, Anthropic has made minor iterations to the API security policy for the Claude series, focusing on strengthening "avoiding the generation of content that could be used to circumvent restrictions." This adjustment may have inadvertently amplified the model's sensitivity to "material constraint" type instructions, leading to excessive conservatism or misjudgment in boundary determination. Today's integrity rating also changed from pass to warn: the system detected that, on at least one question, the model gave an answer that was internally consistent but inconsistent with the material, which triggered an integrity flag.

In similar daily quick tests of comparable models, only three cases have shown a single-day fluctuation of more than 15 points in the material constraint dimension, and all three were accompanied by API-side policy updates. Claude Opus 4.7's result this time closely resembles those three cases.

Need for Continued Attention

Continued attention is warranted. Material constraint is one of the two auditable dimensions of the YZ Index main leaderboard, and its weight directly affects the final ranking. If the model's material constraint score does not return above 85 points within the next three Smoke windows, its long-term stability expectations will likely need to be downgraded. A single day's data is not enough to establish a permanent decline in model capability, but it is enough to place the model on the "watch list."
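The recovery rule described here can be sketched as a small classifier. The function name, labels, and handling of incomplete windows are assumptions made for illustration; only the 85-point threshold and the three-window horizon come from the text.

```python
from typing import Sequence

RECOVERY_THRESHOLD = 85.0  # from the article: score must return above 85
WINDOW_COUNT = 3           # from the article: next three Smoke windows


def stability_outlook(next_scores: Sequence[float]) -> str:
    """Classify the model given material constraint scores after the drop.

    Hypothetical labels:
      "recovered": some score within the window is above the threshold
      "watch":     fewer than three windows observed, none above threshold
      "downgrade": three windows observed, none above threshold
    """
    window = list(next_scores[:WINDOW_COUNT])
    if any(score > RECOVERY_THRESHOLD for score in window):
        return "recovered"
    if len(window) < WINDOW_COUNT:
        return "watch"
    return "downgrade"


print(stability_outlook([]))              # no follow-up data yet
print(stability_outlook([80.0, 86.5]))    # recovered in the second window
print(stability_outlook([80.0, 82.0, 84.0]))
```

Scores beyond the third window are ignored on purpose, since the text ties the downgrade decision to the next three windows only.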

The significance of daily quick tests is precisely to quickly capture such localized anomalies, rather than waiting for weekly or monthly rankings to discover problems.

An 18-point collapse in material constraint is a reminder for every model: the harder it pursues safety, the easier it is to stumble in exactly the scenarios where strict execution of instructions matters most.

Data source: YZ Index | Run #132
