Claude Opus 4.7 Smoke Evaluation Main Score Plunges 9 Points, Material Constraint Drops 20 Points in a Single Day

May 17, 2026 · Winzheng Index

Claude Opus 4.7 · Material Constraint · Smoke Quick Test · Performance Fluctuation · Anthropic Update

Claude Opus 4.7 lost 9 points on the main leaderboard in today's Smoke evaluation, falling from 97.75 to 88.75. The main driver is the material constraint dimension, which fell from 95 to 75. That is a 20-point loss in a single day, not a minor fluctuation.

Fluctuation or Degradation: Let the Data Speak First

The Smoke evaluation has only 10 questions per day, 2 per dimension, so the sample is small and day-to-day variance is naturally high. The code execution dimension still scored a perfect 100 today, which suggests the model has not collapsed in pure logic and execution chains. Engineering judgment actually rose from 30 to 38.4 points, while task expression was unchanged at 30 points. The only obvious decline was in material constraint.
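To make the small-sample point concrete, here is a minimal sketch of how much one question can move a two-question dimension. It assumes the dimension score is the plain mean of its two question scores on a 0-100 scale, which the article does not confirm; the function names and the per-question scores are hypothetical and chosen only to reproduce the reported 95 and 75.

```python
from statistics import mean

QUESTIONS_PER_DIMENSION = 2  # from the article: 10 questions per day, 2 per dimension


def dimension_score(question_scores):
    """ASSUMPTION: a dimension score is the plain mean of its question scores."""
    if len(question_scores) != QUESTIONS_PER_DIMENSION:
        raise ValueError("expected exactly 2 question scores")
    return mean(question_scores)


def swing_from_one_question(per_question_drop):
    """Movement in the dimension score when a single question drops by this much."""
    return per_question_drop / QUESTIONS_PER_DIMENSION


# Hypothetical per-question scores, not taken from the raw data.
yesterday = dimension_score([100, 90])  # mean is 95
today = dimension_score([100, 50])      # mean is 75

print(yesterday - today)                # the 20-point drop
print(swing_from_one_question(40))      # one question losing 40 points explains it
```

Under this assumption, a single badly answered question is enough to produce the full 20-point drop, which is why one day of data says little about the model itself.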

Material constraint mainly assesses the model's fidelity to given materials and boundary control. The 20-point drop likely comes from the two questions drawn today, which involved long document citations, fact-checking, or scenarios prohibiting extrapolation. The model may have exhibited over-summarization or added unauthorized information in one of the questions, directly resulting in a low score.

Interpretation in Light of Recent Industry Developments

Anthropic pushed a context optimization patch for the Claude 4 series within the past two weeks, officially described as improving long-document processing speed. Speed improvements sometimes come at the cost of strict boundary control. A similar situation occurred in the Claude 3 Opus era: after a context acceleration update, the grounding score declined for three consecutive days, then stabilized through fine-tuning.

Meanwhile, OpenAI o3-mini and Gemini 2.5 Pro have recently maintained stable material constraint scores in the 88-92 range in similar quick tests. If Claude Opus 4.7 wants to hold its place in the top tier of the main leaderboard, it must re-establish an advantage in the grounding dimension.

Whether It Deserves Continued Attention

A single-day decline of 9 points ranks in the top 15% of Smoke's historical records, but it does not yet warrant an immediate alarm. The recommendation is to watch the next three days: if material constraint stays below 80 points for two consecutive days, then, taken together with the volatile stability dimension (currently 31.7 points), there is reason to suspect systematic degradation of the model.
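The watch rule above can be sketched as a simple streak check. This is an illustrative helper, not part of the Smoke tooling: the function name, its parameters, and the sample score lists are hypothetical, and the stability dimension is left out because the article gives no threshold for it.

```python
def below_threshold_streak(daily_scores, threshold=80.0, run_length=2):
    """Return True if scores fall below `threshold` on `run_length` consecutive days.

    `daily_scores` is a list of material constraint scores, oldest first.
    """
    streak = 0
    for score in daily_scores:
        # Extend the streak on a low day, reset it on any day at or above threshold.
        streak = streak + 1 if score < threshold else 0
        if streak >= run_length:
            return True
    return False


# Hypothetical three-day windows starting from today's reported 75.
print(below_threshold_streak([75, 82, 79]))  # low days are not consecutive
print(below_threshold_streak([75, 78, 90]))  # two low days in a row
```

A single low day, such as today's 75 on its own, never trips the check, which matches the article's view that one day is not enough to raise an alarm.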

Currently, the most reasonable explanation is still random error from question selection. Claude Opus 4.7's code execution and engineering judgment remain robust, and the overall capability foundation has not been shaken.

Dropping 9 points in one Smoke evaluation may be bad luck; a weak material constraint score three times in a row is a signal.

Data source: YZ Index | Run #119
