Claude Opus 4.7 lost 9 points on the main leaderboard in today's Smoke evaluation, falling from 97.75 to 88.75. The main cause is the material constraint dimension, which dropped sharply from 95 to 75. This is no minor wobble: it is a 20-point loss in a single day.
Fluctuation or Degradation: Let the Data Speak First
The Smoke evaluation has only 10 questions per day, 2 per dimension. With just two questions behind each dimension score, variance is naturally high. Code execution still scored a perfect 100 today, which suggests the model has not collapsed on pure logic and execution chains. Engineering judgment actually rose from 30 to 38.4 points, and task expression held steady at 30 points. The only clear decline was in material constraint.
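To make the variance point concrete, here is a minimal sketch of how noisy a two-question dimension score can be. It assumes each day's dimension score is the plain mean of independent question scores, and it uses a hypothetical per-question standard deviation of 15 points; neither the scoring formula nor that value comes from the YZ Index, and the function name is illustrative.

```python
import math


def day_to_day_sd(per_question_sd: float, n_questions: int) -> float:
    """Standard deviation of the difference between two days' scores.

    Assumes each day's score is the mean of n_questions independent
    question scores with the same standard deviation (an assumption,
    not the documented Smoke scoring rule).
    """
    # One day's mean has SD sd / sqrt(n); the difference of two
    # independent daily means has SD sqrt(2) times that.
    return per_question_sd * math.sqrt(2.0 / n_questions)


ASSUMED_QUESTION_SD = 15.0  # hypothetical value, not measured

sd_diff = day_to_day_sd(ASSUMED_QUESTION_SD, n_questions=2)
drop = 95 - 75  # material constraint drop reported in this run
print(f"day-to-day SD: {sd_diff:.1f} points, drop = {drop / sd_diff:.2f} SDs")
```

Under these assumptions the 20-point drop is roughly 1.33 standard deviations of ordinary day-to-day noise, which is unusual but not rare. A smaller assumed per-question spread would make the same drop look more significant.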
Material constraint mainly assesses how faithfully the model sticks to the given materials and how well it respects their boundaries. The 20-point drop likely comes from the two questions drawn today, which may have involved long document citations, fact-checking, or scenarios that prohibit extrapolation. The model may have over-summarized or added unauthorized information on one of them, which would directly produce a low score.
Interpretation in Light of Recent Industry Developments
Anthropic pushed a context optimization patch for the Claude 4 series within the past two weeks, officially claiming that it improves long document processing speed. Speed improvements sometimes come at the cost of strict boundary control. A similar situation occurred during the Claude 3 Opus era: after a context acceleration update, the grounding score declined for three consecutive days, then stabilized after fine-tuning.
Meanwhile, OpenAI o3-mini and Gemini 2.5 Pro have recently maintained stable material constraint scores in the 88-92 range in similar quick tests. If Claude Opus 4.7 wants to hold its place in the top tier of the main leaderboard, it must re-establish an advantage in the grounding dimension.
Is It Worth Continued Attention?
A single-day decline of 9 points ranks in the top 15% of Smoke's historical records, but it does not yet warrant an immediate alarm. We recommend watching the next three days: if material constraint stays below 80 points for two consecutive days, and the stability dimension (currently 31.7 points, with considerable volatility) also stays weak, there is reason to suspect systematic degradation of the model.
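The watch rule above can be sketched as a small check over the follow-up scores. This is a minimal illustration rather than part of the YZ Index tooling: the function name and defaults are hypothetical, and it covers only the material constraint condition, because no threshold is given for the stability dimension.

```python
from typing import Sequence


def has_low_streak(
    scores: Sequence[float],
    threshold: float = 80.0,
    streak: int = 2,
) -> bool:
    """Return True if `streak` consecutive scores fall below `threshold`."""
    run = 0
    for score in scores:
        # Extend the current run on a low score, reset it otherwise.
        run = run + 1 if score < threshold else 0
        if run >= streak:
            return True
    return False


# Hypothetical follow-up scores for the next three days (not real data).
print(has_low_streak([78.0, 76.0, 90.0]))  # two low days in a row
print(has_low_streak([78.0, 92.0, 79.0]))  # low days are not consecutive
```

A score of exactly 80 does not count as low here, which matches the "below 80 points" wording.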
Currently, the most reasonable explanation is still random error from question selection. Claude Opus 4.7's code execution and engineering judgment remain robust, and the overall capability foundation has not been shaken.
Dropping 9 points in one Smoke evaluation may be bad luck; losing ground on material constraint three runs in a row is a signal.
Data source: YZ Index | Run #119
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article.