Claude Opus 4.7's main ranking score in today's Smoke evaluation fell from 93.48 to 70.93, a single-day decline of 22.6 points. The code execution dimension dropped from a perfect 100 to 50 and is the core driver of the decline.
Data Breakdown: Single Dimension Dominates the Decline
Compared with yesterday, code execution swung by -50 points, while material constraints rose 11 points, from 85.50 to 96.50, and engineering judgment and task expression rose by 16.7 and 20 points respectively. The main ranking is a weighted combination of only two dimensions, code execution and material constraints, so the gains in engineering judgment and task expression do not count toward it, and the collapse in code execution directly determined the overall result.
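As a rough consistency check on the two-dimension claim, the sketch below back-solves the weight on code execution from each day's reported figures. It assumes the main ranking is a linear weighted average of the two dimensions with weights summing to 1. The weights are not published in this report, so `solve_code_weight` and the inferred value are illustrative only. Both days come out close to 0.55, which supports the reading that code execution carries slightly more than half of the main score.

```python
def solve_code_weight(main: float, code: float, material: float) -> float:
    """Solve main = w * code + (1 - w) * material for w.

    The two-term linear model is an assumption; the report does not state
    the actual formula or weights.
    """
    if code == material:
        raise ValueError("weight is not identifiable when both scores are equal")
    return (main - material) / (code - material)


# Figures reported in this article.
w_yesterday = solve_code_weight(main=93.48, code=100.0, material=85.50)
w_today = solve_code_weight(main=70.93, code=50.0, material=96.50)

# Decompose the day-over-day change using an assumed weight of 0.55.
ASSUMED_W = 0.55
code_contribution = ASSUMED_W * (50.0 - 100.0)
material_contribution = (1 - ASSUMED_W) * (96.50 - 85.50)
net_change = code_contribution + material_contribution

print(f"inferred weight (yesterday): {w_yesterday:.4f}")
print(f"inferred weight (today):     {w_today:.4f}")
print(f"code execution contribution: {code_contribution:+.2f}")
print(f"material constraints contribution: {material_contribution:+.2f}")
print(f"net change in main score:    {net_change:+.2f}")
```

Under that assumed weight, the loss from code execution outweighs the gain from material constraints several times over, and the net change lines up with the reported drop from 93.48 to 70.93.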
The Smoke evaluation has only 10 questions per day, 2 per dimension, so the sample size is extremely small: a single wrong answer can swing a dimension by 50 points. This fits what the stability dimension measures. A stability score of 31.7 points indicates that the model's output consistency on similar questions is low.
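The arithmetic behind the 50-point swing can be sketched as follows. This assumes equal-weight pass/fail scoring on a 0 to 100 scale, which matches the 100 to 50 drop in code execution; the actual rubric is not described here, and the function names are hypothetical.

```python
def score_step(num_questions: int) -> float:
    """Points gained or lost per question under equal-weight pass/fail scoring."""
    if num_questions <= 0:
        raise ValueError("num_questions must be positive")
    return 100.0 / num_questions


def possible_scores(num_questions: int) -> list[float]:
    """Every score a dimension can take when each question is pass/fail."""
    step = score_step(num_questions)
    return [correct * step for correct in range(num_questions + 1)]


# With 2 questions per dimension, only three scores are reachable.
print(possible_scores(2))

# A larger sample shrinks the effect of a single wrong answer.
for n in (2, 5, 10, 20):
    print(f"{n:>2} questions -> {score_step(n):.1f} points per question")
```

With so few reachable values, a one-question miss is indistinguishable from a large capability change on any single day.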
Fluctuation or Degradation: Third-Day Verification Needed
With only one day of data, the drop still falls within the range of sampling fluctuation. If the code execution dimension stays below 60 points for three consecutive days, it can be provisionally treated as a real change in model capability. Continue tracking the same dimension tomorrow: if the score rebounds above 80 points, this decline was most likely caused by a sudden increase in question difficulty.
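The tracking rule above can be sketched as a small classifier. The thresholds (60, 80, three consecutive days) come from the text; the function name, the labels, and the choice to treat a rebound as "previous day below 60, latest day above 80" are assumptions made for illustration.

```python
from typing import Sequence

LOW_THRESHOLD = 60.0      # "below 60 points"
REBOUND_THRESHOLD = 80.0  # "rebounds above 80 points"
WINDOW_DAYS = 3           # "three consecutive days"


def classify_trend(daily_scores: Sequence[float]) -> str:
    """Classify a dimension's recent history, oldest score first.

    Hypothetical helper: the labels and the rebound definition are
    illustrative, not part of the Smoke evaluation itself.
    """
    recent = list(daily_scores[-WINDOW_DAYS:])
    if len(recent) == WINDOW_DAYS and all(s < LOW_THRESHOLD for s in recent):
        return "probable capability change"
    if (
        len(daily_scores) >= 2
        and daily_scores[-2] < LOW_THRESHOLD
        and daily_scores[-1] > REBOUND_THRESHOLD
    ):
        return "probable difficulty spike"
    return "inconclusive"


# Current situation: one low day after a perfect score.
print(classify_trend([100.0, 50.0]))
# Hypothetical continuations for illustration only.
print(classify_trend([100.0, 50.0, 85.0]))
print(classify_trend([100.0, 50.0, 55.0, 40.0]))
```

The first call reflects the data available today; the other two use made-up future scores purely to show how the rule would resolve.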
Notably, the integrity rating changed from "warn" to "pass", suggesting that the model hallucinated less or was less overconfident in this round of answers, in contrast to the points lost on code execution. One possible explanation is that the model adopted a more conservative output strategy on code tasks, which lowered its code execution score.
Short-Term Observation in Industry Context
Anthropic has recently been focusing on reasoning alignment and safety training for the Claude 4 series. Some developers have reported that in complex code generation scenarios, the model tends to provide step-by-step explanations rather than directly outputting complete code. This behavioral change may conflict with the scoring criteria for code execution questions in the Smoke evaluation.
If this trend continues, Claude Opus 4.7's competitiveness in programming assistant applications would be directly affected. Before drawing long-term conclusions, check the code execution sample distribution in next week's full evaluation.
On its own, a single-day swing of 22.6 points in the main ranking does not warrant an emergency alert, but continued tracking over three days remains necessary.
Data source: YZ Index | Run #123
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article