Claude Opus 4.7 Main Ranking Plummets 22.6 Points, Code Execution Halved from 100

May 19, 2026 | Winzheng Index

Tags: Claude Opus 4.7, Code Execution, Smoke Test, Model Fluctuations, Anthropic

Claude Opus 4.7's main ranking score in today's Smoke evaluation fell from 93.48 to 70.93, a single-day decline of 22.55 points (about 22.6). The code execution dimension dropped from a perfect 100 to 50 and is the core driver of the decline.

Data Breakdown: Single Dimension Dominates the Decline

Compared with yesterday, the code execution dimension swung by -50 points, while material constraints rose from 85.50 to 96.50, and engineering judgment and task expression rose by 16.7 and 20 points respectively. The main ranking score is a weighted combination of code execution and material constraints only, so the gains in the other dimensions could not offset the loss, and the collapse in code execution determined the overall result.
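The article does not publish the weights behind the main ranking score, but they can be estimated from the two days of figures quoted above. The sketch below is a reconstruction under the assumption that the score is a simple two-term weighted average; the function name and the rounded weight of 0.55 are illustrative, not values confirmed by the index.

```python
# Hypothetical reconstruction: the weights are NOT stated in the article.
# Assumes main = w * code + (1 - w) * material, with a fixed weight w.

def implied_code_weight(main, code, material):
    """Solve main = w*code + (1-w)*material for the code execution weight w."""
    return (main - material) / (code - material)

# Figures quoted in the article.
yesterday = {"main": 93.48, "code": 100.0, "material": 85.50}
today = {"main": 70.93, "code": 50.0, "material": 96.50}

w_yesterday = implied_code_weight(**yesterday)
w_today = implied_code_weight(**today)

# Both days imply nearly the same weight, so use 0.55 as a working assumption.
W_CODE = 0.55
code_effect = W_CODE * (today["code"] - yesterday["code"])
material_effect = (1 - W_CODE) * (today["material"] - yesterday["material"])
net_change = code_effect + material_effect

print(f"implied code weight: {w_yesterday:.3f} (yesterday), {w_today:.3f} (today)")
print(f"code execution effect: {code_effect:+.2f}")
print(f"material constraints effect: {material_effect:+.2f}")
print(f"net change in main score: {net_change:+.2f}")
```

Under this assumption both days imply a code execution weight of about 0.55. The 50-point drop then costs roughly 27.5 points of main score, and the 11-point gain in material constraints recovers about 4.95, which matches the observed net decline of 22.55.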

The Smoke evaluation has only 10 questions per day, 2 per dimension, so the sample size is extremely small: a single wrong answer can move a dimension score by 50 points. This fits what the stability dimension measures. A stability score of 31.7 points indicates that the model's output consistency on similar questions is low.
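To illustrate how coarse a two-question dimension is, the sketch below lists the possible scores and their probabilities. It assumes independent, equally weighted, all-or-nothing questions, and the 90% per-question success rate is a hypothetical value chosen for illustration, not a figure from the evaluation.

```python
from math import comb

def score_distribution(n_questions, p_correct):
    """Probability of each possible dimension score.

    Assumes independent, equally weighted, all-or-nothing questions
    (an assumption; the article does not describe the scoring rubric).
    """
    dist = {}
    for k in range(n_questions + 1):
        score = 100.0 * k / n_questions
        prob = comb(n_questions, k) * p_correct**k * (1 - p_correct) ** (n_questions - k)
        dist[score] = prob
    return dist

# Hypothetical model that answers each question correctly 90% of the time.
dist = score_distribution(n_questions=2, p_correct=0.9)
p_at_most_50 = sum(p for score, p in dist.items() if score <= 50.0)

for score, prob in sorted(dist.items()):
    print(f"score {score:5.1f}: probability {prob:.2f}")
print(f"chance of scoring 50 or lower on a given day: {p_at_most_50:.2f}")
```

With two questions the only possible scores are 0, 50, and 100. Under these assumptions, even a model that is right 90% of the time lands at 50 or lower on about 19% of days, so a single day at 50 says little on its own.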

Fluctuation or Degradation: Need Third-Day Verification

If the code execution dimension stays below 60 points for three consecutive days, the drop can be provisionally treated as a real change in model capability. With only one day of data, it remains within the range of sampling fluctuation. We recommend tracking the same dimension again tomorrow. If the score rebounds above 80 points, the decline was most likely caused by a sudden increase in question difficulty.
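The tracking rule described above can be written as a small check. This is a minimal sketch of that rule only; the function name, labels, and input format are hypothetical and do not describe how the index itself raises alerts.

```python
def classify_trend(scores, low=60.0, rebound=80.0, days_required=3):
    """Classify a run of daily code execution scores, oldest first.

    Thresholds follow the article: below 60 for three consecutive days
    suggests a real change; a rebound above 80 suggests a difficulty spike.
    """
    if not scores:
        raise ValueError("at least one daily score is required")

    recent = scores[-days_required:]
    if len(recent) == days_required and all(s < low for s in recent):
        return "likely capability change"

    # A rebound only counts if an earlier day actually fell below the threshold.
    dropped_before = any(s < low for s in scores[:-1])
    if dropped_before and scores[-1] > rebound:
        return "likely difficulty spike"

    return "inconclusive, keep tracking"

print(classify_trend([50.0]))              # only today's data point
print(classify_trend([50.0, 100.0]))       # rebound on the second day
print(classify_trend([50.0, 50.0, 50.0]))  # three consecutive low days
```

With only today's score of 50 available, the rule returns the inconclusive label, which is the position the article takes.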

Notably, the integrity rating changed from "warn" to "pass", indicating that the model showed less hallucination or overconfidence in this round of answers. This contrasts with the points lost on code execution. One possible explanation is that the model adopted a more conservative output strategy on code tasks, which lowered its scores.

Short-Term Observation in Industry Context

Anthropic has recently been focusing on reasoning alignment and safety training for the Claude 4 series. Some developers have reported that in complex code generation scenarios the model tends to give step-by-step explanations rather than directly outputting complete code. This behavioral change may conflict with the scoring criteria for code execution questions in the Smoke evaluation.

If this trend continues, Claude Opus 4.7's competitiveness in programming assistant applications will be directly affected. Before drawing long-term conclusions, readers should check the code execution sample distribution in next week's full evaluation.

A single-day swing of 22.6 points in the main ranking does not by itself warrant an emergency alert, but three days of continuous tracking remain necessary.

Data source: YZ Index | Run #123

Related Reviews

Winzheng Index Claude Opus 4.7 Main Score Plunges 16 Points in Smoke Test, Code Execution Drops 27.2 in a Single Day

Winzheng Index Claude Opus 4.7 Code Execution Plummets from 100 to 50, Main Score Drops 25.7 Points in a Single Day

Winzheng Index Execution Scores of 4 Models Plummet to 50, ERNIE Bot (文心一言) Main Ranking Drops 34.1 Points

Winzheng Index Qwen3 Max Main Score Plummets 19.2 Points, Code Execution Drops 31.2 Points in a Single Day