Claude Opus 4.7 Smoke Evaluation: Main Benchmark Drops 27.5 Points, Code Execution Falls from 100 to 50

In the June 2026 YZ Index test of 11 models, Claude Opus 4.7's main benchmark score in the Smoke Evaluation dropped from 100.00 yesterday to 72.50 today, with the code execution dimension falling from 100.00 to 50.00.

Single-Day Data Breakdown

The code execution dimension fell by 50 points, the material constraint dimension remained unchanged at 100.00, the engineering judgment dimension rose from 83.40 to 100.00, and the task expression dimension stayed at 100.00. The main benchmark score dropped by 27.5 points, and the integrity rating remained at pass.

Judgment on the Source of Volatility

The Smoke Evaluation has only 10 questions per day, with 2 questions per dimension, so differences in the daily draw can produce large score swings. If the two questions in a dimension are weighted equally, a single failed question moves that dimension's score by 50 points. Today's 50.00 is consistent with one of the two code execution questions failing, whereas yesterday both passed. The material constraint dimension achieved full marks on both days, indicating that the model's output still meets the constraint requirements in that dimension.
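To make the granularity concrete, here is a minimal sketch of the scores a two-question dimension can produce. It assumes pass/fail scoring, equal question weights, and independent questions, none of which the YZ Index confirms. The 0.9 pass rate is a hypothetical value chosen for illustration, not a measured figure.

```python
from math import comb


def score_distribution(n_questions: int, pass_rate: float) -> dict[float, float]:
    """Map each attainable dimension score (0-100) to its probability.

    Assumes every question is pass/fail, equally weighted, and passed
    independently with the same probability `pass_rate` (all hypothetical).
    """
    dist = {}
    for passed in range(n_questions + 1):
        score = 100.0 * passed / n_questions
        prob = (
            comb(n_questions, passed)
            * pass_rate**passed
            * (1.0 - pass_rate) ** (n_questions - passed)
        )
        dist[score] = prob
    return dist


# Hypothetical model that passes 90% of code execution questions.
dist = score_distribution(n_questions=2, pass_rate=0.9)
for score, prob in sorted(dist.items()):
    print(f"score {score:6.2f}: probability {prob:.2f}")
```

Under these assumptions only three scores are possible (0, 50, 100), and a model with an unchanged 90% pass rate would still land on 50.00 on roughly 18% of days.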

The engineering judgment dimension rose by 16.6 points and the task expression dimension did not change, so there is no sign of systematic degradation in the model's overall capability. A 50-point drop in a single dimension is more consistent with random variation in the difficulty of the drawn questions than with any change in model parameters or training.

Whether Continuous Attention Is Needed

If the code execution score in the Smoke Evaluation stays below 70 points on each of the next three days, a formal long-term benchmark retest will be required. The current single-day data reflects draw fluctuation and is not sufficient to conclude that the model's true capability has declined. The full score in the material constraint dimension is consistent with the model's fundamental capabilities remaining within the normal range.
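The retest rule can be sketched as a small check. The function name, parameters, and sample scores below are hypothetical illustrations and are not part of the YZ Index tooling. The sketch reads "the next three days" as the first three follow-up scores after today's run.

```python
def needs_formal_retest(
    followup_scores: list[float],
    threshold: float = 70.0,
    required_days: int = 3,
) -> bool:
    """Return True when each of the first `required_days` follow-up scores
    is below `threshold`. With fewer days of data, no decision is made yet."""
    if len(followup_scores) < required_days:
        return False
    return all(score < threshold for score in followup_scores[:required_days])


# Hypothetical follow-up sequences for the code execution dimension.
still_low = [50.0, 50.0, 0.0]
recovered = [50.0, 100.0, 50.0]
too_early = [50.0]

print(needs_formal_retest(still_low))   # True
print(needs_formal_retest(recovered))   # False
print(needs_formal_retest(too_early))   # False
```

Requiring every one of the three days to fall below the threshold keeps a single unlucky draw from triggering a retest on its own.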

When the daily quick test has a large standard deviation, a single-day score should not be equated with the model's long-term performance. The change for Claude Opus 4.7 is concentrated in the code execution dimension; the other dimensions are stable or rising, and the overall result remains within an acceptable range of fluctuation.
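As a rough illustration of why a quick test is noisy, the sketch below computes the standard deviation of a dimension score as a function of the number of questions. It assumes independent pass/fail questions with equal weights and a hypothetical 90% pass rate. The 10- and 50-question rows are comparison points only and do not describe any actual YZ Index configuration.

```python
from math import sqrt


def score_std(n_questions: int, pass_rate: float) -> float:
    """Standard deviation, in points on a 0-100 scale, of the fraction of
    questions passed, under a binomial model (an assumption, not the
    benchmark's documented scoring)."""
    return 100.0 * sqrt(pass_rate * (1.0 - pass_rate) / n_questions)


for n in (2, 10, 50):
    print(f"{n:3d} questions: std = {score_std(n, 0.9):.2f} points")
```

Under these assumptions, a two-question dimension has a standard deviation of about 21 points, which falls to about 4 points with 50 questions.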

A one-day halving of code execution may not be a sign of degradation; three consecutive days of low performance is the real warning.

Data source: YZ Index | Run #195