Claude Sonnet 4.6 Smoke Main Ranking Plunges 15.3 Points, Code Execution Drops 25 Points in a Single Day

In the June 2026 Smoke evaluation of the YZ Index, Claude Sonnet 4.6's main ranking score dropped from 97.84 to 82.52 points, a single-day decline of 15.3 points.

Core Dimension Changes

The code execution dimension fell from 100.00 points yesterday to 75.00 points today, a drop of 25 points; the material constraint dimension dropped from 95.20 points to 91.70 points, a decline of 3.5 points. In contrast, two side dimensions posted notable gains: engineering judgment rose from 89.60 points to 100.00 points (up 10.4), and task expression increased from 75.80 points to 92.50 points (up 16.7).
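As a quick check on the arithmetic above, the day-over-day changes can be recomputed from the reported scores. This is a minimal sketch: the scores are the ones quoted in this article, while the dictionary layout and variable names are illustrative and not part of the YZ Index.

```python
# Scores as reported in this article: (yesterday, today).
scores = {
    "main ranking": (97.84, 82.52),
    "code execution": (100.00, 75.00),
    "material constraint": (95.20, 91.70),
    "engineering judgment": (89.60, 100.00),
    "task expression": (75.80, 92.50),
}

# Day-over-day change per dimension, rounded to two decimals to hide float noise.
deltas = {
    name: round(today - yesterday, 2)
    for name, (yesterday, today) in scores.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

The main ranking change works out to 15.32 points, which the headline rounds to 15.3.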

Sampling Characteristics of Smoke Evaluation

The Smoke evaluation consists of only 10 questions per day, 2 per dimension, so the sample size is extremely small. With two questions per dimension, a single harder question can move a dimension score by tens of points, so a 25-point single-day swing in code execution is within the normal range under this evaluation framework. The material constraint dimension dropped only 3.5 points, which suggests that the model's foundational ability to follow constraints has not degraded systematically.
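To illustrate how coarse a two-question dimension is, here is a rough sketch. It assumes each question is scored from 0 to 100 and that the two questions are weighted equally; the YZ Index scoring formula is not given in this article, so both points are assumptions, and the per-question scores below are hypothetical.

```python
from statistics import mean


def dimension_score(question_scores):
    """Average per-question scores (0-100) into a single dimension score.

    Equal weighting is an assumption made for illustration only.
    """
    return mean(question_scores)


# Hypothetical per-question scores that would reproduce the reported totals.
yesterday = dimension_score([100.0, 100.0])
today = dimension_score([100.0, 50.0])  # half the credit lost on one question

print(yesterday, today, yesterday - today)
```

Under these assumptions, losing half the credit on one question is enough to produce the full 25-point drop. Other splits, such as 75 on both questions, would give the same dimension score.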

Both side dimensions, engineering judgment and task expression, improved at the same time, suggesting that on the questions sampled this time the model showed sounder judgment and clearer expression. This further supports the view that the fluctuation is due primarily to differences in question difficulty rather than to an overall decline in model capability.

Is Continued Monitoring Warranted?

The 15.3-point drop in the main ranking was driven primarily by a single dimension, code execution, which scored a perfect 100 yesterday but only 75 today. The most likely explanation is that at least one of today's two code execution questions was significantly harder than yesterday's. The integrity rating remains "pass", with no abnormal signals detected.

Based on the current data, this decline in Claude Sonnet 4.6 is more likely a sampling fluctuation than genuine degradation. We recommend monitoring the Smoke data over the next 3-5 days. If code execution consistently scores below 85 points during that period, a formal long-run re-evaluation should be considered.
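The monitoring rule above could be expressed as a simple check. This is a sketch under stated assumptions: "consistently" is read here as "on every observed day", the minimum window of 3 days comes from the low end of the 3-5 day range, and the function name and the sample scores are hypothetical.

```python
def needs_long_run_reevaluation(daily_scores, threshold=85.0, min_days=3):
    """Return True if at least `min_days` scores were observed and all are below `threshold`.

    Interpreting "consistently below 85" as "every observed day below 85"
    is an assumption; the article does not define it precisely.
    """
    if len(daily_scores) < min_days:
        return False  # not enough observations to decide yet
    return all(score < threshold for score in daily_scores)


# Hypothetical code execution scores for the following days.
recovering = [75.0, 100.0, 100.0]
persistently_low = [75.0, 75.0, 50.0, 75.0]

print(needs_long_run_reevaluation(recovering))        # False
print(needs_long_run_reevaluation(persistently_low))  # True
```

A stricter or looser reading, for example "the average over the window is below 85", would change only the final line of the function.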

A Smoke plunge is more likely the luck of two questions than a decline in the model.

Data source: YZ Index | Run #205