Claude Sonnet 4.6 Smoke Review Main Score Plummets 25.9 Points, Code Execution Drops from 100 to 50

In the June 2026 Smoke review of the YZ Index, Claude Sonnet 4.6's main score dropped from 96.45 to 70.52, code execution fell from 100.00 to 50.00, and material constraint rose from 92.10 to 95.60.

Sharp Fluctuation Driven by a Single Dimension

The 25.9-point decline in the main score was driven almost entirely by the code execution dimension, which fell from 100.00 the previous day to 50.00, a drop of 50 points. Material constraint, by contrast, rose 3.5 points from 92.10 to 95.60, engineering judgment held at 100.00, and task expression rose from 84.20 to 87.50. Of the two core main-score dimensions, only code execution fell off a cliff.
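To make the attribution concrete, here is a minimal sketch that splits the main-score change into per-dimension contributions. The 0.55 / 0.45 weights are an assumption: the YZ Index weighting is not stated in this report, and these values were chosen only because they reproduce both published main scores to within rounding. The function names are hypothetical.

```python
# ASSUMPTION: weights inferred from the published figures, not documented by the YZ Index.
WEIGHTS = {"code_execution": 0.55, "material_constraint": 0.45}

# Dimension scores reported for the previous day and the current day.
previous = {"code_execution": 100.00, "material_constraint": 92.10}
current = {"code_execution": 50.00, "material_constraint": 95.60}


def main_score(scores, weights=WEIGHTS):
    """Weighted sum of the core dimensions (assumed formula)."""
    return sum(weights[d] * scores[d] for d in weights)


def contributions(before, after, weights=WEIGHTS):
    """Per-dimension contribution to the change in the main score."""
    return {d: weights[d] * (after[d] - before[d]) for d in weights}


print(f"previous main score: {main_score(previous):.3f}")
print(f"current main score:  {main_score(current):.3f}")
for dim, delta in contributions(previous, current).items():
    print(f"{dim}: {delta:+.3f}")
```

Under these assumed weights, code execution accounts for roughly -27.5 points and material constraint for roughly +1.6, which nets out to the reported decline of about 25.9 points.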

Characteristics of the Smoke Review and the Impact of the Question Lottery

The Smoke review draws only 10 questions per day, 2 per dimension, so day-to-day score variance is naturally high. This time, the code execution draw may have landed on questions sensitive to specific programming scenarios, costing the model 50 points in a single day. Material constraint rose slightly over the same period, which suggests no systematic problem with the model's basic ability to follow constraints.
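A rough sketch of how much a 2-question draw can swing a dimension score is given below. It assumes each question is scored pass/fail, weighted equally, and drawn independently. None of this is confirmed by the report, and the fractional scores in other dimensions suggest the actual scoring allows partial credit. The per-question pass rates are hypothetical, not measured values.

```python
from math import comb


def prob_score_at_most(threshold, pass_rate, n_questions=2):
    """Probability that the dimension score (percent of questions passed)
    is <= threshold, under a simple binomial model (an assumption)."""
    total = 0.0
    for passed in range(n_questions + 1):
        score = 100.0 * passed / n_questions
        if score <= threshold:
            total += (
                comb(n_questions, passed)
                * pass_rate ** passed
                * (1 - pass_rate) ** (n_questions - passed)
            )
    return total


# Hypothetical per-question pass rates, for illustration only.
for p in (0.80, 0.90, 0.95):
    print(f"pass rate {p:.2f}: P(score <= 50) = {prob_score_at_most(50, p):.3f}")
```

Under this simplified model, even a model that passes 90% of code execution questions would score 50 or lower on about one day in five, so a single 50.00 is weak evidence on its own.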

Real Degradation or Random Fluctuation?

On a single day of data, this looks more like random fluctuation caused by the question lottery than real degradation. Engineering judgment held at 100.00 for a second consecutive day, task expression rose slightly, and the integrity rating remained at pass; there was no synchronized decline across dimensions. Real model degradation typically shows up as simultaneous deterioration across several dimensions, not an isolated 50-point drop in one.
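The "synchronized decline" check can be sketched as a simple day-over-day comparison. The dimension scores below are the ones quoted in this report. The function name and the `tolerance` parameter are hypothetical and are not part of any YZ Index tooling.

```python
# Dimension scores quoted in this report (previous day vs. current day).
previous = {
    "code_execution": 100.00,
    "material_constraint": 92.10,
    "engineering_judgment": 100.00,
    "task_expression": 84.20,
}
current = {
    "code_execution": 50.00,
    "material_constraint": 95.60,
    "engineering_judgment": 100.00,
    "task_expression": 87.50,
}


def declined_dimensions(before, after, tolerance=0.0):
    """Return the dimensions whose score fell by more than `tolerance` points."""
    return [d for d in before if after[d] < before[d] - tolerance]


declined = declined_dimensions(previous, current)
print(f"{len(declined)} of {len(previous)} dimensions declined: {declined}")
```

Only one of the four dimensions declined, which fits the isolated-drop pattern described above rather than a broad deterioration.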

Should We Continue to Monitor?

We recommend placing Claude Sonnet 4.6 on the watchlist for tomorrow's Smoke review. If code execution stays below 70 points for two consecutive days, the result should be cross-checked against formal evaluation data to determine whether a version-level change has occurred. For now, a single-day 50-point drop is not enough to conclude that the model's capabilities have systematically degraded.
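The escalation rule above can be expressed as a small check over the daily code execution scores. This is a sketch only. The threshold of 70 and the two-day window come from the recommendation, while the function name and the sample follow-up score of 60.0 are hypothetical.

```python
def needs_escalation(daily_scores, threshold=70.0, days=2):
    """True if the most recent `days` scores are all below `threshold`.

    `daily_scores` is ordered oldest to newest. A history shorter than
    `days` never triggers escalation.
    """
    recent = daily_scores[-days:]
    return len(recent) == days and all(score < threshold for score in recent)


# Scores so far: 100.00 on the previous day, 50.00 on the current day.
print(needs_escalation([100.0, 50.0]))        # one low day only

# HYPOTHETICAL follow-up: a second low day (60.0 is an invented value).
print(needs_escalation([100.0, 50.0, 60.0]))  # two consecutive low days
```

With the data available today the rule does not fire. It would fire only if tomorrow's code execution score also came in below 70.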

The halving of the code execution score, a 50-point drop, is more likely a product of the question lottery than of a sudden model failure.

Data source: YZ Index | Run #201