Claude Sonnet 4.6 Code Execution Plunges from 100 to 50, Main Score Drops 6.9 Points

Jun 16, 2026 | Source: Winzheng Index

Claude Sonnet 4.6 Code Execution Smoke Test: Single-Day Swing in the Main Leaderboard Ranking

In the YZ Index June 2026 evaluation of 11 models, Claude Sonnet 4.6's code execution score in the Smoke evaluation dropped from 100.00 yesterday to 50.00 today, and its main score fell from 79.44 to 72.50.

Direct Data on a One-Day 50-Point Drop

The Smoke evaluation consists of only 10 questions per day, with 2 questions per dimension. Code execution went from a perfect 100.00 yesterday to 50.00 today, while material constraints rose from 54.30 to 100.00, engineering judgment rose from 75.50 to 95.90, and task expression rose from 84.50 to 100.00. Of the four dimensions reported here, three rose significantly and only code execution fell, by 50 points, yet the main score still recorded a net drop of 6.94 points (79.44 to 72.50).
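The day-over-day changes quoted above can be checked with a few lines of Python. This is only an arithmetic sketch over the published numbers: the names `SCORES`, `DIMENSIONS`, `score_deltas`, and `unweighted_mean` are illustrative, and the article does not state how YZ Index weights dimensions into the main score.

```python
# Scores quoted in the article, as (yesterday, today) pairs.
SCORES = {
    "code execution": (100.00, 50.00),
    "material constraints": (54.30, 100.00),
    "engineering judgment": (75.50, 95.90),
    "task expression": (84.50, 100.00),
    "main score": (79.44, 72.50),
}

# The four dimensions the article reports individually.
DIMENSIONS = [
    "code execution",
    "material constraints",
    "engineering judgment",
    "task expression",
]


def score_deltas(table):
    """Return today's score minus yesterday's for each entry, rounded to 2 decimals."""
    return {
        name: round(today - yesterday, 2)
        for name, (yesterday, today) in table.items()
    }


def unweighted_mean(table, names, day):
    """Plain average of the named entries; day 0 is yesterday, day 1 is today."""
    return sum(table[name][day] for name in names) / len(names)


if __name__ == "__main__":
    for name, delta in score_deltas(SCORES).items():
        print(f"{name:22s} {delta:+.2f}")
    before = unweighted_mean(SCORES, DIMENSIONS, 0)
    after = unweighted_mean(SCORES, DIMENSIONS, 1)
    print(f"unweighted mean of the four dimensions: {before:.3f} -> {after:.3f}")
```

A plain average of these four dimension scores would have risen (from roughly 78.6 to roughly 86.5), so the main score is evidently not a simple average of them. The published figures are consistent with code execution carrying extra weight or with additional inputs feeding the main score, but the source does not say which.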

Question Sampling Volatility or True Degradation?

Because questions in the Smoke evaluation are randomly selected each day, daily scores inherently have a large standard deviation. Claude Sonnet 4.6's perfect scores in both material constraints and task expression show that the model performed stably on constraint and expression questions within the same run; only the code execution dimension hit an extreme low. This is more likely a local fluctuation caused by question sampling than an overall degradation of the model's capabilities.

The code execution dimension has only 2 questions, and a single mistake can cause a score drop of 50 points.
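To make the granularity concrete, here is a minimal sketch under two labeled assumptions: each code execution question is graded pass/fail with equal weight, and the two questions are independent. Other dimensions report values such as 75.50, so they presumably allow partial credit and are not covered by this model. The function names and the 0.9 accuracy figure are hypothetical, not taken from YZ Index.

```python
def possible_scores(num_questions):
    """Scores reachable with equal-weight pass/fail grading (an assumption)."""
    return [100 * passed / num_questions for passed in range(num_questions + 1)]


def prob_at_least_one_miss(p_correct, num_questions):
    """Chance of missing at least one question, assuming independent questions
    that are each answered correctly with probability p_correct."""
    return 1 - p_correct ** num_questions


if __name__ == "__main__":
    # With 2 questions the dimension can only land on three values.
    print(possible_scores(2))
    # Hypothetical model that solves 90% of such questions in the long run.
    print(f"{prob_at_least_one_miss(0.9, 2):.2f}")
```

Under these assumptions, a model that answers 90% of such questions correctly would still fall to 50 or below on about 19% of days, which is why a single day at 50 carries little evidence on its own.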

Is Continued Monitoring Warranted?

The main score dropped by only 6.94 points and the integrity rating remains "pass", so current data is insufficient to conclude that there is systematic degradation. However, the code execution dimension fell from a perfect score to 50 in a single day, a swing larger than normal sampling variation would typically produce. If the same dimension remains low tomorrow, monitoring frequency should be increased.

Taken together, today's scores show a clear divergence across dimensions for Claude Sonnet 4.6 in the Smoke evaluation: material constraints and task expression reached their maximum, while code execution hit an extreme low. The decline in the main score is driven primarily by the single code execution dimension, not by a simultaneous drop across several dimensions.

The score distribution in this dimension should be observed over the next 2-3 days to distinguish random fluctuation from a genuine change in capability.
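One way to turn this follow-up into a rule is sketched below. The function name `assess_followup`, the threshold of 50, and the two-day minimum are all hypothetical choices made for illustration; the source gives no formal decision rule.

```python
def assess_followup(daily_scores, low_threshold=50.0, min_days=2):
    """Classify a dimension's scores observed on the days after a drop.

    daily_scores: scores for the same dimension on the following days.
    low_threshold and min_days are illustrative defaults, not YZ Index settings.
    """
    if len(daily_scores) < min_days:
        return "insufficient data"
    if all(score <= low_threshold for score in daily_scores):
        return "persistently low: increase monitoring"
    return "consistent with sampling noise"


if __name__ == "__main__":
    print(assess_followup([50.0]))                # only one follow-up day so far
    print(assess_followup([100.0, 50.0, 100.0]))  # recovers on most days
    print(assess_followup([50.0, 0.0, 50.0]))     # stays low across the window
```

A rule this simple only separates "stays low" from "recovers"; with 2 questions per day, a longer window would be needed before drawing firmer conclusions.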

Data source: YZ Index | Run #182

Related Reviews

Winzheng Index Smoke Evaluation: Claude Sonnet 4.6 Leads with 99.78 Points, GPT Series Stuck at 74 Points

Winzheng Index Doubao Pro Smoke Evaluation Main Ranking Plunges 9.9 Points, Code Execution Halved from 100 to 50

Winzheng Index Claude Opus 4.7 Scores 100 to Claim Crown, 9 Models See Code Execution Plummet by 50 Points

Winzheng Index Gemini 2.5 Pro Code Execution Plunges 45 Points, Smoke Main Score Drops 19.3 in One Day