In the YZ Index June 2026 evaluation of 11 models, Claude Sonnet 4.6's code execution score in the Smoke evaluation dropped from 100.00 yesterday to 50.00 today, while its main score fell from 79.44 to 72.50.
Direct Data on a One-Day 50-Point Drop
The Smoke evaluation consists of only 10 questions per day, with 2 questions per dimension. The code execution dimension went from a perfect score yesterday to half that today, while material constraints rose from 54.30 to 100.00, engineering judgment rose from 75.50 to 95.90, and task expression rose from 84.50 to 100.00. Three of the four dimensions rose significantly; only code execution fell, by 50 points, leaving the main score down 6.9 points on net.
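The day-over-day changes quoted above can be checked with a few lines of Python. This is only a sketch that restates the article's figures. The formula that combines dimension scores into the main score is not given in the source, so no attempt is made to reproduce it, and the names `scores` and `deltas` are illustrative.

```python
# Scores quoted in the article for Claude Sonnet 4.6: (yesterday, today).
scores = {
    "code execution": (100.00, 50.00),
    "material constraints": (54.30, 100.00),
    "engineering judgment": (75.50, 95.90),
    "task expression": (84.50, 100.00),
    "main score": (79.44, 72.50),
}


def deltas(table):
    """Return {name: today - yesterday}, rounded to two decimals."""
    return {
        name: round(today - yesterday, 2)
        for name, (yesterday, today) in table.items()
    }


changes = deltas(scores)
for name, change in changes.items():
    # Signed output makes the direction of each change explicit.
    print(f"{name:22s} {change:+7.2f}")
```

Only code execution and the main score come out negative, which matches the article's reading that a single dimension accounts for the decline.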
Question Sampling Volatility or True Degradation
Questions in the Smoke evaluation are randomly selected each day, so the standard deviation of daily scores is inherently large. Claude Sonnet 4.6 achieving perfect scores in both material constraints and task expression indicates that the model performed stably on constraint and expression questions within the same evaluation, with only the code execution dimension showing an extreme low. This is more likely a local fluctuation caused by question sampling rather than an overall degradation of the model's capabilities.
The code execution dimension has only 2 questions, and a single mistake can cause a score drop of 50 points.
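To illustrate how coarse a 2-question dimension is, the sketch below lists the possible scores and their probabilities under a simple binomial model. It assumes each question is scored pass/fail, that questions are independent, and that the model has a fixed per-question success rate `p`. None of these assumptions comes from the source, and scores such as 54.30 elsewhere in the article suggest partial credit exists, so this is a deliberate simplification. The value `p = 0.9` is hypothetical.

```python
from math import comb


def score_distribution(p, n_questions=2):
    """Map each possible dimension score (0-100) to its probability.

    Assumes binary pass/fail scoring and independent questions,
    which is a simplification not confirmed by the source.
    """
    return {
        100.0 * k / n_questions: comb(n_questions, k) * p**k * (1 - p) ** (n_questions - k)
        for k in range(n_questions + 1)
    }


# Hypothetical per-question success rate of 90%.
dist = score_distribution(0.9)
for score, prob in dist.items():
    print(f"score {score:5.1f}: probability {prob:.2f}")
```

Under these assumptions, a model that answers 90% of such questions correctly would still post a score of 50 or lower on roughly one day in five, purely from sampling.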
Is Continued Monitoring Warranted?
The main score dropped by only 6.9 points, and the integrity rating remains "pass"; current data is insufficient to establish systematic degradation. However, the code execution dimension fell from a perfect score to 50.00 in a single day, a swing that exceeds the normal sampling range. If the same dimension remains low tomorrow, monitoring frequency should be increased.
Combining today's full set of scores, Claude Sonnet 4.6 shows a clear dimensional divergence in the Smoke evaluation: material constraints and task expression reached peak values, while code execution hit an extreme low. The slight decline in the main score is primarily driven by the single dimension of code execution, not a multi-dimensional simultaneous drop.
This model requires continued observation of score distribution in the same dimension over the next 2-3 days to distinguish between random fluctuations and genuine capability changes.
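One way to make the multi-day observation concrete is to count consecutive low days and compare that streak with what sampling noise alone would produce. The sketch below is hypothetical: the threshold of 50.0, the per-question success rate `p`, and the pass/fail, independent-question model are all assumptions rather than part of the YZ Index methodology.

```python
def trailing_low_days(daily_scores, threshold=50.0):
    """Count how many of the most recent days scored at or below `threshold`."""
    count = 0
    for score in reversed(daily_scores):
        if score > threshold:
            break
        count += 1
    return count


def chance_of_streak(p, days, n_questions=2):
    """Probability that `days` independent days each contain at least one miss.

    Assumes pass/fail questions answered correctly with probability `p`.
    """
    return (1 - p**n_questions) ** days


# Code execution scores so far: yesterday 100.00, today 50.00.
history = [100.0, 50.0]
streak = trailing_low_days(history)
print(f"low streak: {streak} day(s)")
print(f"chance of a 3-day low streak at p=0.9: {chance_of_streak(0.9, 3):.4f}")
```

With a hypothetical `p` of 0.9, a single low day is unremarkable, while three low days in a row would be rare under noise alone, which is why a 2-3 day window can separate the two explanations.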
Data source: YZ Index | Run #182
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original article when reposting.