In the YZ Index Smoke Evaluation for June 2026, Claude Opus 4.7's main score fell from 100.00 yesterday to 84.01 today, and its code execution score fell from 100.00 to 72.80.
Core Dimension Breakdown
This Smoke Evaluation contains only 10 questions, and the code execution score is determined by just 2 of them. The drop in code execution from a perfect 100.00 to 72.80 indicates that the model made a clear mistake on at least one of those two questions. Material adherence slipped from 100.00 to 97.70, a decline of only 2.30 points, so the model's ability to follow supplied materials remains high. Engineering judgment rose from 90.90 to 100.00, while task expression fell from 98.60 to 91.90.
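To make the day-over-day changes easier to compare, the short Python sketch below recomputes each delta from the figures quoted above. The dictionary keys and the `deltas` helper are illustrative names chosen here, not part of any YZ Index tooling.

```python
# Scores as reported in the article: (yesterday, today).
# The key names are illustrative labels, not official YZ Index identifiers.
scores = {
    "main": (100.00, 84.01),
    "code_execution": (100.00, 72.80),
    "material_adherence": (100.00, 97.70),
    "engineering_judgment": (90.90, 100.00),
    "task_expression": (98.60, 91.90),
}


def deltas(table):
    """Return today's score minus yesterday's, rounded to 2 decimals."""
    return {name: round(today - prev, 2) for name, (prev, today) in table.items()}


changes = deltas(scores)

# Print the dimensions from largest drop to largest gain.
for name, change in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {change:+.2f}")
```

Sorting by change puts code execution first as the largest single decline, while engineering judgment is the only dimension that moved upward.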
Nature of the Fluctuation
Because the Smoke Evaluation uses only 2 questions per dimension per day, day-to-day scores are inherently volatile. Claude Opus 4.7's material adherence was almost unchanged and its engineering judgment actually improved, which suggests that the model's overall ability has not systematically deteriorated. The drop is more likely an incidental fluctuation caused by question sampling. With only 2 questions in the code execution dimension, a single high-difficulty or ambiguously worded question is enough to cause a sharp 27.2-point drop.
The stability score of 31.7 had already signaled that this model shows high score variance across similar questions, and this Smoke result is consistent with that indicator.
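The sensitivity of a two-question dimension can be illustrated with a small Python sketch. It assumes, purely for illustration, that a dimension score is the unweighted mean of per-question scores on a 0 to 100 scale. The article does not state how YZ Index actually aggregates scores, so the function names and the resulting numbers are hypothetical.

```python
def dimension_score(question_scores):
    """Unweighted mean of per-question scores (an assumed aggregation rule)."""
    return sum(question_scores) / len(question_scores)


def implied_other_score(dimension, known, n_questions=2):
    """Given the dimension mean and one known question score,
    solve for the remaining question's score (two-question case)."""
    return round(n_questions * dimension - known, 2)


def max_single_question_swing(n_questions, scale=100.0):
    """Largest change one question can cause in an unweighted mean."""
    return scale / n_questions


# If one code execution question still scored 100, the other would need to
# score 45.6 for the mean to land on the reported 72.80 (under the assumption).
other = implied_other_score(72.80, 100.0)
print(other)
print(round(dimension_score([100.0, other]), 2))

# One question can move a 2-question mean by up to 50 points,
# versus 5 points for a hypothetical 20-question dimension.
print(max_single_question_swing(2), max_single_question_swing(20))
```

Under this assumption, a single partially failed question fully accounts for the 27.2-point drop, which fits the reading that the change reflects sampling noise rather than a broad regression.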
Whether Continued Monitoring Is Needed
A single day of Smoke data is not enough to establish genuine model degradation. The recommended approach is to watch the same dimension for 3-5 consecutive days and to trigger a deep evaluation only if code execution stays below 85 points and material adherence also declines. If the anomaly is confined to a single day, it should not be overinterpreted.
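The escalation rule described above can be sketched as a small Python check. The function name and the multi-day histories are hypothetical. Reading "material adherence also declines" as "the last value in the window is lower than the first" is an assumption, as is using the lower bound of 3 days as the default window.

```python
def should_trigger_deep_eval(code_exec, material, threshold=85.0, min_days=3):
    """Return True only when both conditions from the article hold over the
    most recent `min_days` daily scores (oldest first):
      1. code execution stays below `threshold` on every day, and
      2. material adherence ends the window lower than it started
         (an assumed interpretation of "also declines").
    """
    if len(code_exec) < min_days or len(material) < min_days:
        return False  # not enough days observed yet
    recent_code = code_exec[-min_days:]
    recent_material = material[-min_days:]
    code_stays_low = all(score < threshold for score in recent_code)
    material_declined = recent_material[-1] < recent_material[0]
    return code_stays_low and material_declined


# Today's single data point from the article: too little history to escalate.
print(should_trigger_deep_eval([72.80], [97.70]))

# Hypothetical histories, invented only to exercise the rule.
print(should_trigger_deep_eval([72.80, 80.00, 79.00], [97.70, 96.00, 95.00]))
print(should_trigger_deep_eval([72.80, 95.00, 98.00], [97.70, 98.00, 99.00]))
```

Requiring both conditions keeps a one-day dip, or a recovery on the second day, from triggering a costly deep evaluation.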
The integrity rating currently remains at pass. Although the model's response consistency has shortcomings, it has not reached the admission threshold. In fact, Claude Opus 4.7 set a new single-day high on the engineering judgment dimension, indicating that it remains competitive on tasks requiring multi-step reasoning.
Data source: YZ Index | Run #205
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.