In today's Smoke evaluation of the YZ Index, Claude Opus 4.7's main score dropped from 97.12 to 71.47, a decline of 25.7 points.
Core Dimension Changes
The code execution dimension halved, from 100.00 yesterday to 50.00. The other reported dimensions rose: material constraints from 93.60 to 97.70, engineering judgment from 95.80 to 100.00, and task expression from 97.40 to 98.60. Under the index rules, the main score is a weighted combination of the code execution and material constraints dimensions only. Since material constraints improved, the decline in the main score comes entirely from code execution.
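The article does not state the weights behind the main score. As a rough check, the sketch below assumes the main score is a simple two-term weighted average, infers the code execution weight from yesterday's figures, and tests whether that weight reproduces today's main score. The function names and the linear form are assumptions for illustration, not the official YZ Index formula.

```python
# Scores reported in the article: (code execution, material constraints, main score).
yesterday = (100.00, 93.60, 97.12)
today = (50.00, 97.70, 71.47)


def infer_code_weight(code, material, main):
    """Solve main = w * code + (1 - w) * material for w (assumed linear form)."""
    return (main - material) / (code - material)


def main_score(code, material, w_code):
    """Weighted average of the two main dimensions (assumed form)."""
    return w_code * code + (1 - w_code) * material


# Weight inferred from yesterday's data; not an official figure.
w_code = infer_code_weight(*yesterday)

# Apply the inferred weight to today's dimension scores.
predicted_today = main_score(today[0], today[1], w_code)

print(f"inferred code execution weight: {w_code:.2f}")
print(f"predicted main score today: {predicted_today:.2f} (reported: {today[2]:.2f})")
```

With these figures the inferred weight comes out at roughly 0.55 for code execution and 0.45 for material constraints, and applying it to today's scores lands within rounding of the reported 71.47. On that weighting, the 50-point code execution drop costs 27.5 points, while the 4.1-point material constraints gain returns under 2.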
Analysis of Fluctuation Causes
The Smoke evaluation includes only 10 questions per day, 2 per dimension, so the sample is small and some day-to-day fluctuation from sampling is normal. Even so, the 50-point single-day loss in code execution dwarfs the 4.1-point gain in material constraints, which suggests that either question difficulty varied sharply or the model handles certain problem types inconsistently. The two secondary dimensions, engineering judgment and task expression, both rose slightly, so the model's performance on non-code tasks remains high.
If the decline mainly reflects which questions were sampled, it is one-off noise; if the model's handling of this class of code problems has shifted systematically, it may indicate real capability degradation. One day of data cannot distinguish between the two.
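To give a feel for how much noise two questions can produce, here is a minimal sketch under a simplifying assumption that the source does not confirm: each code execution question is scored pass/fail, and questions are independent with a fixed per-question pass rate. The pass rates tried below are hypothetical.

```python
from math import comb


def prob_at_most(n_questions, max_passes, p_pass):
    """P(at most `max_passes` of `n_questions` pass) under a binomial model."""
    return sum(
        comb(n_questions, k) * p_pass**k * (1 - p_pass) ** (n_questions - k)
        for k in range(max_passes + 1)
    )


# With 2 pass/fail questions, a dimension score of 50 or lower means
# at most 1 of the 2 questions passed.
# Hypothetical per-question pass rates; none come from the source.
chance_of_50_or_lower = {
    p: prob_at_most(n_questions=2, max_passes=1, p_pass=p)
    for p in (0.99, 0.95, 0.90)
}

for p, chance in chance_of_50_or_lower.items():
    print(f"pass rate {p:.2f}: P(score <= 50) = {chance:.4f}")
```

Under this model, even a model that passes 95% of such questions would score 50 or lower on roughly one day in ten, which is why a single reading cannot separate sampling noise from a real shift.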
Whether Continued Attention Is Needed
The code execution dimension feeds directly into the main ranking, and this drop has pushed Claude Opus 4.7 noticeably down the main chart. We recommend watching this dimension closely over the next 3-5 days of Smoke evaluations. Only if it stays below 70 points across those runs should the drop be treated as actual degradation. The integrity rating remains "pass", indicating that the model has not shown basic problems such as refusals or formatting errors.
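The monitoring rule can be sketched as a simple check over recent daily scores. The function name, the default of 3 consecutive days, and the sample series are all assumptions for illustration; the article gives only the 70-point threshold and a 3-5 day window.

```python
def sustained_drop(scores, threshold=70.0, min_days=3):
    """Return True if the most recent `min_days` scores are all below `threshold`.

    `scores` is a list of daily dimension scores, oldest first.
    """
    if len(scores) < min_days:
        return False  # not enough data to call it a trend
    return all(score < threshold for score in scores[-min_days:])


# The two code execution scores reported so far: one low day is not a trend.
reported = [100.00, 50.00]
print(sustained_drop(reported))

# Made-up continuation, for illustration only: three low days in a row.
hypothetical = [100.00, 50.00, 50.00, 50.00]
print(sustained_drop(hypothetical))
```

Requiring consecutive low days, rather than reacting to any single low reading, matches the article's point that one day of data is only a signal.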
The stability dimension measures the standard deviation of the model's scores across repeated attempts at similar questions. Claude Opus 4.7's sharp single-day swing reflects reduced consistency, which is a separate matter from accuracy itself.
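The distinction between consistency and accuracy can be illustrated with two made-up score series that share the same mean but differ in spread. The values are hypothetical, and whether the YZ Index uses the population or the sample standard deviation is not stated; the sketch assumes the population form.

```python
from statistics import mean, pstdev

# Hypothetical scores from repeated attempts at similar questions.
steady_runs = [80.0, 80.0, 80.0]
erratic_runs = [60.0, 80.0, 100.0]

for name, runs in (("steady", steady_runs), ("erratic", erratic_runs)):
    # Same mean (accuracy proxy), different standard deviation (consistency proxy).
    print(f"{name}: mean = {mean(runs):.2f}, std dev = {pstdev(runs):.2f}")
```

Both series average 80, yet only the second shows spread, so a stability measure built on standard deviation can move while average accuracy stays put.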
A single day of Smoke data can only provide a signal; continuous tracking is needed to confirm a trend.
Data source: YZ Index | Run #201
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.