Gemini 2.5 Pro's main score in the YZ Index Smoke evaluation dropped from 89.79 yesterday to 70.53 today, a decline of 19.3 points. Behind that headline number, the code execution dimension fell from 100.00 to 55.00, while the material constraint dimension rose from 77.30 to 89.50.
Data Breakdown: Single Dimension Dominated the Drop
The main score is composed of only two dimensions: code execution and material constraint. Code execution fell 45 points, from 100.00 yesterday to 55.00 today, and that drop is what pulled the main score down. Material constraint rose by 12.2 points, which was not enough to offset the loss. Engineering judgment edged down from 84.00 to 82.00 and task expression rose from 86.00 to 90.00. Both moved by less than 5 points, and neither is among the two dimensions that make up the main score, so their effect on it is limited at most.
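The two reported main scores are consistent with a fixed weighted average of the two dimensions. The sketch below checks that arithmetic under an assumed weighting of 0.55 for code execution and 0.45 for material constraint. These weights are inferred from the two days of figures in this report, not taken from any YZ Index documentation, and the function name is hypothetical.

```python
# Assumed weights, inferred from the two reported days.
# This is NOT an official YZ Index formula.
CODE_EXECUTION_WEIGHT = 0.55
MATERIAL_CONSTRAINT_WEIGHT = 0.45


def main_score(code_execution: float, material_constraint: float) -> float:
    """Weighted average of the two dimensions under the assumed weights."""
    return (CODE_EXECUTION_WEIGHT * code_execution
            + MATERIAL_CONSTRAINT_WEIGHT * material_constraint)


yesterday = main_score(100.00, 77.30)  # reported main score: 89.79
today = main_score(55.00, 89.50)       # reported main score: 70.53

# Contribution of each dimension to the day-over-day change.
code_effect = CODE_EXECUTION_WEIGHT * (55.00 - 100.00)
material_effect = MATERIAL_CONSTRAINT_WEIGHT * (89.50 - 77.30)

print(f"yesterday={yesterday:.3f} today={today:.3f}")
print(f"code effect={code_effect:.2f} material effect={material_effect:.2f}")
```

Under this assumed weighting, the code execution drop contributes roughly 24.75 points of decline and the material constraint gain recovers roughly 5.49, which nets out to the reported fall of about 19.3 points.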
Smoke Evaluation Characteristics and Causes of Volatility
The Smoke evaluation uses only 10 questions per day, with 2 questions per dimension. With so few questions, each day's score is a noisy estimate, so large day-to-day swings are expected. Even so, the fall in code execution from a perfect score to 55.00 exceeds the typical fluctuation range seen in past quick evaluations of this kind. The change could reflect a difference in difficulty introduced by question sampling, or it could indicate inconsistent model output on specific programming tasks. Single-day data alone cannot distinguish between the two.
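To illustrate why two questions per dimension produce noisy scores, the sketch below computes the standard error of a mean score for different question counts. The per-question standard deviation of 30 points is a hypothetical value chosen for illustration, not a figure from YZ Index, and the calculation assumes the questions are scored independently.

```python
import math


def standard_error(per_question_sd: float, n_questions: int) -> float:
    """Standard error of a mean score over n independent questions."""
    return per_question_sd / math.sqrt(n_questions)


# Hypothetical per-question spread, in score points (an assumption).
ASSUMED_SD = 30.0

for n in (2, 10, 50):
    se = standard_error(ASSUMED_SD, n)
    print(f"n={n:>2}: standard error is about {se:.1f} points")
```

Under these assumptions, a dimension score averaged over 2 questions is several times less stable than one averaged over 50, which is why a single Smoke result is weak evidence on its own.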
Today's material constraint score of 89.50, up from 77.30 yesterday, points the other way: on this dimension the model scored better at adhering to material restrictions. The opposite movements in the two core dimensions suggest that today's result is not a systemic regression in the model's overall capability, but rather a dimension-specific effect tied to the test questions.
Should It Be a Cause for Concern?
The single-day drop of 19.3 points in the main score is large by the standards of Smoke's evaluation history, but it is one day's move, not a run of consecutive declines in the same direction. The integrity rating remains at "pass," indicating no new issues in the model's basic compliance. We recommend tracking the score distribution of the code execution dimension over the next 3–5 Smoke evaluation cycles. Deeper, multi-question, long-cycle testing should be initiated only if scores consistently fall below 70 points.
The current data only shows one anomalous fluctuation and does not yet constitute sufficient evidence of genuine model degradation.
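The tracking rule recommended above can be sketched as a simple check. This is one possible reading, assuming "consistently" means every tracked cycle scores below the threshold and that at least 3 cycles have been observed. The function name and the sample scores are hypothetical.

```python
def should_escalate(recent_scores, threshold=70.0, min_cycles=3):
    """Return True only when at least `min_cycles` scores are available
    and every one of them is below `threshold`."""
    if len(recent_scores) < min_cycles:
        return False  # not enough cycles observed yet
    return all(score < threshold for score in recent_scores)


# Hypothetical code execution scores from later Smoke cycles.
print(should_escalate([55.0]))              # a single day is not enough
print(should_escalate([55.0, 62.0, 68.5]))  # three cycles, all below 70
print(should_escalate([55.0, 90.0, 65.0]))  # one cycle recovered above 70
```

A stricter or looser rule (for example, a majority of cycles below the threshold) would be an equally reasonable interpretation. The point is that the decision waits for several cycles instead of reacting to one.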
Data source: YZ Index | Run #170
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.