Gemini 2.5 Pro Code Execution Plunges 45 Points, Smoke Main Score Drops 19.3 in One Day

Jun 14, 2026 - Winzheng Index

Gemini 2.5 Pro, Code Execution, Smoke Test, Single-Day Volatility, Model Consistency

Gemini 2.5 Pro's main score in the YZ Index Smoke evaluation dropped from 89.79 yesterday to 70.53 today, a decline of 19.3 points (19.26 before rounding). The code execution dimension fell from 100.00 to 55.00, while the material constraint dimension rose from 77.30 to 89.50.

Data Breakdown: Single Dimension Dominated the Drop

The main score is composed of only two dimensions: code execution and material constraint. Code execution scored 55.00 today, down 45 points from yesterday's 100.00, and this drop accounts for the decline in the main score. Material constraint rose by 12.2 points, which was not enough to offset the loss. Among the other dimensions, engineering judgment slipped from 84.00 to 82.00 and task expression rose from 86.00 to 90.00; both changes are under 5 points and had limited impact on the main score.
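The arithmetic behind this breakdown can be checked directly. The sketch below assumes the main score is a fixed weighted average of the two dimensions, which the source does not state; the weight is solved from each day's reported figures and then used to split the change by dimension. The function and variable names are illustrative.

```python
# Assumption: main = w * code_execution + (1 - w) * material_constraint.
# The source does not publish the weighting; w is inferred from the reported scores.

def implied_weight(main, code_exec, material):
    """Solve main = w*code_exec + (1 - w)*material for w."""
    return (main - material) / (code_exec - material)

yesterday = {"main": 89.79, "code_exec": 100.00, "material": 77.30}
today = {"main": 70.53, "code_exec": 55.00, "material": 89.50}

w_yesterday = implied_weight(**yesterday)
w_today = implied_weight(**today)

# Split the day-over-day change by dimension, using the rounded implied weight.
w = round(w_yesterday, 2)
code_contribution = w * (today["code_exec"] - yesterday["code_exec"])
material_contribution = (1 - w) * (today["material"] - yesterday["material"])

print(f"implied weight yesterday: {w_yesterday:.3f}, today: {w_today:.3f}")
print(f"code execution contribution: {code_contribution:+.2f}")
print(f"material constraint contribution: {material_contribution:+.2f}")
print(f"net change: {code_contribution + material_contribution:+.2f}")
```

Under this assumption both days imply a weight of about 0.55 for code execution, and the two contributions (about -24.75 and +5.49) net out to about -19.26, which matches the reported drop.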

Smoke Evaluation Characteristics and Causes of Volatility

The Smoke evaluation uses only 10 questions per day, 2 per dimension. With so few questions, a single hard or easy item moves a dimension score substantially, so day-to-day swings are naturally large. Even so, a drop from a perfect score to 55.00 exceeds the fluctuation typically seen in comparable quick evaluations. The change could stem from differences in question difficulty introduced by sampling, or it could point to inconsistent model output on specific programming tasks. Single-day data alone cannot distinguish between the two.
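To illustrate why two questions per dimension makes daily scores jumpy, the sketch below computes the standard error of a mean score for different question counts. The per-question standard deviation of 30 points is a hypothetical value chosen for illustration, not a figure from the YZ Index, and questions are assumed to be scored independently.

```python
import math

def standard_error(per_question_sd, n_questions):
    """Standard error of the mean of n independent question scores."""
    return per_question_sd / math.sqrt(n_questions)

HYPOTHETICAL_SD = 30.0  # assumed spread of a single question's score, in points

for n in (2, 10, 50):
    se = standard_error(HYPOTHETICAL_SD, n)
    print(f"{n:>2} questions: standard error ~ {se:.1f} points")
```

With the assumed spread, a two-question dimension has a standard error of roughly 21 points, five times that of a 50-question run, because the error shrinks with the square root of the question count.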

Today's material constraint score of 89.50, up from yesterday's 77.30, indicates better adherence to material restrictions on today's questions. That the two core dimensions moved in opposite directions further suggests that today's result is not a systemic regression in the model's overall capability but a dimension-specific effect of the test questions.

Should It Be a Cause for Concern?

The single-day main score drop of 19.3 points is relatively large in Smoke's evaluation history, but it is a single data point, not a multi-day decline in the same direction. The integrity rating remains "pass," indicating no new issues in the model's basic compliance. The recommended next step is to track the score distribution of the code execution dimension over the next 3–5 Smoke evaluation cycles. Only if scores consistently fall below 70 should deeper, multi-question, long-cycle testing be initiated.
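The tracking rule above can be sketched as a small check. It reads "consistently below 70" as every score in a window of at least 3 cycles falling below the threshold, which is one possible interpretation rather than the index's stated definition; the function name and the sample scores are hypothetical.

```python
def needs_deep_testing(scores, threshold=70.0, min_cycles=3):
    """Return True if at least min_cycles scores are available and all are below threshold."""
    if len(scores) < min_cycles:
        return False  # not enough cycles observed yet to judge
    return all(s < threshold for s in scores)

# Hypothetical follow-up code execution scores, for illustration only.
rebound = [55.0, 90.0, 100.0, 85.0]
sustained = [55.0, 60.0, 65.0, 50.0]

print(needs_deep_testing(rebound))    # False
print(needs_deep_testing(sustained))  # True
print(needs_deep_testing([55.0]))     # False
```

A stricter or looser reading, such as a mean below 70 or a majority of cycles below 70, would change only the final condition.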

The current data only shows one anomalous fluctuation and does not yet constitute sufficient evidence of genuine model degradation.

Data source: YZ Index | Run #170
