Gemini 2.5 Pro Plunges 28 Points on Main Leaderboard, Code Execution Halved from 100

Jun 22, 2026 | Source: Winzheng Index

Gemini 2.5 Pro | Code Execution | Smoke Test | Single-Day Volatility | Model Stability

Gemini 2.5 Pro's main leaderboard score on the YZ Index June 2026 Smoke Benchmark dropped from 99.28 yesterday to 71.33 today, a single-day decline of 27.95 points (about 28). The Code Execution dimension fell from 100.00 to 50.00, making it the dominant cause of the decline; Material Constraint slipped by 1 point.

Score Breakdown: Single Dimension Decides the Outcome

The Smoke Benchmark consists of only 2 questions per dimension per day. A Code Execution score of 50.00 means at least one of the two questions was not passed, which cost the dimension 50 points. Material Constraint slipped from 98.40 to 97.40, a decline of 1 point. Engineering Judgment held at 100.00, and Task Expression rose from 96.30 to 100.00. The main leaderboard score is derived from a weighted combination of Code Execution and Material Constraint, so the 50-point plunge fed directly into the overall score.
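The leaderboard weights are not published in this report, but the two days of figures are enough to back them out. The sketch below assumes the main score is a two-term weighted average whose weights sum to 1; that form and the function names are assumptions for illustration, not the YZ Index implementation.

```python
def solve_code_weight(main, code_exec, material):
    """Solve main = w * code_exec + (1 - w) * material for w."""
    return (main - material) / (code_exec - material)


def main_score(code_exec, material, w):
    """Weighted average of the two dimensions (assumed form)."""
    return w * code_exec + (1 - w) * material


# Scores reported in the article.
yesterday = {"main": 99.28, "code_exec": 100.00, "material": 98.40}
today = {"main": 71.33, "code_exec": 50.00, "material": 97.40}

w_yesterday = solve_code_weight(**yesterday)
w_today = solve_code_weight(**today)
print(f"implied Code Execution weight: {w_yesterday:.2f} / {w_today:.2f}")

# Split the total decline into the part due to each dimension.
w = 0.55
code_part = w * (yesterday["code_exec"] - today["code_exec"])
material_part = (1 - w) * (yesterday["material"] - today["material"])
print(f"Code Execution: {code_part:.2f}, Material Constraint: {material_part:.2f}")
print(f"total decline: {code_part + material_part:.2f}")
```

Both days imply the same weight, 0.55 for Code Execution and 0.45 for Material Constraint, which reproduces 99.28 and 71.33 exactly. Under that weighting, Code Execution accounts for 27.50 of the 27.95-point decline and Material Constraint for the remaining 0.45.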

Analysis of the Source of Volatility

The Smoke Benchmark has a small sample size, so random question selection is the most probable cause. Code Execution tasks are sensitive to the difficulty of specific problems: with only two questions per day, a single high-complexity programming question can cause a 50-point gap. Confirming genuine model degradation would require consecutive days of systematic failures on similar tasks; single-day data is insufficient to support that conclusion.
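To see how noisy a two-question sample is, the following sketch computes the distribution of a daily dimension score for a model with a fixed per-question pass rate. It assumes questions are scored pass/fail, are independent, and carry equal weight; the pass rates are hypothetical and are not measurements of Gemini 2.5 Pro.

```python
from math import comb


def daily_score_distribution(p, n_questions=2):
    """Probability of each possible dimension score on a single day.

    p is a hypothetical per-question pass rate; questions are assumed
    independent and equally weighted.
    """
    dist = {}
    for passed in range(n_questions + 1):
        score = 100.0 * passed / n_questions
        prob = comb(n_questions, passed) * p**passed * (1 - p) ** (n_questions - passed)
        dist[score] = prob
    return dist


for p in (0.95, 0.90, 0.80):
    dist = daily_score_distribution(p)
    p_drop = 1 - dist[100.0]
    print(f"pass rate {p:.2f}: P(score below 100 on a given day) = {p_drop:.3f}")
```

Even a model that passes 90% of such questions would score below 100 on about 19% of days under these assumptions, so an isolated 50 says little about whether capability has changed.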

Material Constraint dropped by only 1 point, indicating that the model's control over instruction following and content boundaries remains stable. No declines were observed in Engineering Judgment or Task Expression, further suggesting no overall regression in core capabilities.

Is Sustained Attention Needed?

This decline falls within the normal fluctuation range of small-sample rapid testing. The recommended approach is to watch the Code Execution score over the next 3–5 consecutive test days and to initiate an in-depth evaluation only if the dimension persistently stays below 80 points. The current integrity rating remains "pass," and no access warnings have been triggered.
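The monitoring rule can be sketched as a simple check over recent daily scores. Here "persistently" is read as every day in the window falling below the threshold, which is one possible interpretation; the function name and the score histories beyond the two reported days are hypothetical.

```python
def needs_deep_evaluation(daily_scores, threshold=80.0, min_days=3):
    """Return True if the latest `min_days` scores are all below `threshold`.

    daily_scores is ordered oldest to newest. With fewer than `min_days`
    observations there is not enough data, so the check returns False.
    """
    if len(daily_scores) < min_days:
        return False
    return all(score < threshold for score in daily_scores[-min_days:])


# Only the two reported days: not enough data to decide.
print(needs_deep_evaluation([100.0, 50.0]))                # False
# Hypothetical recovery on the following days.
print(needs_deep_evaluation([100.0, 50.0, 100.0, 100.0]))  # False
# Hypothetical sustained failures.
print(needs_deep_evaluation([100.0, 50.0, 50.0, 0.0]))     # True
```

Raising `min_days` toward 5 makes the trigger more conservative, which suits a benchmark with only two questions per day.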

A single-day 28-point fluctuation is not uncommon in Smoke Benchmark history; the key is to distinguish random events from capability degradation.

Data source: YZ Index | Run #191

Related Reviews

Winzheng Index Gemini 2.5 Pro Code Execution Plunges 45 Points, Smoke Main Score Drops 19.3 in One Day

Winzheng Index Claude Sonnet 4.6 Code Execution Plunges from 100 to 50, Main Score Drops 6.9 Points

Winzheng Index ERNIE Bot 4.5 Code Execution Plummets from 100 to 50, Main Leaderboard Drops 11 Points in a Single Day

Winzheng Index Gemini 3.1 Pro Code Execution Plunges 80 Points, Main Rankings Drop 33.5 in a Single Day