Gemini 2.5 Pro Plunges 28 Points on Main Leaderboard, Code Execution Halved from 100

Gemini 2.5 Pro's main leaderboard score on the YZ Index June 2026 Smoke Benchmark dropped from 99.28 yesterday to 71.33 today, a single-day decline of 27.95 points. The Code Execution dimension fell from 100.00 to 50.00, accounting for nearly all of the decline.

Score Breakdown: Single Dimension Decides the Outcome

The Smoke Benchmark consists of only 2 questions per dimension per day. In Code Execution, at least one of the two questions was not passed, which cost the dimension 50 points. Material Constraint slipped from 98.40 to 97.40, a decline of only 1 point. Engineering Judgment remained at 100.00, and Task Expression rose from 96.30 to 100.00. The main leaderboard score is derived from a weighted combination of Code Execution and Material Constraint, so the 50-point plunge in Code Execution fed directly into the overall score.
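
The source does not publish the weights behind the main leaderboard, but the reported scores allow a quick consistency check. The sketch below assumes the main score is a linear blend, w * Code Execution + (1 - w) * Material Constraint, with weights summing to 1. That formula is an assumption rather than a documented one, and the function names are illustrative. It solves for w from yesterday's figures and then tests whether the same w reproduces today's score.

```python
# Assumption: main = w * code_exec + (1 - w) * material, weights sum to 1.
# Names here are illustrative and not part of the YZ Index.

def infer_weight(main: float, code_exec: float, material: float) -> float:
    """Solve main = w*code_exec + (1-w)*material for w."""
    return (main - material) / (code_exec - material)

def blend(w: float, code_exec: float, material: float) -> float:
    """Apply the assumed two-dimension weighted combination."""
    return w * code_exec + (1 - w) * material

# Yesterday's reported scores are used to infer the weight.
w = infer_weight(main=99.28, code_exec=100.00, material=98.40)

# Today's dimension scores are then blended with that same weight.
predicted_today = blend(w, code_exec=50.00, material=97.40)

print(f"inferred Code Execution weight: {w:.2f}")
print(f"predicted main score today: {predicted_today:.2f}")
```

Under that assumption the inferred weight is about 0.55, and the predicted score agrees with the reported 71.33, which is consistent with Code Execution carrying slightly more than half of the main score.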

Analysis of the Source of Volatility

With only two questions per dimension per day, the Smoke Benchmark has a very small sample, so random question selection is the most probable cause. Code Execution scores are sensitive to the difficulty of individual problems: a single high-complexity programming question can open a 50-point gap. Confirming genuine model degradation would require systematic failures on similar tasks over consecutive days; single-day data is insufficient to support that conclusion.
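
To illustrate how easily a two-question sample can produce a 50-point drop, the sketch below computes the chance of failing at least one of n questions for a model with a given per-question pass rate. It assumes the questions are independent and scored all-or-nothing; both are simplifying assumptions, and the pass rates are illustrative rather than measured.

```python
def prob_at_least_one_failure(pass_rate: float, n_questions: int = 2) -> float:
    """P(at least one failure) = 1 - pass_rate ** n, assuming independence."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return 1.0 - pass_rate ** n_questions

# Hypothetical per-question pass rates, chosen only for illustration.
for p in (0.99, 0.95, 0.90):
    chance = prob_at_least_one_failure(p)
    print(f"pass rate {p:.2f}: {chance:.1%} chance of losing a question")
```

Even a model that passes 90% of such questions would, under these assumptions, miss at least one of the two on roughly one day in five.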

Material Constraint dropped by only 1 point, indicating that the model's control over instruction following and content boundaries remains stable. No declines were observed in Engineering Judgment or Task Expression, further suggesting no overall regression in core capabilities.

Is Sustained Attention Needed?

This decline falls within the normal fluctuation range of small-sample rapid testing. The recommended next step is to watch the Code Execution score over the next 3–5 consecutive daily runs; if the dimension stays below 80 points throughout, an in-depth evaluation should be initiated. The current integrity rating remains "pass," with no access warnings triggered.
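
The monitoring rule above can be sketched as a small check. This version reads "persistently below 80" as every score in the most recent observation window falling below the threshold, which is only one possible interpretation. The function name, the default window of 3 runs, and the sample scores are hypothetical.

```python
from typing import Sequence

def needs_deep_evaluation(scores: Sequence[float],
                          threshold: float = 80.0,
                          min_runs: int = 3) -> bool:
    """Return True only if the last min_runs scores are all below threshold."""
    if len(scores) < min_runs:
        return False  # not enough consecutive runs observed yet
    recent = scores[-min_runs:]
    return all(score < threshold for score in recent)

# Hypothetical Code Execution score histories, oldest first.
print(needs_deep_evaluation([50.0]))                # False: only one run so far
print(needs_deep_evaluation([50.0, 100.0, 100.0]))  # False: score recovered
print(needs_deep_evaluation([50.0, 50.0, 50.0]))    # True: below 80 on every run
```

Requiring the whole window to fall below the threshold keeps a single unlucky draw from triggering an evaluation, which matches the aim of separating random events from capability degradation.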

A single-day 28-point fluctuation is not uncommon in Smoke Benchmark history; the key is to distinguish random events from capability degradation.


Data source: YZ Index | Run #191