Gemini 2.5 Pro Plummets 22.6 Points on Mainboard, Engineering Judgment Halved

Gemini 2.5 Pro dropped 22.6 points on the mainboard in today's Smoke evaluation. The core execution dimension fell from 100 to 95, and material constraints also declined slightly. What looks like normal fluctuation reveals deeper underlying issues.

Fluctuation or True Degradation?

The Smoke evaluation covers only 10 questions per day, 2 per dimension, so a single day's score has an inherently large standard deviation. This time, however, engineering judgment plummeted from 66.7 to 30 and task expression fell from 50 to 10, drops far exceeding the historical average. The slight declines in execution and material constraints can be attributed to question-difficulty sampling, but the simultaneous collapse of the two side dimensions suggests that the model's answers become much less consistent on questions that require engineering trade-offs or clear task output.
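To put the quoted numbers side by side, the following minimal sketch computes the absolute and relative drop for each dimension whose before and after scores are given above. The function name `drop_summary` and the dictionary layout are illustrative choices, not part of the YZ Index tooling. Material constraints is left out because the article gives no figures for it.

```python
# Before/after scores as quoted in the article (0-100 scale).
scores = {
    "execution": (100.0, 95.0),
    "engineering judgment": (66.7, 30.0),
    "task expression": (50.0, 10.0),
}


def drop_summary(before, after):
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


# Hypothetical helper structure: dimension name -> (absolute, relative).
summary = {name: drop_summary(b, a) for name, (b, a) in scores.items()}

for name, (absolute, relative) in summary.items():
    print(f"{name}: -{absolute} points ({relative}% relative)")
```

On these figures, execution loses 5% of its prior score, engineering judgment about 55%, and task expression 80%, which is why the two side dimensions stand apart from the small declines.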

The stability dimension of the YZ Index has already shown that the standard deviation of Gemini 2.5 Pro's scores has been relatively high recently, indicating inconsistent performance on similar problems. Today's side-dimension collapse is likely a concentrated manifestation of this instability rather than simply bad luck with the question draw.

What Industry Trends Confirm

Recently, Google's iteration focus for the Gemini series has been on safety alignment and refusal mechanisms. Multiple developers have reported that the model increasingly tends to give vague responses with excessive disclaimers when asked to provide specific engineering advice or compare multiple solutions. This "safety-first" adjustment directly impacts the two side-dimensions: engineering judgment and task expression.

Additionally, Gemini 2.5 Pro has begun to omit more intermediate steps in complex code execution scenarios, leading to small deductions in the execution dimension. The integrity rating's switch from pass to fail today more likely stems from the model contradicting itself or refusing to answer the core question on some problems.

Should It Be a Concern?

The mainboard is still dominated by code execution and material constraints. Although both dropped today, they remain at high levels, indicating that the model's foundational capabilities have not collapsed entirely. However, the drop of more than half in both engineering judgment and task expression, together with an outright fail in the integrity rating, exceeds the normal range of sampling fluctuation.

For users who rely on Gemini for engineering solution design or structured output, today's data sends a clear signal: under the current version, the model's consistency has significantly decreased in scenarios that require complex judgment and clear expression. In the short term, it is advisable to lower the trust weight for its engineering decision outputs and wait for the next version update or retesting with a larger sample.
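The following sketch illustrates why retesting with a larger sample helps. It relies on the textbook formula for the standard error of a mean and assumes independent questions. The per-question standard deviation `ASSUMED_SD` is a hypothetical value chosen for illustration, not a figure from the YZ Index.

```python
import math


def standard_error(per_question_sd, n_questions):
    """Standard error of a mean score over n independent questions."""
    return per_question_sd / math.sqrt(n_questions)


# Hypothetical per-question standard deviation on a 0-100 scale
# (an assumption for illustration, not measured data).
ASSUMED_SD = 30.0

# 2 questions matches the per-dimension Smoke sample; 8 and 32 are
# illustrative larger retest sizes.
for n in (2, 8, 32):
    print(f"n={n}: standard error ~ {standard_error(ASSUMED_SD, n):.1f} points")
```

Under this assumption, quadrupling the number of questions per dimension halves the uncertainty of the dimension score, so a retest with a larger sample would separate true degradation from noise far more reliably than a 2-question draw.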

A 22.6-point drop on the mainboard might be explained by sampling, but a collapse in engineering judgment from 66.7 to 30 and a failed integrity rating can no longer be written off as "luck."

Data source: YZ Index | Run #124