Gemini 2.5 Pro Plummets 22.6 Points on Mainboard, Engineering Judgment Halved

May 20, 2026 | Source: Winzheng Index

Tags: Gemini 2.5 Pro, Engineering Judgment, Smoke Test, Model Fluctuations, Integrity Rating

Gemini 2.5 Pro dropped 22.6 points on the mainboard in today's Smoke evaluation. The core execution dimension slipped from 100 to 95, and material constraints also declined slightly. On the surface this looks like normal fluctuation, but the side-dimension scores point to a deeper problem.

Fluctuation or True Degradation?

The Smoke evaluation covers only 10 questions per day, with 2 questions per dimension, so the standard deviation of a single day's score is inherently large. This time, however, engineering judgment fell from 66.7 to 30 (a 36.7-point drop) and task expression fell from 50 to 10 (a 40-point drop), both far larger than the historical average. The slight declines in execution and material constraints can be attributed to the difficulty of the sampled questions, but the simultaneous collapse of the two side-dimensions suggests that answer consistency drops sharply when the model faces questions requiring engineering trade-offs or clearly structured task output.
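To make the scale of these moves concrete, the short sketch below recomputes the absolute and relative drops from the before/after figures quoted above. The dictionary keys and the `drop_summary` helper are illustrative names, not part of the index's own tooling, and material constraints is left out because the article gives no exact figures for it.

```python
# Before/after scores quoted in the article (0-100 scale).
scores = {
    "execution": (100.0, 95.0),
    "engineering_judgment": (66.7, 30.0),
    "task_expression": (50.0, 10.0),
}

def drop_summary(before, after):
    """Return (absolute drop in points, drop as a fraction of the prior score)."""
    absolute = before - after
    relative = absolute / before
    return round(absolute, 1), round(relative, 3)

for name, (before, after) in scores.items():
    absolute, relative = drop_summary(before, after)
    print(f"{name}: -{absolute} points ({relative:.1%} of the prior score)")
```

Execution loses 5% of its prior score, while engineering judgment loses about 55% and task expression 80%, which is the gap the paragraph above describes.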

The stability dimension of the YZ Index has already shown that Gemini 2.5 Pro's score standard deviation has been relatively high recently, indicating inconsistent performance on similar problems. Today's side-dimension collapse is more likely a concentrated expression of this instability than a matter of unlucky question sampling.

What Industry Trends Confirm

Recently, Google's iteration focus for the Gemini series has been on safety alignment and refusal mechanisms. Multiple developers have reported that the model increasingly tends to give vague responses with excessive disclaimers when asked to provide specific engineering advice or compare multiple solutions. This "safety-first" adjustment directly impacts the two side-dimensions: engineering judgment and task expression.

Additionally, Gemini 2.5 Pro has begun to omit more intermediate steps in complex code execution scenarios, leading to small deductions in the execution dimension. The integrity rating's switch from pass to fail today more likely reflects the model contradicting itself or refusing to answer core questions on some problems.

Should It Be a Concern?

The mainboard is still dominated by code execution and material constraints. Although both dropped today, they remain at high levels, indicating that the model's foundational capabilities have not collapsed. However, engineering judgment and task expression each fell by more than half, and the integrity rating failed outright, which goes beyond the normal range of sampling fluctuation.

For users who rely on Gemini for engineering solution design or structured output, today's data sends a clear signal: under the current version, the model's consistency has significantly decreased in scenarios that require complex judgment and clear expression. In the short term, it is advisable to lower the trust weight for its engineering decision outputs and wait for the next version update or retesting with a larger sample.
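Why a larger sample helps can be sketched with the standard error of a mean score. The example assumes questions are scored independently with the same per-question spread; the value `ASSUMED_SD = 30.0` is purely hypothetical and is not a figure published by the index.

```python
import math

def standard_error(per_question_sd, n_questions):
    """Standard error of a mean score over n independent, equally noisy questions."""
    return per_question_sd / math.sqrt(n_questions)

ASSUMED_SD = 30.0  # hypothetical per-question spread, not taken from the index

for n in (2, 10, 50):
    print(f"{n} questions: standard error of about {standard_error(ASSUMED_SD, n):.1f} points")
```

Under these assumptions, the noise in a dimension score shrinks with the square root of the number of questions, so a 2-question dimension is several times noisier than a 50-question retest.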

A 22.6-point drop on the mainboard might be explained by sampling, but a collapse to 30 points in engineering judgment and a failed integrity rating can no longer be put down to "luck."

Data source: YZ Index | Run #124
