Gemini 2.5 Pro Drops 10 Points: Ability Intact, Credibility Fails

What stands out most about Gemini 2.5 Pro today is not a drop in ability, but its credibility rating flipping from pass to fail: the main ranking fell 10 points, yet the code execution score did not lose a single point.

This set of data is very unusual. From yesterday to today, Gemini 2.5 Pro's code execution held at 100.00 → 100.00, while material adherence rose from 64.50 → 74.30, a gain of 9.8 points. Under the YZ Index v6 methodology, the main ranking considers only the two auditable dimensions: code execution and material adherence. In other words, judged on the ability evidence alone, the model did not "collapse" on core capabilities.

But the main ranking shows a drop from 84.03 → 74.00, a single-day decline of roughly 10 points. The real explanation sits in the last line: credibility rating pass → fail. The credibility rating is not a bonus item but a threshold for entry. Once it fails, the model has hit compliance, citation-authenticity, or task-boundary issues that cannot be ignored during evaluation, and a perfect code execution score of 100 cannot offset them.
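Reading the credibility rating as a gate rather than a bonus can be sketched in code. Everything below is an assumption for illustration: the equal weighting of the two auditable dimensions and the fixed `FAIL_PENALTY` value are hypothetical, not the published YZ Index v6 formula (indeed, an equal-weight average of today's 100.00 and 74.30 would not reproduce the reported 74.00, so the actual gate mechanics clearly differ).

```python
# Hypothetical sketch of a "credibility gate" on top of two
# auditable dimensions. Weighting and penalty are assumptions,
# NOT the published YZ Index methodology.

FAIL_PENALTY = 15.0  # hypothetical value for illustration only


def main_ranking(code_execution: float,
                 material_adherence: float,
                 credibility_pass: bool) -> float:
    """Average the two auditable dimensions, then apply the
    credibility gate: a fail deducts a flat penalty that strong
    dimension scores cannot buy back."""
    base = (code_execution + material_adherence) / 2
    if credibility_pass:
        return base
    return max(0.0, base - FAIL_PENALTY)


# Same ability evidence, different credibility outcome:
print(main_ranking(100.0, 74.3, True))   # ability alone
print(main_ranking(100.0, 74.3, False))  # gated by the fail
```

The point of the sketch is the asymmetry: `code_execution` can be perfect and the gated score still drops, which matches the pattern in today's data.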

In short: Gemini 2.5 Pro today is not "unable to solve problems"; rather, "the way it solves problems is not trusted."

How much can question sampling fluctuation explain?

Smoke evaluation is a quick daily test of 10 questions, only 2 per dimension, so day-to-day fluctuations are naturally amplified. For example, engineering judgment (side ranking, AI-assisted evaluation) jumped from 10.00 → 30.00, up 20 points, while task expression (side ranking, AI-assisted evaluation) held at 30.00 → 30.00. Swings of this magnitude are not uncommon in a 10-question sample.
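How noisy a 2-question dimension score can be is easy to simulate. The question pool below is entirely hypothetical (the real pool and per-question scoring are not public): assume a model that earns 30 on some question types and 10 on others, then draw 2 questions per day, as Smoke does per dimension.

```python
import random
import statistics

# Hypothetical pool: the model scores 30 on 4 of 10 question
# types and 10 on the other 6. Real pool composition is unknown.
random.seed(0)
pool = [30.0] * 4 + [10.0] * 6

daily_scores = []
for _ in range(1000):
    sample = random.sample(pool, 2)  # 2 questions per dimension, as in Smoke
    daily_scores.append(statistics.mean(sample))

# Day-to-day range produced purely by sampling, model unchanged:
print(min(daily_scores), max(daily_scores))
```

Under these assumptions, a day's dimension score can only land on 10, 20, or 30, so a 10.00 → 30.00 swing needs no change in the model at all.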

But sampling fluctuation can hardly explain the credibility rating flipping straight from pass to fail, because the credibility rating judges baseline behavior, not a single poor answer. The rise in material adherence actually indicates the model did better today at sticking to the given materials when answering; if a fail was triggered at the same time, it is more likely that a few answers contained hard flaws: claiming non-existent evidence came from the given materials, over-extrapolating, refusing to admit insufficient information, or fabricating certainty at key factual points.

Does this look like real degradation? My judgment: not for now

If this were real degradation, we would typically see code execution and material adherence decline together, or at least structural score loss in the core dimensions. Today shows the opposite: code execution is perfect and material adherence rose. The main-ranking drop coinciding with the credibility fail suggests this anomaly is a threshold trigger, not a sudden drop in the model's underlying capability.

Gemini 2.5 Pro's industry position remains clear: it is still regarded as Google's flagship for high-level reasoning, coding, and long-context scenarios, competing with OpenAI's and Anthropic's flagship models for developer mindshare. A model at this tier often undergoes changes in API routing, system prompts, safety policies, and version fine-tuning. For a small-sample quick test like Smoke, a single policy change can suddenly make responses stiffer or more conservative, or cause anomalies at material boundaries.

Worth attention? Yes, but don't rush to write it off

My conclusion is clear: raise the attention level, but do not conclude that Gemini 2.5 Pro's ability has degraded. Three things to watch next: first, whether the credibility rating fails on consecutive days; second, whether material adherence can stay above 70 points; third, whether the perfect code execution score reflects stable performance or merely today's favorable question types.

One more point to emphasize: stability measures the consistency of scores when the same type of question is answered multiple times, computed as max(0, 100 - stddev × 2); it is not accuracy. A low stability score later should not be read simply as a "high error rate," but as greater output fluctuation on the same type of task.
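The distinction is easy to see by running the formula on two contrived score series. One assumption in the sketch: whether the index uses sample or population standard deviation is not stated, so `statistics.stdev` (sample) is used here.

```python
import statistics


def stability(scores: list[float]) -> float:
    """YZ Index stability: max(0, 100 - stddev * 2).

    Assumes sample standard deviation; the index does not say
    which variant it uses.
    """
    return max(0.0, 100.0 - statistics.stdev(scores) * 2)


# Consistently wrong answers still yield perfect stability:
print(stability([40.0, 40.0, 40.0]))  # prints 100.0

# Accurate on average but swinging wildly scores low:
print(stability([90.0, 30.0, 90.0, 30.0]))
```

A model that is reliably mediocre gets stability 100, while one that alternates between excellent and poor answers is penalized, which is exactly why stability must not be read as accuracy.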

Today's signal is not "Gemini 2.5 Pro has become dumber," but "its trust boundary has cracked." In enterprise procurement, ability determines the ceiling, but credibility rating determines whether you can get through the door.


Data source: YZ Index | Run #118 | View Raw Data