Gemini 2.5 Pro's Material Constraint Plummets 14 Points, Main Ranking Rises 15.9 Instead – Sampling Variance or True Regression?

May 21, 2026 | Winzheng Index

Gemini 2.5 Pro, Material Constraints, Smoke Test, Performance Fluctuation, Model Degradation

In today's Smoke evaluation, Gemini 2.5 Pro's material constraint score dropped from 91.50 to 77.50, a decline of 14 points and a significant anomaly for a single-day quick test.

Data Breakdown: Main Ranking vs. Side Rankings Contradiction

The main ranking is based on code execution and material constraint only. Code execution rose from 95 to 100 while material constraint fell from 91.50 to 77.50, yet the main ranking still rose from 74.00 to 89.88, a gain of 15.88 points (15.9 rounded). On the side rankings, engineering judgment jumped from 30.00 to 58.40 and task expression rose from 10.00 to 30.00. Together, these two gains pulled up the total score and completely masked the decline in material constraint.
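The day-over-day changes quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures in this article. The dictionary keys and the `deltas` helper are illustrative names, and the plain-mean check is an assumption for comparison only, since the article does not state the main-ranking formula.

```python
# Scores quoted in the article, as (yesterday, today) pairs.
scores = {
    "code_execution": (95.00, 100.00),
    "material_constraint": (91.50, 77.50),
    "main_ranking": (74.00, 89.88),
    "engineering_judgment": (30.00, 58.40),
    "task_expression": (10.00, 30.00),
}


def deltas(table):
    """Return today's score minus yesterday's score, rounded to 2 decimals."""
    return {name: round(today - prev, 2) for name, (prev, today) in table.items()}


changes = deltas(scores)
for name, change in changes.items():
    print(f"{name:22s} {change:+.2f}")

# Assumption for comparison only: an unweighted mean of the two dimensions
# said to feed the main ranking. The real weighting is not given in the article.
plain_mean_prev = (scores["code_execution"][0] + scores["material_constraint"][0]) / 2
plain_mean_today = (scores["code_execution"][1] + scores["material_constraint"][1]) / 2
print(f"plain mean: {plain_mean_prev:.2f} -> {plain_mean_today:.2f}")
```

An unweighted mean of the two inputs would move from 93.25 to 88.75, which does not match the published 74.00 and 89.88. The main ranking is therefore evidently not a plain average, and the sketch makes no attempt to reproduce it.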

The Smoke evaluation draws only 10 questions per day, 2 per dimension, so the sample is extremely small. If a high-difficulty material constraint question is drawn, the score can drop sharply. A 14-point swing, such as yesterday's 91.50 to today's 77.50, is not unusual at this sample size.
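To put a rough number on "not unusual", the sketch below estimates how often a swing of 14 points or more would occur by chance when each daily score is the mean of 2 questions. It rests on stated assumptions rather than measured data: question scores are treated as independent and roughly normal, and the per-question standard deviation of 10 points (`ASSUMED_STD`) is a hypothetical value, not a figure from the Winzheng Index.

```python
import math


def diff_std(per_question_std: float, questions_per_day: int) -> float:
    """Std dev of the difference between two independent daily mean scores."""
    # Each daily mean has variance sigma^2 / n, and the difference of two
    # independent means has twice that variance.
    return per_question_std * math.sqrt(2.0 / questions_per_day)


def prob_swing_at_least(swing: float, per_question_std: float,
                        questions_per_day: int) -> float:
    """Two-sided normal tail probability that |day2 - day1| >= swing."""
    z = swing / diff_std(per_question_std, questions_per_day)
    # For a standard normal Z, P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


ASSUMED_STD = 10.0  # hypothetical per-question spread, in points

p_small_sample = prob_swing_at_least(14.0, ASSUMED_STD, questions_per_day=2)
p_large_sample = prob_swing_at_least(14.0, ASSUMED_STD, questions_per_day=20)

print(f"P(swing >= 14) with 2 questions/day:  {p_small_sample:.3f}")
print(f"P(swing >= 14) with 20 questions/day: {p_large_sample:.6f}")
```

Under these assumptions a 14-point swing shows up on roughly one day in six with 2 questions per dimension, and becomes very rare with 20. The conclusion depends heavily on the assumed spread, so the output illustrates the effect of sample size and is not an estimate for this benchmark.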

True Regression or Random Sampling?

Over the past two weeks, Gemini 2.5 Pro's material constraint score fluctuated in the 83-92 range, which makes today's 77.5 a clear low point. Code execution has stayed at 95 or above throughout, indicating that the model still maintains a high standard in structured output and logical coherence.

If the drop is merely sampling variance, material constraint will likely rebound above 85 tomorrow. If it stays below 80 for two or three consecutive days, a systemic issue with long-context factual consistency becomes the more likely explanation.
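The rule of thumb above can be written as a small classifier over the daily material constraint scores. This is a minimal sketch: the function name, labels, and parameter defaults are illustrative, with the thresholds 85 and 80 and the two-day minimum taken from the paragraph above.

```python
def classify_trend(daily_scores, rebound=85.0, floor=80.0, min_days=2):
    """Classify a run of daily scores, listed oldest first.

    Returns "possible regression" when the most recent `min_days` or more
    consecutive scores are below `floor`, "likely sampling variance" when the
    latest score is above `rebound`, and "inconclusive" otherwise.
    """
    if not daily_scores:
        raise ValueError("daily_scores must contain at least one score")

    # Count how many of the most recent days were below the floor in a row.
    streak = 0
    for score in reversed(daily_scores):
        if score < floor:
            streak += 1
        else:
            break

    if streak >= min_days:
        return "possible regression"
    if daily_scores[-1] > rebound:
        return "likely sampling variance"
    return "inconclusive"


# Today's single low reading is not enough to decide either way.
print(classify_trend([91.5, 77.5]))
# Hypothetical follow-up days, for illustration only.
print(classify_trend([91.5, 77.5, 88.0]))
print(classify_trend([91.5, 77.5, 79.0, 78.0]))
```

A "possible regression" label is only a trigger for further checking on a larger benchmark and is not a verdict by itself.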

Recent Industry Dynamics and Possible Triggers

Google recently further expanded Gemini 2.5 Pro's context window to 2 million tokens while internally testing a new chain-of-thought compression algorithm. A larger window increases the model's risk of "factual drift" when handling long documents, and that risk bears directly on the material constraint dimension.

Additionally, Google is accelerating the shift of 2.5 Pro's weights toward multimodal alignment, which may temporarily come at the cost of pure-text factual constraint capabilities. The timing coincides with today's drop in material constraint.

Should We Be Concerned?

For now, sampling variance remains the primary explanation, but caution is warranted if material constraint stays below 80 for multiple days. The honesty rating moved from fail to warn, which indicates improvement in rejecting harmful requests and in factual consistency, so the picture is not one of full regression.

The recommendation is to track the data for three consecutive days. A conclusion of true regression is justified only if material constraint stays below 80 and the result is confirmed on long-context benchmarks (e.g., NarrativeQA).

The 14-point drop in material constraint is like a mirror, reflecting the cruelty of small-sample quick tests and the hidden cost of expanding the model's context.

Data source: YZ Index | Run #126
