Gemini 2.5 Pro's Material Constraint Plummets 14 Points, Main Ranking Rises 15.9 Instead – Sampling Variance or True Regression?

In today's Smoke evaluation, Gemini 2.5 Pro's material constraint score fell from 91.50 to 77.50, a 14-point decline that stands out as an anomaly for a single-day quick test.

Data Breakdown: Main Ranking vs. Side Rankings Contradiction

The main ranking only looks at code execution and material constraint. Code execution rose from 95 to 100 while material constraint fell by 14 points, yet the main ranking still rose from 74.00 to 89.88, a gain of 15.88 points (about 15.9). On the side rankings, engineering judgment jumped from 30.00 to 58.40 and task expression rose from 10.00 to 30.00. The two side rankings together pulled up the total score, and the decline in material constraint was completely masked.
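The day-over-day changes quoted above can be checked with a few lines of arithmetic. The sketch below only recomputes per-metric deltas from the figures in this article; the dictionary keys and variable names are illustrative, and it makes no attempt to reproduce how the YZ Index combines dimensions into the main ranking, since that formula is not given here.

```python
# (yesterday, today) scores as quoted in this article; names are illustrative.
scores = {
    "code execution": (95.00, 100.00),
    "material constraint": (91.50, 77.50),
    "main ranking": (74.00, 89.88),
    "engineering judgment": (30.00, 58.40),
    "task expression": (10.00, 30.00),
}

# Day-over-day change per metric, rounded to two decimals to hide float noise.
deltas = {
    name: round(today - yesterday, 2)
    for name, (yesterday, today) in scores.items()
}

for name, delta in deltas.items():
    print(f"{name:22s} {delta:+.2f}")
```

Material constraint is the only metric in this list that moves down, which is why it is easy to miss behind the headline gain.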

The Smoke evaluation draws only 10 questions per day, 2 per dimension, so the sample is extremely small. If a single high-difficulty material constraint question is drawn, the score can easily drop sharply. A 14-point swing, such as the move from yesterday's 91.5 to today's 77.5, is not uncommon with a sample this small.
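To get a rough feel for how noisy a 2-question average is, the sketch below uses the textbook standard error of a mean. It assumes that questions are drawn independently and that per-question scores have a spread of 15 points; that spread is a made-up placeholder, not a figure published by the YZ Index, and the function names are illustrative.

```python
import math

def se_of_daily_mean(per_question_sd: float, n_questions: int) -> float:
    """Standard error of a score that averages n independent questions."""
    return per_question_sd / math.sqrt(n_questions)

def se_of_day_to_day_change(per_question_sd: float, n_questions: int) -> float:
    """Standard error of the difference between two independent daily means."""
    return math.sqrt(2) * se_of_daily_mean(per_question_sd, n_questions)

ASSUMED_SD = 15.0  # hypothetical per-question spread in points (assumption)
N_QUESTIONS = 2    # questions per dimension in the Smoke evaluation

change_se = se_of_day_to_day_change(ASSUMED_SD, N_QUESTIONS)
observed_drop = 91.5 - 77.5

print(f"SE of day-to-day change: {change_se:.1f} points")
print(f"Observed drop in SE units: {observed_drop / change_se:.2f}")
```

Under this assumed spread, a 14-point drop is slightly less than one standard error of the day-to-day change. The conclusion depends entirely on the assumed per-question spread, so it should be replaced with a value estimated from real per-question results if those are available.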

True Regression or Random Sampling?

Looking at the trend over the past two weeks, Gemini 2.5 Pro's material constraint score fluctuated in the 83-92 range, with today's 77.5 being a clear low point. Code execution has remained consistently above 95, indicating that the model still maintains a high standard in structured output and logical coherence.

If it's merely due to sampling variance, material constraint will likely rebound above 85 tomorrow. If material constraint remains below 80 for two or three consecutive days, it is more likely that the model has a systemic issue with long-context factual consistency.

Recent Industry Dynamics and Possible Triggers

Google recently further expanded Gemini 2.5 Pro's context window to 2 million tokens, while internally testing a new chain-of-thought compression algorithm. With the larger window, the model's risk of "factual drift" when handling long documents increases, which is directly related to the material constraint dimension.

Additionally, Google is accelerating the shift of 2.5 Pro's weights toward multimodal alignment, potentially temporarily sacrificing pure-text factual constraint capabilities. This aligns with the timing of today's material constraint crash.

Should We Be Concerned?

For now, sampling variance remains the primary explanation, but if material constraint stays below 80 for multiple days, caution is warranted. The honesty rating has shifted from fail to warn, indicating that the model's performance in rejecting harmful requests and in factual consistency has improved, so this is not an across-the-board regression.

We recommend tracking the data for three consecutive days. Only if material constraint stays below 80 throughout and the drop is confirmed on long-context benchmarks (e.g., NarrativeQA) can a true regression be concluded.
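The tracking rule above can be written down as a small check. This is a minimal sketch of that rule of thumb only: the function names are illustrative, the threshold of 80 and the three-day window come from this article, and the second example series is invented purely to show the flagged case.

```python
def trailing_days_below(scores, threshold):
    """Count how many of the most recent consecutive daily scores are below threshold."""
    count = 0
    for score in reversed(scores):
        if score >= threshold:
            break
        count += 1
    return count

def assess(scores, threshold=80.0, days_required=3):
    """Apply the three-day rule to a chronological list of material constraint scores."""
    if trailing_days_below(scores, threshold) >= days_required:
        return "possible regression: verify with long-context benchmarks"
    return "consistent with sampling variance so far"

# Scores reported in this article: yesterday and today.
print(assess([91.5, 77.5]))

# Hypothetical continuation (invented values) with three low days in a row.
print(assess([91.5, 77.5, 78.0, 76.5]))
```

Even when the rule fires, it only signals that follow-up verification is worth running; it does not by itself establish a regression.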

The 14-point drop in material constraint is like a mirror, reflecting the cruelty of small-sample quick tests and the hidden cost of expanding the model's context.

Data source: YZ Index | Run #126