Gemini 2.5 Pro Material Constraint Plunges 15.2 Points, Code Execution Soars 45 Points

In the June 2026 Smoke evaluation of the YZ Index, Gemini 2.5 Pro's material constraint score dropped from 92.50 to 77.30 points, a single-day decline of 15.2 points. Over the same day, code execution jumped from 55.00 to 100.00 points, and the main ranking total score rose from 71.88 to 89.79 points.

Daily 10-Question Sampling Most Likely Explains the Fluctuation

The Smoke evaluation uses only 2 questions per dimension per day, 10 questions in total. Yesterday's 92.50 points in the material constraint dimension corresponded to a high pass rate, and today's 77.30 points reflects a lower one. The code execution dimension scored only 55.00 points yesterday but reached a perfect 100.00 points today, which suggests that the 2 code questions sampled today were a better match for this model in difficulty or type. Engineering judgment rose from 73.50 to 84.00 points, while task expression was unchanged at 86.00 points. All of these changes are consistent with the noise expected from such a small sample.
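The day-over-day changes quoted above can be checked with a few lines of Python. This is only a sketch: the dictionary keys and the function name are illustrative, the scores are copied from the text, and the main ranking total is included as reported rather than recomputed, since its weighting is not given here.

```python
# Scores quoted in the article, as (yesterday, today).
# The key names are illustrative labels, not YZ Index identifiers.
scores = {
    "material_constraint": (92.50, 77.30),
    "code_execution": (55.00, 100.00),
    "engineering_judgment": (73.50, 84.00),
    "task_expression": (86.00, 86.00),
    "main_ranking_total": (71.88, 89.79),
}


def deltas(table):
    """Return today's score minus yesterday's for each entry, rounded to 2 places."""
    return {name: round(today - prev, 2) for name, (prev, today) in table.items()}


for name, change in deltas(scores).items():
    print(f"{name}: {change:+.2f}")
```

Of the four dimension scores listed, only material constraint moves down; the others rise or hold steady.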

If the model were experiencing genuine degradation, multiple dimensions would typically decline simultaneously. However, the 45-point surge in code execution and the significant increase in the main ranking total score point to sampling fluctuation rather than capability regression.
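To see why a 2-question sample can swing this much without any change in the model, a small simulation helps. The sketch below rests on assumptions that are not part of the YZ Index methodology: per-question scores are drawn from a normal distribution around a fixed "true" level of 80 with a standard deviation of 20, clipped to the 0-100 range, and the daily dimension score is their plain average.

```python
import random
import statistics


def daily_score_spread(n_questions, true_mean=80.0, question_sd=20.0,
                       days=2000, seed=0):
    """Standard deviation of the daily score when each day averages n_questions items.

    true_mean and question_sd are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    daily = []
    for _ in range(days):
        # The model's underlying level never changes; only the sampled questions do.
        items = [min(100.0, max(0.0, rng.gauss(true_mean, question_sd)))
                 for _ in range(n_questions)]
        daily.append(statistics.fmean(items))
    return statistics.pstdev(daily)


print("spread with 2 questions per day:", round(daily_score_spread(2), 1))
print("spread with 50 questions per day:", round(daily_score_spread(50), 1))
```

Under these assumptions the day-to-day spread with 2 questions is several times larger than with 50, so double-digit moves between consecutive days are unsurprising even when capability is constant.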

No Need to Read This as Model Degradation Yet

The material constraint dimension measures how closely the model stays within the boundaries of the material it is given. Today's score of 77.30 points remains within the passing range, and neither engineering judgment nor task expression showed a corresponding decline. The integrity rating remains at "pass," indicating that the model has not exhibited violations such as refusing to answer or fabricating content.

With only two days of data, a 15.2-point drop is not enough to conclude that the model has systematically degraded. Only if the same dimension stays below 80 points for multiple consecutive days would it become a signal that warrants focused tracking.
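The consecutive-day rule can be written down as a short check. This is a minimal sketch: the 80-point threshold comes from the text, while the function names and the choice of 3 days for "multiple consecutive days" are assumptions made for illustration.

```python
def below_threshold_streak(history, threshold=80.0):
    """Count how many of the most recent consecutive days scored below threshold."""
    streak = 0
    for score in reversed(history):
        if score >= threshold:
            break
        streak += 1
    return streak


def warrants_focused_tracking(history, threshold=80.0, min_days=3):
    """True once the dimension has stayed below threshold for min_days in a row.

    min_days=3 is a hypothetical reading of "multiple consecutive days".
    """
    return below_threshold_streak(history, threshold) >= min_days


# The two material constraint scores reported so far, oldest first.
material_constraint = [92.50, 77.30]
print(below_threshold_streak(material_constraint))     # 1
print(warrants_focused_tracking(material_constraint))  # False
```

With the two scores available so far the streak is a single day, which is why the drop does not yet qualify as a tracking signal.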

Recommendations for Subsequent Observation

Gemini 2.5 Pro's material constraint score should be tracked over three consecutive Smoke cycles. If the dimension rebounds above 85 points in the next two days, today's 77.30 points can be treated as a sampling anomaly; if it stays in the 75 to 80 point range, further judgment should rest on the performance of the grounding dimension in formal evaluations.
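The follow-up rule above can be sketched as a small classifier. The 85-point and 75 to 80 point thresholds are taken from the text; the function name, the return labels, and the reading of "rebounds" as "any follow-up score above 85" are assumptions.

```python
def classify_follow_up(follow_up_scores):
    """Classify the material constraint scores observed after today's 77.30.

    Assumed interpretation: a single follow-up score above 85 counts as a
    rebound, and "remains in range" means every follow-up score is in 75-80.
    """
    if not follow_up_scores:
        return "inconclusive"
    if any(score > 85.0 for score in follow_up_scores):
        return "sampling anomaly"
    if all(75.0 <= score <= 80.0 for score in follow_up_scores):
        return "check grounding in formal evaluation"
    return "inconclusive"


# Hypothetical follow-up values, for illustration only.
print(classify_follow_up([88.0, 90.0]))  # sampling anomaly
print(classify_follow_up([77.0, 79.5]))  # check grounding in formal evaluation
print(classify_follow_up([82.0, 83.0]))  # inconclusive
```

Scores that land between 80 and 85 match neither branch of the rule as written, so the sketch returns "inconclusive" for them rather than guessing.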

For now, Gemini 2.5 Pro's main ranking score of 89.79 points is high, and the single-day fluctuation in material constraint has limited impact on overall usability.


Data source: YZ Index | Run #166