Gemini 3.1 Pro Drops 8.5 Points on Main Leaderboard, Code Execution Plummets 9.5 – Lottery or Degradation?

Gemini 3.1 Pro dropped 8.5 points on the main leaderboard in today's Smoke evaluation. Code execution fell from 66.70 to 57.20 (-9.5) and material constraint slid from 86.30 to 79.00 (-7.3). A single-day decline of this size is extreme, even for a quick test that includes only 10 questions per day.

Source of Fluctuation: Lottery or Real Degradation

The Smoke evaluation randomly draws only 2 questions per dimension each day, and a sample that small inflates day-to-day variance. The 9.5-point drop in code execution most likely reflects that day's draw, which included hard topics such as logarithmic computation and recursive optimization. If the model skips intermediate steps in complex multi-step reasoning, its score falls directly.
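To see how much a 2-question draw can swing a dimension score, here is a minimal Python sketch. The question pool and its per-question scores are invented for illustration and are not YZ Index data, and the function name `daily_score_spread` is hypothetical. The sketch enumerates every possible draw of k questions and reports the spread of the resulting daily averages.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical per-question scores for one dimension's pool (NOT YZ Index data).
pool = [95, 90, 85, 80, 70, 60, 45, 30]

def daily_score_spread(pool, k):
    """Return (min, max, std dev) of the daily score over every possible draw of k questions."""
    draws = [mean(combo) for combo in combinations(pool, k)]
    return min(draws), max(draws), pstdev(draws)

# Compare a 2-question draw with a hypothetical 6-question draw from the same pool.
lo2, hi2, sd2 = daily_score_spread(pool, 2)
lo6, hi6, sd6 = daily_score_spread(pool, 6)

print(f"k=2: min={lo2:.1f} max={hi2:.1f} std={sd2:.1f}")
print(f"k=6: min={lo6:.1f} max={hi6:.1f} std={sd6:.1f}")
```

On this made-up pool, a 2-question day can land anywhere between the average of the two hardest questions and the average of the two easiest, while a larger draw narrows that range. The real size of the effect depends on the actual question pool, which the article does not disclose.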

The 7.3-point drop in the material constraint dimension deserves more attention. This dimension tests whether the model stays strictly within the material the user provides. Today's questions may have involved a large amount of easily confused external knowledge, and if Gemini 3.1 Pro extrapolated beyond the supplied material, it would have lost points.

Side Leaderboard Data Shows Inconsistent Signals

On the side leaderboard (AI-assisted evaluation), engineering judgment dropped from 58.40 to 50.00, while task expression rose sharply from 30.00 to 50.00. Swings this large in opposite directions, on the same day and for the same model, suggest that its output consistency has decreased noticeably. Combined with a stability score of only 31.7, this points to considerable randomness in the current answer quality of Gemini 3.1 Pro.
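The per-dimension changes quoted in this article can be tabulated with a few lines of Python. The before and after scores are taken from the text; the variable names and the "mixed direction" check are illustrative choices, not part of the YZ Index methodology.

```python
# Before/after scores as quoted in this article.
scores = {
    "code execution": (66.70, 57.20),
    "material constraint": (86.30, 79.00),
    "engineering judgment": (58.40, 50.00),
    "task expression": (30.00, 50.00),
}

# Day-over-day change per dimension, rounded to avoid floating-point noise.
deltas = {name: round(after - before, 2) for name, (before, after) in scores.items()}

# True when at least one dimension rose while another fell on the same day.
mixed_direction = any(d > 0 for d in deltas.values()) and any(d < 0 for d in deltas.values())

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
print("mixed direction:", mixed_direction)
```

Three dimensions fall by 7.3 to 9.5 points while task expression gains 20, which is the opposite-direction pattern described above.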

Recent Industry Dynamics Add to the Impact

Google has recently been concentrating resources on advancing the Gemini 2.5 series and native multimodal capabilities, and the iteration pace of the 3.1 Pro version has noticeably slowed. Some developers have reported that the model often omits intermediate verification steps when debugging long chains of code, which is consistent with today's collapse in the code execution dimension.

At the same time, the sustained lead of OpenAI o1 and Anthropic Claude 4 on code benchmarks has put heavy pressure on Google's engineering deployment. The shift in resource allocation may have caused a temporary regression in certain sub-capabilities of 3.1 Pro.

Do We Need to Pay Close Attention?

Taking all factors into account, this decline appears to be driven mainly by question-sampling noise combined with declining model consistency, rather than by a systemic capability degradation. However, if similar swings across dimensions persist for two consecutive weeks, the scores should be tracked with a 7-day moving average to separate trend from noise. For now, the verdict is "observation period," not "alert period."
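As a rough illustration of the 7-day moving average mentioned above, the following Python sketch smooths a series of daily scores. Only the last two values (66.70 and 57.20) come from this article; the earlier days are invented placeholders, and `moving_average` is a hypothetical helper rather than anything the YZ Index publishes.

```python
def moving_average(scores, window=7):
    """Trailing moving average: one value per day once `window` days of data exist."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(scores[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(scores))
    ]

# Hypothetical daily code execution scores; only 66.70 and 57.20 are from the article.
daily = [64.0, 68.5, 61.0, 70.0, 65.5, 63.0, 66.70, 57.20]
smoothed = moving_average(daily)

for day, value in enumerate(smoothed, start=7):
    print(f"day {day}: 7-day average = {value:.2f}")
```

On this made-up series, the 9.5-point single-day drop moves the 7-day average by about one point, which is why a smoothed series is less likely to trigger a false alarm from one unlucky draw.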

If Gemini 3.1 Pro does not recover to a code execution score above 62 in next week's Smoke evaluation, developers should consider reducing their reliance on it for code generation in production environments.


Data source: YZ Index | Run #127