Gemini 3.1 Pro Drops 8.5 Points on Main Leaderboard, Code Execution Plummets 9.5 – Lottery or Degradation?

May 22, 2026 | Winzheng Index

Gemini 3.1 Pro, Code Execution, Smoke Quick Test, Model Fluctuations, Google AI

Gemini 3.1 Pro dropped 8.5 points on the main leaderboard in today's Smoke evaluation. The code execution dimension fell from 66.70 to 57.20 (down 9.5 points), and the material constraint dimension slid from 86.30 to 79.00 (down 7.3 points). That is a large single-day decline, even for a quick test that draws only 10 questions per day.

Source of Fluctuation: Lottery or Real Degradation

The Smoke evaluation randomly draws 2 questions per dimension each day, and with a sample this small, day-to-day variance in the score is easily amplified. The 9.5-point drop in code execution is very likely because the questions drawn that day included high-difficulty topics such as logarithmic computation and recursion optimization. If the model skips intermediate steps in complex multi-step reasoning, its score on those questions falls directly.
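To make the small-sample point concrete, here is a minimal sketch that enumerates every possible daily draw from a question pool and measures how much the daily score can vary. The pool size and the per-question scores in `pool_scores` are illustrative assumptions, not YZ Index data; only the draw size of 2 comes from the article.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical per-question scores (0-100) for one dimension's question pool.
# These values are illustrative assumptions, not YZ Index data.
pool_scores = [90, 85, 80, 75, 70, 60, 50, 40, 30, 20]

def daily_score_spread(pool, k):
    """Return (mean, std dev) of the daily score over every possible draw of k questions."""
    daily = [mean(draw) for draw in combinations(pool, k)]
    return mean(daily), pstdev(daily)

# Compare the article's draw size (2) with a larger hypothetical draw (5).
for k in (2, 5):
    avg, sd = daily_score_spread(pool_scores, k)
    print(f"k={k}: mean={avg:.1f}, std dev={sd:.1f}")
```

Under these assumed scores, the average daily score is the same for both draw sizes, but the spread with 2 questions is roughly twice the spread with 5. A swing of 9.5 points would then sit within one standard deviation of pure sampling noise, which is why a single day's result is weak evidence on its own.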

The 7.3-point drop in the material constraint dimension deserves more attention. This dimension tests whether the model stays strictly within the boundaries of the material provided by the user. Today's questions may have touched on topics where external knowledge is easily confused with the supplied material. If Gemini 3.1 Pro extrapolates beyond that material, it loses points.

Side Leaderboard Data Shows Inconsistent Signals

Engineering judgment (side leaderboard, AI-assisted evaluation) dropped from 58.40 to 50.00, while task expression rose sharply from 30.00 to 50.00. Such large moves in opposite directions, on the same day and for the same model, suggest that its output consistency has decreased significantly. Combined with a stability score of only 31.7, this points to considerable randomness in the current answer quality of Gemini 3.1 Pro.
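The day-over-day changes quoted so far can be checked with a few lines of arithmetic. The sketch below uses only the before and after scores given in the article; the dimension keys and the helper names `day_over_day` and `moves_diverge` are hypothetical, chosen for illustration.

```python
# (previous, today) scores as quoted in the article.
scores = {
    "code_execution": (66.70, 57.20),
    "material_constraint": (86.30, 79.00),
    "engineering_judgment": (58.40, 50.00),
    "task_expression": (30.00, 50.00),
}

def day_over_day(scores):
    """Return today's score minus the previous score for each dimension."""
    return {name: round(today - prev, 2) for name, (prev, today) in scores.items()}

def moves_diverge(deltas):
    """True when at least one dimension rose while another fell."""
    values = list(deltas.values())
    return any(d > 0 for d in values) and any(d < 0 for d in values)

deltas = day_over_day(scores)
for name, d in deltas.items():
    print(f"{name}: {d:+.1f}")
print("diverging:", moves_diverge(deltas))
```

Three dimensions fall while task expression gains 20 points, so the divergence check is true. A sign check like this only flags inconsistent movement; it does not say whether the cause is sampling noise or a change in the model.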

Recent Industry Dynamics Add to the Impact

Google has recently been concentrating resources on advancing the Gemini 2.5 series and native multimodal capabilities, and the iteration pace of the 3.1 Pro version has noticeably slowed. Some developers report that when debugging long chains of code, the model frequently omits intermediate verification steps, which is consistent with today's drop in the code execution dimension.

At the same time, the continued lead of OpenAI o1 and Anthropic Claude 4 on code benchmarks has put heavy pressure on Google in engineering deployment. The shift in resource allocation may have caused a temporary "blood loss" in certain sub-capabilities of 3.1 Pro.

Do We Need to Pay Close Attention?

Taking all factors into account, this decline is mainly the result of question sampling fluctuations combined with declining model consistency, rather than a systemic capability degradation. However, if similar fluctuations across dimensions persist for two consecutive weeks, longer-term tracking with a 7-day moving average should begin. For now, the assessment is "observation period," not "alert period."

If Gemini 3.1 Pro's code execution score does not return above 62 points in next week's Smoke evaluation, developers should consider reducing their reliance on it for code generation in production environments.
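The tracking rule described above could be sketched as follows. The 7-day window and the 62-point threshold come from the article. Applying the threshold to the moving average rather than to a single day's score is an assumption of this sketch, and all values in `hypothetical_week` other than the first (57.2) are invented for illustration.

```python
from collections import deque

RECOVERY_THRESHOLD = 62.0  # code execution score named in the article
WINDOW_DAYS = 7            # moving-average window named in the article

def moving_average(values, window=WINDOW_DAYS):
    """Yield the trailing mean once a full window of daily scores is available."""
    buf = deque(maxlen=window)
    for v in values:
        buf.append(v)
        if len(buf) == window:
            yield sum(buf) / window

def needs_review(daily_scores, threshold=RECOVERY_THRESHOLD):
    """True when the latest full-window average has not recovered above the threshold.

    Assumption: the rule is applied to the 7-day average, not to one day's score.
    """
    averages = list(moving_average(daily_scores))
    return bool(averages) and averages[-1] <= threshold

# Hypothetical daily code execution scores; only 57.2 is from the article.
hypothetical_week = [57.2, 60.0, 63.5, 58.0, 61.0, 64.0, 59.5]
print(needs_review(hypothetical_week))
```

Averaging over a week smooths out the 2-question daily draws, so a threshold on the average is less likely to trigger on a single unlucky sample. With fewer than seven days of data, `needs_review` returns `False` rather than guessing.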

Data source: YZ Index | Run #127
