Gemini 3.1 Pro Main Score Plunges 11.1 Points as Code Execution Falls from 100 to 75

Gemini 3.1 Pro's main score dropped 11.1 points in today's Smoke quick test. The main driver was the code execution dimension, which fell from a perfect 100 to 75, while material constraint edged up from 69 to 75. The main score is built only from these two auditable dimensions, so the 25-point loss in code execution outweighed the 6-point gain in material constraint and pulled the overall score down.
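The per-dimension movement can be checked with a few lines of Python. This is a minimal sketch using only the four scores quoted above. The equal-weight average is an assumption made for illustration, since the index's actual aggregation formula is not stated here; under that assumption the drop comes out at 9.5 points rather than the reported 11.1, so the real main score presumably uses weights or unrounded inputs that this article does not give.

```python
# Dimension scores quoted in the article, as (yesterday, today).
scores = {
    "code_execution": (100, 75),
    "material_constraint": (69, 75),
}


def deltas(scores):
    """Day-over-day change for each dimension."""
    return {name: today - before for name, (before, today) in scores.items()}


def equal_weight_mean(scores, index):
    """ASSUMPTION: equal weights. The real main-score formula is not given."""
    values = [pair[index] for pair in scores.values()]
    return sum(values) / len(values)


dimension_deltas = deltas(scores)
mean_before = equal_weight_mean(scores, 0)
mean_after = equal_weight_mean(scores, 1)

print(dimension_deltas)
print(mean_before, mean_after, mean_after - mean_before)
```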

Question Sampling or Real Degradation

The Smoke evaluation uses only 10 questions per day, with 2 questions per dimension, so the sample size is small and some daily fluctuation is normal. However, the 25-point drop from 100 to 75 in code execution exceeds the normal range of sampling variation. Yesterday the model performed steadily on similar code tasks, but today it made consecutive errors in simple function implementation and boundary condition handling, which points to inconsistent model output rather than an unlucky draw.
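To put the 25-point move in terms of individual questions, the sketch below shows how much weight each question carries when a dimension has only 2 of them. The aggregation used here (a plain mean of per-question scores on a 0 to 100 scale) and the per-question values are assumptions for illustration; the Smoke test's real grading scheme is not described in this article.

```python
from statistics import mean

QUESTIONS_PER_DIMENSION = 2  # stated in the article


def dimension_score(question_scores):
    """Hypothetical aggregation: mean of per-question scores on a 0-100 scale."""
    if len(question_scores) != QUESTIONS_PER_DIMENSION:
        raise ValueError("expected one score per question in the dimension")
    return mean(question_scores)


def points_per_question(n=QUESTIONS_PER_DIMENSION):
    """Maximum share of the dimension score that one question controls."""
    return 100 / n


# Two made-up ways to land on 75 under the assumed scheme:
one_partial_miss = dimension_score([100, 50])  # one question loses half credit
two_partial_misses = dimension_score([75, 75])  # both questions lose a quarter

print(points_per_question(), one_partial_miss, two_partial_misses)
```

Under these assumptions each question controls up to 50 points of its dimension, which is why a single day's result is best read alongside the following days.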

The material constraint dimension actually rose by 6 points, indicating that the model is holding up on citation restrictions and fact-checking. The two auditable dimensions moving in opposite directions argues against a systemic, across-the-board failure.

Recent Industry Developments

Google has recently shifted its focus for the Gemini series toward multimodal and agent frameworks, which has diluted resources for code-specific optimization. Community feedback shows that version 3.1 occasionally exhibits logical jumps in long-context code completion, which is consistent with the issues seen in this Smoke evaluation. The engineering score (a side score produced by AI-assisted evaluation) jumped from 10 to 50, which fits a strategic shift toward non-code tasks, but it is not part of the auditable main score.

In comparison, both Claude and GPT-4o maintain code execution scores above 90 in similar quick tests over the same period, making Gemini 3.1 Pro's decline more pronounced.

Is It Worth Continued Attention?

This drop primarily stems from real volatility in code execution, not just luck of the draw. We recommend monitoring Smoke data for 3–5 consecutive days. If code execution stays below 85 points throughout that window, it may indicate a degradation in the model's code capabilities that lasts beyond a single run. The integrity rating remains at "pass," so the model still clears the bar for use in the short term, but developers should add manual verification steps when relying on it for code generation in production environments.
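The monitoring rule above can be written down as a small check. This is a sketch only: the function name, its parameters, and the sample score series are hypothetical, and only the 85-point threshold and the 3 to 5 day window come from the recommendation itself.

```python
def sustained_drop(daily_scores, threshold=85, min_days=3):
    """Return True if the most recent `min_days` scores are all below `threshold`.

    `daily_scores` is ordered oldest to newest. With fewer than `min_days`
    observations there is not enough data, so the check returns False.
    """
    if len(daily_scores) < min_days:
        return False
    return all(score < threshold for score in daily_scores[-min_days:])


# Made-up series for illustration; only the 100 -> 75 step is from the article.
still_low = [100, 75, 80, 70]
recovered = [100, 75, 90, 80]

print(sustained_drop(still_low))   # last 3 days: 75, 80, 70
print(sustained_drop(recovered))   # last 3 days: 75, 90, 80
```

Raising `min_days` to 5 makes the check stricter and corresponds to the upper end of the suggested window.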

The current signal is enough to raise caution, but not yet to the point of requiring large-scale migration.

The 25-point gap from 100 to 75 in code execution exposes the model's true boundaries more directly than any marketing claim.

Data source: YZ Index | Run #121