Gemini 3.1 Pro Tops with 98.47 Points, Claude's Execution Score Plunges 27.2 to 72.8

Jun 30, 2026 | Source: Winzheng Index

Gemini 3.1 Pro | Code Execution | Smoke Lite Evaluation | Main Ranking | Decline | Model Structure Analysis

In the June 30, 2026 Smoke Lite evaluation of the YZ Index, Gemini 3.1 Pro ranked first with a main score of 98.47 (Code Execution 100, Material Constraints 96.6).

This evaluation covered 11 models, with core_overall weighted as 0.55 × Code Execution + 0.45 × Material Constraints. Both Gemini 3.1 Pro and Grok 4 achieved a perfect execution score of 100, but Grok 4 scored only 95.5 in constraints, placing it 0.49 points behind in the main ranking.
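The weighting above can be checked with a few lines of Python. This is a minimal sketch, not the index's own scoring code: the function and constant names are illustrative assumptions, and only the weights and the two models' scores come from the article.

```python
# Weights as stated in the article; all names here are illustrative assumptions.
EXECUTION_WEIGHT = 0.55
CONSTRAINTS_WEIGHT = 0.45


def core_overall(execution: float, constraints: float) -> float:
    """Main score: 0.55 x Code Execution + 0.45 x Material Constraints."""
    return EXECUTION_WEIGHT * execution + CONSTRAINTS_WEIGHT * constraints


# Scores reported in the article for the two models with perfect execution.
gemini_31_pro = core_overall(100, 96.6)
grok_4 = core_overall(100, 95.5)
gap = gemini_31_pro - grok_4

print(f"Gemini 3.1 Pro: {gemini_31_pro:.3f}")
print(f"Grok 4:         {grok_4:.3f}")
print(f"Gap:            {gap:.3f}")
```

Because both models share the same execution score, the gap reduces to 0.45 × (96.6 − 95.5) = 0.495 before rounding, which matches the roughly 0.49-point difference reported above.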

Structural Differences Between Execution and Constraints

DeepSeek V4 Pro scored 96.65 on the main ranking, with 94.8 for execution and 98.9 for constraints. It had a clear advantage in constraints, 2.3 points higher than Gemini 3.1 Pro, but ranked third due to being 5.2 points lower in execution.

GPT-o3 and GPT-5.5 both scored 83.3 in execution, with constraint scores of 98.9 and 94.3 respectively. GPT-o3 finished 2.07 points higher in the main ranking: with execution tied, the 4.6-point gap in constraints (0.45 × 4.6 = 2.07) alone decides their order.

Widespread Decline in Execution Scores Across Multiple Models

Compared to yesterday, Claude Opus 4.7's execution score dropped 27.2 to 72.8, causing a 16-point drop in its main score; Claude Sonnet 4.6's execution score dropped 25 to 75, with a 15.3-point drop in its main score. Qwen3 Max's execution score dropped 12.7 to 75, resulting in a 9.1-point drop in its main score. Gemini 2.5 Pro's execution score dropped 21.9 to 53.1, with a 13.6-point drop in its main score.

ERNIE Bot 4.5 (文心一言 4.5) saw its execution score drop 14.6 to 75 and its constraint score drop 20.2 to 66.3, causing a 17.1-point drop in its main score, the largest decline of any model today.
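Since the main score is a weighted sum, a day-over-day change splits into an execution part and a constraints part. The sketch below applies that split to the figures above; the function and variable names are illustrative assumptions, and only 文心一言 4.5 has both component changes reported in the article.

```python
# Weights as stated in the article; all names here are illustrative assumptions.
EXECUTION_WEIGHT = 0.55
CONSTRAINTS_WEIGHT = 0.45


def main_score_change(execution_change: float, constraints_change: float) -> float:
    """Change in the main score implied by changes in the two components."""
    return (EXECUTION_WEIGHT * execution_change
            + CONSTRAINTS_WEIGHT * constraints_change)


# 文心一言 4.5: both component changes are reported.
ernie_change = main_score_change(-14.6, -20.2)

# For the other models only the execution change is reported, so this
# computes the execution component's contribution on its own.
execution_drops = {
    "Claude Opus 4.7": -27.2,
    "Claude Sonnet 4.6": -25.0,
    "Qwen3 Max": -12.7,
    "Gemini 2.5 Pro": -21.9,
}
execution_only = {
    name: main_score_change(drop, 0.0) for name, drop in execution_drops.items()
}

print(f"文心一言 4.5 main-score change: {ernie_change:.2f}")
for name, change in execution_only.items():
    print(f"{name}: execution contribution {change:.2f}")
```

For 文心一言 4.5 the two parts are 0.55 × 14.6 = 8.03 and 0.45 × 20.2 = 9.09, which sum to the reported 17.1-point decline. For the other four models the execution contribution alone is somewhat smaller than the reported main-score drop (for example 14.96 of Claude Opus 4.7's 16 points); the remainder would have to come from the constraints component or from rounding, and the article does not report those constraint changes.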

Anomaly Signal Analysis

Both Claude models saw their execution scores drop by 25 points or more on the same day, while their constraint scores stood at 97.7 and 91.7, indicating that material constraint capability was largely unaffected and that the issue is concentrated in code execution consistency.

Gemini 2.5 Pro's constraint score stood at 96.6, on par with Gemini 3.1 Pro, but its execution score was only 53.1. That 46.9-point execution gap alone leaves its main score 25.79 points below Gemini 3.1 Pro's (0.55 × 46.9), exposing a weakness in execution.

DeepSeek V4 Pro is the only model with an execution score below 95 that still made it into the top three; its constraint score of 98.9 offset the execution gap.

Today's data shows that the top two in the main ranking are the models with an execution score of 100, while all models with execution scores below 75 fell out of the top five. DeepSeek V4 Pro and GPT-o3, both with constraint scores of 98.9, ranked third and fourth respectively, suggesting that a high constraint score can provide a ranking buffer for models with moderate execution scores.
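The ordering described above can be reproduced from the scores quoted in this article. The sketch below is only a partial check under stated assumptions: it covers the nine models for which both component scores appear in the text (Qwen3 Max's constraint score and the eleventh model are not given, so they are left out), and all names in the code are illustrative.

```python
# (execution, constraints) pairs as quoted in the article; nine of 11 models.
scores = {
    "Gemini 3.1 Pro": (100.0, 96.6),
    "Grok 4": (100.0, 95.5),
    "DeepSeek V4 Pro": (94.8, 98.9),
    "GPT-o3": (83.3, 98.9),
    "GPT-5.5": (83.3, 94.3),
    "Claude Opus 4.7": (72.8, 97.7),
    "Claude Sonnet 4.6": (75.0, 91.7),
    "Gemini 2.5 Pro": (53.1, 96.6),
    "文心一言 4.5": (75.0, 66.3),
}


def core_overall(execution: float, constraints: float) -> float:
    """Main score: 0.55 x Code Execution + 0.45 x Material Constraints."""
    return 0.55 * execution + 0.45 * constraints


# Sort model names by main score, highest first.
ranked = sorted(scores, key=lambda name: core_overall(*scores[name]), reverse=True)

for position, name in enumerate(ranked, start=1):
    execution, constraints = scores[name]
    print(f"{position}. {name}: {core_overall(execution, constraints):.3f} "
          f"(execution {execution}, constraints {constraints})")
```

Among these nine models, the two with perfect execution come first, followed by DeepSeek V4 Pro and GPT-o3, in line with the ranks stated above. Positions further down the list may shift once the two omitted models are included.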

Execution volatility is reshaping the Smoke Lite leaderboard, providing buffer space for models with stable constraint scores.

The next Smoke evaluation will verify whether these execution score declines persist.

Data source: YZ Index | Run #205
