Gemini 3.1 Pro Leads with 96.96 Points, Claude Opus 4.7 Only 0.13 Behind

Jun 12, 2026 | Source: Winzheng Index

Gemini 3.1 Pro, Material Constraints, Smoke, Lightweight Evaluation, Code Execution Gap, Model Stability

Today's Smoke quick-test results show Gemini 3.1 Pro in first place with a core_overall score of 96.96, followed closely by Claude Opus 4.7 at 96.83, a gap of only 0.13 points.

Extreme Proximity Among Top Models

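The top two both scored 97.5 in code execution. On material grounding, Gemini 3.1 Pro scored 96.3, while Claude Opus 4.7 scored 96. Under the weighted formula core_overall = 0.55×Execution + 0.45×Grounding, the tied execution scores cancel out, so the 0.3-point grounding difference alone decides the ranking: it contributes 0.45 × 0.3 = 0.135 weighted points, which appears as 0.13 between the rounded overall scores.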
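As a quick check, the weighted formula can be applied to the reported dimension scores. The sketch below is illustrative only: the function name `core_overall` is borrowed from the score label in the article, and rounding half-up to two decimals is an assumption, since the article does not state how the index rounds.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the formula quoted in the article.
W_EXECUTION = Decimal("0.55")
W_GROUNDING = Decimal("0.45")


def core_overall(execution: str, grounding: str) -> Decimal:
    """Weighted score rounded to two decimals.

    Half-up rounding is an assumption; the article does not specify it.
    """
    raw = W_EXECUTION * Decimal(execution) + W_GROUNDING * Decimal(grounding)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Dimension scores (execution, grounding) as reported in the article.
reported = {
    "Gemini 3.1 Pro": ("97.5", "96.3"),
    "Claude Opus 4.7": ("97.5", "96"),
}

for model, (execution, grounding) in reported.items():
    print(model, core_overall(execution, grounding))
```

Under these assumptions the computed values match the published 96.96 and 96.83. `Decimal` is used instead of floats because Claude Opus 4.7's unrounded score is exactly 96.825, a tie that binary floating point could round either way.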

Such a tiny gap indicates that top models have entered a stage of "competition at the same level" on these two core dimensions.

Obvious Shortcomings of GPT-5.5

GPT-5.5 scored 97 in execution, ranking third on that dimension, but a material grounding score of only 86.3 dropped it to fifth place overall. Its deficit of about 10 points in grounding suggests that its control over citing original materials and avoiding hallucinations is still weaker than that of Gemini and Claude.

In contrast, Grok 4 scored 96 in execution and 93.8 in grounding, with an overall score of 95.01, maintaining relative balance.

Execution Bottleneck for Mid-Tier Models

DeepSeek V4 Pro, Qwen3 Max, and Gemini 2.5 Pro all scored 65 or below in execution, more than 30 points behind the top score of 97.5. Qwen3 Max scored 94.8 in grounding, higher even than GPT-5.5, but its execution score of 55 pulled it far behind.
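To see why an execution shortfall costs more than a grounding shortfall, each deficit can be weighted by the formula's coefficients. The following is a minimal sketch: `weighted_deficit` is a hypothetical helper, and the reference values are the best per-dimension scores mentioned in the article (97.5 and 96.3). The resulting figures are derived arithmetic, not numbers published by the index.

```python
from decimal import Decimal

# Weights from the article's formula.
W_EXECUTION = Decimal("0.55")
W_GROUNDING = Decimal("0.45")

# Best per-dimension scores mentioned in the article (assumed to be the top values).
TOP_EXECUTION = Decimal("97.5")
TOP_GROUNDING = Decimal("96.3")


def weighted_deficit(execution: str, grounding: str) -> tuple[Decimal, Decimal]:
    """Overall points lost to the top score on (execution, grounding)."""
    lost_execution = W_EXECUTION * (TOP_EXECUTION - Decimal(execution))
    lost_grounding = W_GROUNDING * (TOP_GROUNDING - Decimal(grounding))
    return lost_execution, lost_grounding


# Dimension scores as reported in the article.
for model, execution, grounding in [
    ("GPT-5.5", "97", "86.3"),
    ("Qwen3 Max", "55", "94.8"),
]:
    lost_execution, lost_grounding = weighted_deficit(execution, grounding)
    print(f"{model}: execution -{lost_execution}, grounding -{lost_grounding}")
```

Under these assumptions, GPT-5.5 gives up 4.5 overall points through grounding and under 0.3 through execution, while Qwen3 Max gives up more than 23 points through execution and under 0.7 through grounding.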

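This is again consistent with the view that current Chinese models still have systematic shortcomings in code execution tasks, although Gemini 2.5 Pro's presence in the same group shows that the bottleneck is not limited to them.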

Note that today marks the first run under the v6.3 judging configuration, so scores are not directly comparable with earlier results; day-over-day comparisons will resume in subsequent runs under the same configuration.

When both execution and grounding scores are near perfect, a gap of 0.13 points is less likely to be a coincidence and more likely to reflect a real difference in the models' control over material boundaries.

Data source: YZ Index | Run #165
