Smoke's quick test results today show that Gemini 3.1 Pro ranks first with a core_overall score of 96.96, followed closely by Claude Opus 4.7 with 96.83, a gap of only 0.13 points.
Extreme Proximity Among Top Models
The top two both scored 97.5 in code execution. On material grounding, Gemini 3.1 Pro scored 96.3, while Claude Opus 4.7 scored 96. With execution tied, the weighted formula core_overall = 0.55 × Execution + 0.45 × Grounding means that the 0.3-point difference in grounding alone determines the final ranking.
A gap this narrow, 0.13 points, has appeared for the first time across consecutive days of testing, suggesting that the top models have entered a stage of "competition at the same level."
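As a quick sanity check, the weighted formula can be reproduced in a few lines of Python. This is a minimal sketch: the 0.55 and 0.45 weights and the four scores come from the figures quoted above, while the function and variable names are illustrative and not part of the YZ Index tooling.

```python
# Weights as quoted in the article; all names here are illustrative.
W_EXECUTION = 0.55
W_GROUNDING = 0.45


def core_overall(execution: float, grounding: float) -> float:
    """Weighted score: 0.55 x execution + 0.45 x grounding."""
    return W_EXECUTION * execution + W_GROUNDING * grounding


# Execution and grounding scores reported for the top two models.
gemini = core_overall(97.5, 96.3)
claude = core_overall(97.5, 96.0)

print(f"Gemini 3.1 Pro:  {gemini:.3f}")
print(f"Claude Opus 4.7: {claude:.3f}")
print(f"Unrounded gap:   {gemini - claude:.3f}")
```

Because execution is tied, the unrounded gap is simply 0.45 × 0.3 = 0.135. The reported 0.13 matches the difference between the two published, rounded scores (96.96 and 96.83).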
Obvious Shortcomings of GPT-5.5
GPT-5.5 scored 97 in execution, third in that dimension, but its material grounding score of only 86.3 dropped it to fifth place overall. The nearly 10-point deficit in grounding suggests that its control over citing original materials and avoiding hallucinations is still weaker than that of Gemini and Claude.
In contrast, Grok 4 scored 96 in execution and 93.8 in grounding, with an overall score of 95.01, maintaining relative balance.
Execution Bottleneck for Mid-Tier Models
DeepSeek V4 Pro, Qwen3 Max, and Gemini 2.5 Pro all scored below 65 in execution, a gap of more than 30 points from the top. Qwen3 Max scored 94.8 in grounding, even higher than GPT-5.5, but was pulled far behind due to its execution score of 55.
This again suggests that the Chinese models in this run, DeepSeek V4 Pro and Qwen3 Max, still have systematic shortcomings in code execution tasks, a weakness that Gemini 2.5 Pro shares here.
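To see how much each dimension costs a model under the 0.55/0.45 weighting, the gap to the leader can be split into an execution part and a grounding part. The sketch below uses only the per-dimension scores quoted in this article. The helper name gap_breakdown is hypothetical, and the point losses it prints are derived from the published formula rather than taken from the YZ Index report.

```python
# Weights as quoted in the article; names below are illustrative.
W_EXECUTION, W_GROUNDING = 0.55, 0.45

# (execution, grounding) of the leader, Gemini 3.1 Pro.
LEADER = (97.5, 96.3)

# (execution, grounding) scores quoted in the article.
models = {
    "GPT-5.5": (97.0, 86.3),
    "Grok 4": (96.0, 93.8),
    "Qwen3 Max": (55.0, 94.8),
}


def gap_breakdown(execution, grounding, leader=LEADER):
    """Return (points lost to execution, points lost to grounding) vs. the leader."""
    exec_loss = W_EXECUTION * (leader[0] - execution)
    ground_loss = W_GROUNDING * (leader[1] - grounding)
    return exec_loss, ground_loss


for name, (execution, grounding) in models.items():
    exec_loss, ground_loss = gap_breakdown(execution, grounding)
    print(f"{name:10s} execution: -{exec_loss:.2f}  grounding: -{ground_loss:.2f}")
```

Under these assumptions, almost all of GPT-5.5's deficit comes from grounding, while Qwen3 Max loses most of its points on execution despite a strong grounding score.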
Compared with yesterday, no significant changes were observed in any model today, with no abnormal fluctuations in the stability dimension.
When both execution and grounding scores are near perfect, a gap of 0.13 points is no longer a coincidence but a real difference in each model's control over material boundaries.
Data source: YZ Index | Run #165
© 2026 Winzheng.com 赢政天下 | Reprints must credit the source and include a link to the original article