Smoke Review: 10 Models Score Full Marks in Code Execution, Grounding Gaps Decide the Ranking

In today's Smoke lightweight review of 11 models, the code execution dimension saw a rare wave of perfect scores. The top 9 models all scored a perfect 100 on execution, so their ranking was determined entirely by grounding. Claude Sonnet 4.6 ranked first with a total score of 97.98, on a grounding score of 95.5.

Perfect Execution Becomes Standard, Grounding Decides Victory

The formula core_overall = 0.55×execution + 0.45×grounding guarantees any model with a perfect execution score of 100 a base of 55 points. The remaining 45 points depend entirely on grounding. Doubao Pro (豆包 Pro) scored 94.3 in grounding for a total of 97.44, ranking second; Grok 4 scored 93.5 in grounding, ranking third. Gemini 2.5 Pro and Claude Opus 4.7 also kept their grounding scores above 91.8.
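The weighting can be checked with a few lines of Python. This is a minimal sketch, not the YZ Index implementation: the function signature, the use of `Decimal`, and the half-up rounding to two decimals are assumptions, chosen because they reproduce the two totals published above.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the formula quoted in the article.
W_EXEC = Decimal("0.55")
W_GROUND = Decimal("0.45")

def core_overall(execution: str, grounding: str) -> Decimal:
    """Weighted total, rounded half-up to two decimals (rounding mode is assumed)."""
    raw = W_EXEC * Decimal(execution) + W_GROUND * Decimal(grounding)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# First- and second-ranked models: execution 100, grounding 95.5 and 94.3.
print(core_overall("100", "95.5"))  # 97.98
print(core_overall("100", "94.3"))  # 97.44

# A perfect execution score with zero grounding yields the 55-point base.
print(core_overall("100", "0"))     # 55.00
```

Decimal arithmetic is used here because both published totals sit exactly on a rounding boundary (97.975 and 97.435), where binary floating point could round either way.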

At the lower end, ERNIE Bot 4.5 (文心一言 4.5) scored only 50 on execution, which dragged its total down to 58.69. Qwen3 Max, despite a perfect execution score, ranked 10th with a grounding score of 73.5 and an integrity rating of "fail".
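The same formula can be inverted to see what grounding score a published total implies. The sketch below is illustrative only: `implied_grounding` is a hypothetical helper, and it assumes the total comes from the 0.55/0.45 weighting alone, with no separate penalty (for example, for integrity) applied.

```python
def implied_grounding(total: float, execution: float) -> float:
    """Solve total = 0.55*execution + 0.45*grounding for grounding."""
    return (total - 0.55 * execution) / 0.45

# Lowest-ranked model: execution 50, total 58.69.
# The execution term contributes 27.5 points, leaving about 31.19 for grounding.
print(round(implied_grounding(58.69, 50), 1))  # 69.3
```

Under that assumption, the low total is driven mostly by the execution term: a perfect execution score would have added another 27.5 points.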

No Drastic Fluctuations, Industry Enters a Stable Period

Compared with yesterday, every model's score moved by no more than 0.3 points, and no anomalies were flagged. This suggests that current mainstream models have settled into relatively fixed capability boundaries in the 10-question quick-test scenario. After the iterations of the past six months, code execution has become a baseline capability for most models, while grounding still shows clear stratification.

Notably, GPT-5.5 and GPT-o3 have grounding scores of 82.3 and 65 respectively, a gap of 17.3 points, which indicates that OpenAI's own models still differ widely on grounding and have considerable room to improve.

Grounding Becomes the Core Battlefield of the Next Stage

Under the scoring formula, every 1-point increase in grounding raises a model's overall score by 0.45 points. Claude Sonnet 4.6, with its high grounding score of 95.5, leads the 6th-ranked Gemini 3.1 Pro by nearly 5.4 points. If no new models are released in the coming week, the ranking will likely hold its current pattern.
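The 0.45 sensitivity follows directly from the grounding weight, and it can be used to convert between grounding gaps and total-score gaps. A minimal sketch, assuming both models being compared have the same execution score and that the quoted 5.4-point lead refers to the overall score; the helper names are hypothetical.

```python
from decimal import Decimal

W_GROUND = Decimal("0.45")  # grounding weight from the scoring formula

def total_delta(grounding_delta: str) -> Decimal:
    """Change in overall score caused by a change in grounding score."""
    return W_GROUND * Decimal(grounding_delta)

def grounding_gap_for(total_gap: str) -> Decimal:
    """Grounding gap needed to produce a given overall-score gap."""
    return Decimal(total_gap) / W_GROUND

print(total_delta("1"))          # 0.45
print(grounding_gap_for("5.4"))
```

Read this way, a 5.4-point lead in overall score corresponds to a grounding gap of about 12 points between two models with equal execution scores.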

Code execution has become table stakes; grounding is the real dividing line.

Data source: YZ Index | Run #158