Smoke Review: 10 of 11 Models Score Full Marks in Code Execution, Grounding Gap Widens Ranking

Jun 11, 2026 | Source: Winzheng Index

Tags: Material Constraints, Code Execution, Claude Sonnet 4.6, Doubao Pro, Smoke, Lightweight Evaluation

In today's Smoke lightweight review of 11 models, the code execution dimension produced a rare wave of perfect scores. The top 9 models all scored a perfect 100 on execution, so their ranking was decided entirely by grounding. Claude Sonnet 4.6 ranked first with a total score of 97.98 and a grounding score of 95.5.

Perfect Execution Becomes Standard, Grounding Decides Victory

Under the formula core_overall = 0.55×execution + 0.45×grounding, a model with a perfect execution score of 100 earns 55 points from execution alone; the remaining 45 points depend entirely on grounding. Doubao Pro scored 94.3 in grounding for a total of 97.44, ranking second; Grok 4 scored 93.5 in grounding, ranking third. Gemini 2.5 Pro and Claude Opus 4.7 also kept grounding scores above 91.8.
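The weighting can be checked directly against the published totals. The following is a minimal sketch, assuming the formula is applied exactly as quoted and that the published totals are rounded to two decimals; the function name and the dictionary layout are illustrative, not part of the index's own tooling.

```python
# Weights taken from the formula quoted in the article.
EXECUTION_WEIGHT = 0.55
GROUNDING_WEIGHT = 0.45


def core_overall(execution: float, grounding: float) -> float:
    """Weighted total on a 0-100 scale (illustrative helper)."""
    return EXECUTION_WEIGHT * execution + GROUNDING_WEIGHT * grounding


# (execution, grounding) pairs as reported in the article.
reported = {
    "Claude Sonnet 4.6": (100, 95.5),
    "Doubao Pro": (100, 94.3),
}

totals = {name: core_overall(e, g) for name, (e, g) in reported.items()}

for name, total in totals.items():
    # Printed to three decimals to show the unrounded value.
    print(f"{name}: {total:.3f}")
```

The unrounded values come out near 97.975 and 97.435, which agree with the published 97.98 and 97.44 after rounding.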

Looking at the lower end, ERNIE Bot 4.5 had an execution score of only 50, dragging its total to 58.69. Qwen3 Max, despite a perfect execution score, had a grounding score of 73.5 and an integrity rating of "fail", ranking 10th.
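The article gives ERNIE Bot 4.5's execution score and total but not its grounding score, and gives Qwen3 Max's two component scores but not its total. Both missing figures can be estimated by rearranging the same formula. This is a rough sketch: the derived numbers are not reported by the index, and it assumes the integrity rating does not alter the numeric total.

```python
EXECUTION_WEIGHT = 0.55
GROUNDING_WEIGHT = 0.45


def implied_grounding(total: float, execution: float) -> float:
    """Solve core_overall = 0.55*execution + 0.45*grounding for grounding."""
    return (total - EXECUTION_WEIGHT * execution) / GROUNDING_WEIGHT


# ERNIE Bot 4.5: execution 50 and total 58.69 are from the article.
ernie_grounding = implied_grounding(58.69, 50)

# Qwen3 Max: execution 100 and grounding 73.5 are from the article.
# Assumption: the "fail" integrity rating leaves this number unchanged.
qwen_total = EXECUTION_WEIGHT * 100 + GROUNDING_WEIGHT * 73.5

print(f"ERNIE Bot 4.5 implied grounding: {ernie_grounding:.2f}")
print(f"Qwen3 Max implied total: {qwen_total:.3f}")
```

Under these assumptions ERNIE Bot 4.5's grounding works out to roughly 69.3, so its low total comes mainly from the 27.5 execution points it lost rather than from unusually weak grounding.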

No Drastic Fluctuations, Industry Enters a Stable Period

Compared with yesterday, every model's score moved by no more than 0.3 points, and no abnormal signals appeared. This suggests that current mainstream models have settled into relatively fixed capability boundaries in the 10-question quick test scenario. After six months of iteration, code execution has become a baseline capability for most models, while grounding still shows significant stratification.

Notably, GPT-5.5 and GPT-o3 have grounding scores of 82.3 and 65 respectively, a gap of 17.3 points, which indicates that OpenAI's own models still differ widely on grounding and have significant room to improve there.

Grounding Becomes the Core Battlefield of the Next Stage

Under the scoring formula, every 1-point increase in grounding raises the overall score by 0.45 points when execution is held fixed. Claude Sonnet 4.6, with its grounding score of 95.5, leads the 6th-ranked Gemini 3.1 Pro by nearly 5.4 points overall. If no new models are released in the coming week, the ranking will likely hold its current pattern.
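Because both models sit in the top 9 and therefore share an execution score of 100, the overall gap can be converted back into a grounding gap. The sketch below assumes the "nearly 5.4 points" lead is close to exact; the resulting Gemini 3.1 Pro grounding figure is an inference, not a number reported by the index.

```python
GROUNDING_WEIGHT = 0.45


def grounding_gap_for(overall_gap: float) -> float:
    """Grounding gap that yields a given overall gap at equal execution."""
    return overall_gap / GROUNDING_WEIGHT


# Figures from the article: Claude Sonnet 4.6 grounding and its overall lead.
sonnet_grounding = 95.5
overall_lead = 5.4  # described in the article as "nearly 5.4"

implied_grounding_gap = grounding_gap_for(overall_lead)
# Inferred, not reported: Gemini 3.1 Pro's approximate grounding score.
implied_gemini_grounding = sonnet_grounding - implied_grounding_gap

print(f"Implied grounding gap: {implied_grounding_gap:.1f}")
print(f"Implied Gemini 3.1 Pro grounding: {implied_gemini_grounding:.1f}")
```

On these assumptions a lead of about 5.4 overall points corresponds to about 12 grounding points, which shows how quickly grounding differences spread the ranking once execution is saturated.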

Code execution has become the baseline; grounding is the real dividing line.

Data source: YZ Index | Run #158
