Claude Sonnet 4.6 Takes Commanding Lead with 91.77 on Main Leaderboard, GPT-o3 Trails with Execution Score of 50

May 29, 2026 | Winzheng Index

Tags: Claude Sonnet 4.6, Material Constraints, Smoke Lite Test, Execution Dimension, Integrity Rating

Smoke Lite benchmark data released this morning puts Claude Sonnet 4.6 firmly in first place on the main leaderboard with 91.77 points, built on a perfect 100 in code execution and 81.7 in material constraints. The lead comes mainly from the material constraints dimension, where it beats second-place Claude Opus 4.7 by 2.3 points.

Perfect Scores in Execution Dimension, GPT-o3 the Only Underperformer

Ten of the 11 models scored 100 in code execution; only GPT-o3 was stuck at 50. This directly produced its main leaderboard score of just 62.83, placing it last. The scoring formula gives execution a weight of 0.55, so GPT-o3 lost 27.5 points (0.55 × 50) on this dimension, far more than any advantage in material constraints could offset.
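The arithmetic behind these figures can be sketched as follows. Only the execution weight of 0.55 comes from the published formula; the material constraints weight of 0.45 is an assumption (that the two weights sum to 1), though it does reproduce Claude Sonnet 4.6's published 91.77 to within rounding. The function names are illustrative, not part of the index.

```python
EXECUTION_WEIGHT = 0.55                 # stated in the article
MATERIAL_WEIGHT = 1 - EXECUTION_WEIGHT  # assumption: the two weights sum to 1


def main_score(execution: float, material: float) -> float:
    """Weighted main leaderboard score under the assumed two-term formula."""
    return EXECUTION_WEIGHT * execution + MATERIAL_WEIGHT * material


def implied_material(main: float, execution: float) -> float:
    """Back out the material constraints score from a published main score."""
    return (main - EXECUTION_WEIGHT * execution) / MATERIAL_WEIGHT


# Claude Sonnet 4.6: execution 100, material constraints 81.7
sonnet = main_score(100, 81.7)

# Points GPT-o3 gave up by scoring 50 instead of 100 in execution
execution_penalty = EXECUTION_WEIGHT * (100 - 50)

# Material constraints score implied by GPT-o3's published 62.83
gpt_o3_material = implied_material(62.83, 50)

print(f"Sonnet main score:       {sonnet:.3f}")
print(f"GPT-o3 execution loss:   {execution_penalty:.1f}")
print(f"GPT-o3 implied material: {gpt_o3_material:.2f}")
```

If the 0.45 assumption holds, GPT-o3's implied material constraints score is roughly 78.5, which would be competitive with the leaders; its last place would then come entirely from the execution shortfall.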

Material Constraints Determine True Ranking, Claude Duo Sweeps Top Two

The top five models all share a perfect execution score of 100, making material constraints the sole differentiator. Claude Sonnet 4.6's 81.7 puts it 4.2 points ahead of Gemini 3.1 Pro and Grok 4, which tie for third place with 77.5 each in material constraints, indicating a quantifiable gap in constraint adherence.

Sharp Drops Yesterday and Surges Today: Clear Signs of Model Iteration

Claude Opus 4.7's main leaderboard score rose 61.3 points from yesterday, and Qwen3 Max rose 57.4 points. Together with Grok 4, whose execution score jumped from 80 to 100, this suggests some models may have received targeted fine-tuning or prompt-engineering optimizations overnight. However, single-day swings of around 60 points also indicate that the 10-question Smoke quick test is highly sensitive to its small sample size.
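A rough sketch of that sensitivity, under stated assumptions: execution is treated here as 10 equally weighted pass/fail questions. That is a guess, consistent with the reported execution scores of 50, 80, and 100 all being multiples of 10 but not confirmed by the index. Only the 0.55 weight and the 10-question count come from the article, and the helper names are hypothetical.

```python
EXECUTION_WEIGHT = 0.55  # stated in the article
NUM_QUESTIONS = 10       # the Smoke quick test uses 10 questions


def execution_score(passed: int, total: int = NUM_QUESTIONS) -> float:
    """Assumed scoring: percentage of equally weighted questions passed."""
    return 100 * passed / total


def main_score_shift(passed_before: int, passed_after: int) -> float:
    """Change in main score caused only by a change in execution passes."""
    delta = execution_score(passed_after) - execution_score(passed_before)
    return EXECUTION_WEIGHT * delta


# Effect of a single execution question flipping from fail to pass
one_question = main_score_shift(9, 10)

# Grok 4's reported execution jump from 80 to 100, read as 8 -> 10 passes
grok_jump = main_score_shift(8, 10)

print(f"One question flipped: {one_question:.1f} main-score points")
print(f"Grok 4 (80 -> 100):   {grok_jump:.1f} main-score points")
```

Under these assumptions a single flipped execution question moves the main score by about 5.5 points, more than the gaps separating the top models, so day-to-day rank changes on a 10-question run should be read with caution.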

Integrity Rating Becomes Key Risk: Only Three Models Maintain Pass

Today, only Gemini 3.1 Pro, GPT-5.5, and GPT-o3 received a pass integrity rating. Of the remaining eight models, seven received a warn, and DeepSeek V4 Pro dropped from warn to fail. Claude Sonnet 4.6, despite its 81.7 in material constraints, received only a warn, indicating that even high-scoring models still carry risks in citation accuracy and instruction following.

Overall, code execution has plateaued, while material constraints remain the main battleground. DeepSeek's fail integrity rating is worth monitoring; if the rating does not recover in the next round, it could affect the model's adoption in enterprise scenarios.

A perfect execution score has become standard; the 81.7 in material constraints is Claude's true moat.

Data source: YZ Index | Run #137
