Smoke Evaluation: Claude Sonnet 4.6 Leads with 99.78 Points, GPT Series Stuck at 74 Points

Jun 1, 2026 | Winzheng Index

Tags: Claude Sonnet 4.6, Material Constraints, Smoke Test, Mainboard Ranking, Code Execution

The Smoke lightweight evaluation ran a 10-question quick test on 11 mainstream models at 3:00 AM today. The mainboard score is computed as 0.55 × code execution + 0.45 × material constraints, and the results once again show how polarized current AI capabilities are.
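The weighting can be written out as a short sketch. The function name, the two-decimal rounding, and the sample inputs are assumptions for illustration; the source states only the two weights.

```python
# Weights taken from the mainboard formula quoted in the article.
W_EXEC = 0.55
W_CONSTRAINT = 0.45

def mainboard_score(execution: float, constraints: float) -> float:
    """Weighted mainboard score; the name and rounding are illustrative assumptions."""
    return round(W_EXEC * execution + W_CONSTRAINT * constraints, 2)

# Hypothetical inputs (not from the article): full execution, 90 on constraints.
print(mainboard_score(100, 90))  # 95.5
```

Because execution carries the larger weight, a model with a perfect execution score starts from 55 points, and every point lost on material constraints costs 0.45 points on the mainboard.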

Top three achieve perfect execution scores, material constraints create gaps

Claude Sonnet 4.6 ranks first with 99.78 points, scoring 100 in code execution and 99.5 in material constraints. DeepSeek V4 Pro and Gemini 3.1 Pro tie for second at 99.24 points; both also score full marks in execution but 98.3 in material constraints. The gap between first and second place is only 0.54 points, indicating that top-tier models are now very close in code generation and factual constraint capabilities.

Doubao Pro scores 94.96 on the mainboard, with 100 in execution but 88.8 in constraints, revealing a clear weakness in scenarios that require strict citation of source material.
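As a sanity check, the reported mainboard scores for these four models can be recomputed from their component scores. This is a minimal sketch: the helper names and the 0.01 tolerance (to absorb two-decimal rounding) are assumptions, while the component and mainboard figures are the ones reported above.

```python
# Weights from the article's mainboard formula.
W_EXEC, W_CONSTRAINT = 0.55, 0.45

# (execution, material constraints, reported mainboard score) from the article.
rows = {
    "Claude Sonnet 4.6": (100, 99.5, 99.78),
    "DeepSeek V4 Pro": (100, 98.3, 99.24),
    "Gemini 3.1 Pro": (100, 98.3, 99.24),
    "Doubao Pro": (100, 88.8, 94.96),
}

def weighted(execution, constraints):
    """Unrounded weighted score."""
    return W_EXEC * execution + W_CONSTRAINT * constraints

def matches_report(execution, constraints, reported, tol=0.01):
    """True if the recomputed score agrees with the reported one within tol."""
    return abs(weighted(execution, constraints) - reported) <= tol

for name, (e, c, reported) in rows.items():
    ok = matches_report(e, c, reported)
    print(f"{name:18s} computed={weighted(e, c):.3f} reported={reported} ok={ok}")
```

All four reported scores agree with the formula to within rounding, so for these models the mainboard score appears to be the plain weighted sum.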

Seven models stuck at 74 points on mainboard; constraint scores become a hard bottleneck

GPT-5.5, GPT-o3, Grok 4, and Qwen3 Max all achieve 100 in execution, but their material constraint scores are 75, 64.5, 97, and 73.3 respectively, and their mainboard scores are all locked at 74 points. Grok 4 scores 97 on constraints, yet its mainboard score is still pulled down to 74 by an integrity rating of "fail," reflecting the evaluation's strict enforcement of the integrity threshold.
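The plain weighted formula does not produce 74 for any of these four models, which suggests the lock is an extra rule applied on top of it. The sketch below computes the unadjusted weighted values and then applies a hypothetical cap. The source does not describe how the lock is implemented, so `capped_score`, the `gated` flag, and the use of `min` are assumptions for illustration only.

```python
# Weights stated in the article's mainboard formula.
W_EXEC, W_CONSTRAINT = 0.55, 0.45
# The locked score reported in the article; treating it as a cap is an assumption.
ASSUMED_CAP = 74.0

# (execution, material constraints) as reported in the article.
reported = {
    "GPT-5.5": (100, 75),
    "GPT-o3": (100, 64.5),
    "Grok 4": (100, 97),
    "Qwen3 Max": (100, 73.3),
}

def raw_score(execution, constraints):
    """Weighted score before any cap is applied."""
    return W_EXEC * execution + W_CONSTRAINT * constraints

def capped_score(execution, constraints, gated):
    """Hypothetical rule: limit the score to ASSUMED_CAP when a model trips a gate."""
    raw = raw_score(execution, constraints)
    return min(raw, ASSUMED_CAP) if gated else raw

for name, (e, c) in reported.items():
    print(f"{name:10s} raw={raw_score(e, c):6.2f} capped={capped_score(e, c, True):5.2f}")
```

Under the weighted formula alone, all four models would land between roughly 84 and 99 points, so the 74-point result reflects the threshold rules and not the component scores themselves.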

ERNIE Bot 4.5 scores only 50 in execution, making it the only model without a perfect execution score. It ranks last on the mainboard at 66.43 points, exposing a shortcoming in coding capability.

No abnormal fluctuations; the landscape is becoming solidified

Compared to yesterday, all model scores remain unchanged. Multi-day data indicates that the current tiers have entered a stable period: the top three lead on near-perfect material constraint scores, the middle tier achieves full execution scores but is held in the 90-95 range by weaker constraint scores, and the lower-tier models struggle with integrity, execution, or both, making short-term breakthroughs difficult.

The 74-point score is not an execution issue, but a dual ceiling of material constraints and integrity.

The industry is shifting from "being able to write code" to "writing trustworthy code." The next phase of competition will focus on whether material constraints and integrity ratings can improve in tandem.

Data source: Winzheng Index (YZ Index) | Run #141
