Smoke Evaluation: Claude Sonnet 4.6 Leads with 99.78 Points, GPT Series Stuck at 74 Points

The Smoke lightweight evaluation ran a 10-question quick test on 11 mainstream models at 3:00 AM today. The mainboard score is computed as 0.55 × code execution + 0.45 × material constraints, and the results once again confirm the current polarization of AI capabilities.

Top three achieve perfect execution scores; material constraints create the gaps

Claude Sonnet 4.6 ranks first with 99.78 points, scoring 100 in code execution and 99.5 in material constraints. DeepSeek V4 Pro and Gemini 3.1 Pro tie for second with 99.24 points; both also score full marks in execution and 98.3 in material constraints. The gap among the top three is only 0.54 points, indicating that top-tier models are now nearly level in code generation and factual constraint capabilities.
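
The top-three figures can be checked against the 0.55/0.45 weighting. The following is a minimal Python sketch under stated assumptions: the function name `mainboard_score` and the half-up rounding to two decimals are illustrative choices, not the evaluation's published implementation.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the mainboard formula quoted in the article.
EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def mainboard_score(execution: str, constraint: str) -> Decimal:
    """Weighted mainboard score, rounded half-up to two decimals (assumed rounding)."""
    raw = EXECUTION_WEIGHT * Decimal(execution) + CONSTRAINT_WEIGHT * Decimal(constraint)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Scores as reported for the top three models.
first = mainboard_score("100", "99.5")    # Claude Sonnet 4.6
second = mainboard_score("100", "98.3")   # DeepSeek V4 Pro and Gemini 3.1 Pro

print(first, second, first - second)  # prints: 99.78 99.24 0.54
```

Decimal arithmetic is used here because the raw values (99.775 and 99.235) sit exactly on a rounding boundary, where binary floating point could round the wrong way.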

Doubao Pro (豆包Pro) scores 94.96 on the mainboard, with 100 in execution but 88.8 in constraints, revealing a clear weakness in scenarios that require strict material citation.

Seven models stuck at 74 points on mainboard; constraint scores become a hard bottleneck

GPT-5.5, GPT-o3, Grok 4, and Qwen3 Max all achieve 100 in execution, but their material constraint scores are 75, 64.5, 97, and 73.3 respectively, and their mainboard scores are all locked at 74 points, below what the weighted formula alone would give. Grok 4 posts a constraint score of 97, yet an integrity rating of "fail" drags its mainboard score down, reflecting the evaluation's strict enforcement of the integrity threshold.
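
To illustrate that 74 is a ceiling rather than the output of the weighting, the sketch below recomputes the plain weighted sum for the four named models and compares it with the reported mainboard score. The names `reported`, `weighted_sum`, and `REPORTED_MAINBOARD` are illustrative. How the evaluation actually applies the ceiling is not stated in the article, so the code only measures the gap and does not model the capping rule.

```python
# (execution, material constraints) as reported in the article.
reported = {
    "GPT-5.5": (100, 75.0),
    "GPT-o3": (100, 64.5),
    "Grok 4": (100, 97.0),
    "Qwen3 Max": (100, 73.3),
}

# Mainboard score the article reports for all four models.
REPORTED_MAINBOARD = 74.0


def weighted_sum(execution: float, constraint: float) -> float:
    """Plain 0.55/0.45 weighted sum, with no ceiling applied."""
    return 0.55 * execution + 0.45 * constraint


for model, (execution, constraint) in reported.items():
    uncapped = weighted_sum(execution, constraint)
    gap = uncapped - REPORTED_MAINBOARD
    print(f"{model}: weighted {uncapped:.2f}, reported {REPORTED_MAINBOARD:.2f}, gap {gap:.2f}")
```

Every weighted sum lands above 74 (GPT-5.5 would reach 88.75, for instance), which is consistent with the reported score being imposed by a ceiling.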

ERNIE Bot 4.5 (文心一言4.5) scores only 50 in execution, making it the only model without a perfect execution score. It ranks last on the mainboard at 66.43 points, exposing a shortcoming in coding capability.

No abnormal fluctuations; the landscape is solidifying

Compared to yesterday, all model scores remain unchanged. Multi-day data indicates that the tiers have stabilized: the top three lead on near-perfect material constraint scores, the middle tier achieves full execution scores but is held in the 90-95 range by weaker constraint scores, and the lower-tier models are held back by integrity failures or weak execution, making short-term breakthroughs difficult.

The 74-point score reflects not an execution problem but a dual ceiling imposed by material constraints and integrity.

The industry is shifting from "being able to write code" to "writing trustworthy code." The next phase of competition will focus on whether material constraints and integrity ratings can improve in tandem.


Data source: Winzheng Index (YZ Index) | Run #141