Smoke Lite benchmark data released this morning shows Claude Sonnet 4.6 in first place on the main leaderboard with 91.77 points, scoring a perfect 100 in code execution and 81.7 in material constraints. Its lead comes from the material constraints dimension, where it outperforms second-place Claude Opus 4.7 by 2.3 points.
Perfect Scores in Execution Dimension, GPT-o3 the Only Underperformer
Ten of the 11 models scored 100 in code execution; only GPT-o3 was stuck at 50. That left it with a main leaderboard score of just 62.83, in last place. The scoring formula gives execution a weight of 0.55, so GPT-o3 lost 27.5 points (0.55 × 50) in this dimension, far more than any advantage in material constraints could recover.
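The weighting arithmetic can be checked with a short sketch. Only the 0.55 execution weight is stated in the data; the 0.45 material constraints weight is an assumption, inferred from the fact that 0.55 × 100 + 0.45 × 81.7 reproduces Claude Sonnet 4.6's 91.77. The function names are illustrative and not part of any YZ Index tooling.

```python
EXECUTION_WEIGHT = 0.55                  # stated in the leaderboard formula
MATERIAL_WEIGHT = 1 - EXECUTION_WEIGHT   # assumption: only two dimensions, weights sum to 1


def main_score(execution: float, material: float) -> float:
    """Weighted main leaderboard score under the assumed two-term formula."""
    return EXECUTION_WEIGHT * execution + MATERIAL_WEIGHT * material


def execution_penalty(execution: float, full_marks: float = 100.0) -> float:
    """Main-score points lost relative to a perfect execution score."""
    return EXECUTION_WEIGHT * (full_marks - execution)


# Claude Sonnet 4.6: execution 100, material constraints 81.7
sonnet_score = main_score(100, 81.7)

# GPT-o3: execution 50, so the loss is 0.55 * 50
o3_penalty = execution_penalty(50)

print(f"Sonnet 4.6 main score: {sonnet_score:.3f}")
print(f"GPT-o3 execution penalty: {o3_penalty:.1f}")
```

Under the same assumed weights, each point of material constraints is worth 0.45 on the main leaderboard, so a model would need to gain roughly 61 material points to offset a 50-point execution deficit.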
Material Constraints Determine True Ranking, Claude Duo Sweeps Top Two
All of the top five share a perfect execution score of 100, making material constraints the sole differentiator. Claude Sonnet 4.6's 81.7 points puts it 4.2 points ahead of Gemini 3.1 Pro and Grok 4, which tie for third place with 77.5 points each in material constraints, a quantifiable gap in constraint adherence.
Sharp Drops Yesterday and Surges Today: Clear Signs of Model Iteration
Claude Opus 4.7's main leaderboard score rose 61.3 points from yesterday, and Qwen3 Max rose 57.4 points. Together with Grok 4, whose execution score jumped from 80 to 100, this suggests some models may have undergone targeted fine-tuning or prompt engineering optimization overnight. However, single-day swings of around 60 points also suggest that the Smoke 10-question quick test is highly sensitive to its small sample size.
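To give a rough sense of how noisy a 10-question test can be, here is a minimal sketch. It assumes each question is graded pass/fail and that a dimension score is simply the percentage of questions passed; that is an illustrative assumption, not a description of how Smoke actually scores, and the helper names are hypothetical.

```python
import math


def points_per_question(n_questions: int, scale: float = 100.0) -> float:
    """Score change caused by a single question flipping between pass and fail."""
    return scale / n_questions


def standard_error_points(pass_rate: float, n_questions: int, scale: float = 100.0) -> float:
    """Binomial standard error of the pass rate, expressed in score points."""
    return scale * math.sqrt(pass_rate * (1 - pass_rate) / n_questions)


step = points_per_question(10)               # one flipped question out of 10
noise = standard_error_points(0.8, 10)       # model with a true 80% pass rate

print(f"One question is worth {step:.1f} points")
print(f"Standard error at an 80% pass rate: {noise:.2f} points")
```

Under these assumptions a single flipped question moves a score by 10 points, so a jump such as 80 to 100 corresponds to just two questions and could come from run-to-run variance as easily as from a model update.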
Integrity Rating Becomes Key Risk: Only Three Models Maintain Pass
Today, only Gemini 3.1 Pro, GPT-5.5, and GPT-o3 received a pass integrity rating. Of the remaining eight models, seven received a warn, and DeepSeek V4 Pro dropped from warn to fail. Claude Sonnet 4.6, despite its 81.7 in material constraints, received only a warn, indicating that even high-scoring models still carry risks in citation accuracy and instruction following.
Overall, code execution scores have plateaued, while material constraints remain the main battleground. DeepSeek V4 Pro's fail integrity rating is worth monitoring; if it persists into the next round, it could affect the model's adoption in enterprise scenarios.
A perfect execution score has become standard; the 81.7 in material constraints is Claude's true moat.
Data source: YZ Index | Run #137
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article