Today's Smoke Lite evaluation results show Claude Opus 4.7, DeepSeek V4 Pro, and Qwen3 Max tied for first place on the main leaderboard at 88.75 points. All three scored a perfect 100 on code execution and 75 on material constraints. This breaks the previous Claude-dominated pattern: open models are closing in on the top closed-source tier at an accelerating pace.
Why did the Claude duo suddenly collapse?
The most striking anomaly is the collective decline of the Claude series. Claude Sonnet 4.6 dropped from 98.35 points yesterday to 86.05 today, a plunge of 12.3 points, with its material constraints score falling from 96.3 to 69. Claude Opus 4.7 likewise fell from 97.75 to 88.75 on the main leaderboard, a drop of 9 points. That both models lost material constraints points simultaneously in the same 10-question quick test strongly suggests a temporary adjustment to internal system prompts or safety policies, rather than a permanent degradation of model capability.
The logic behind DeepSeek and Grok's comeback
In contrast, DeepSeek V4 Pro surged from 54.65 to 88.75 points, a gain of 34.1 points. Its material constraints score rose from 14.7 to 75, and its integrity rating improved from warn to pass. This indicates that its low score yesterday may have been caused by a single-run anomaly, and today's performance is closer to its true ceiling. Grok 4 also skyrocketed from 48.45 to 86.05 points, a gain of 37.6 points, showing that xAI's rapid iteration on the material constraints module is already bearing fruit.
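As a sanity check on these swings, the day-over-day deltas can be recomputed directly from the before/after scores quoted in this article (the numbers below are copied from the text, not pulled from the YZ Index):

```python
# Day-over-day main-leaderboard deltas for the models discussed above.
# Scores are (yesterday, today) pairs as quoted in this article.
scores = {
    "Claude Sonnet 4.6": (98.35, 86.05),
    "Claude Opus 4.7":   (97.75, 88.75),
    "DeepSeek V4 Pro":   (54.65, 88.75),
    "Grok 4":            (48.45, 86.05),
}

for model, (yesterday, today) in scores.items():
    delta = round(today - yesterday, 2)
    print(f"{model}: {yesterday} -> {today} ({delta:+})")
```

Running the check gives −12.3 and −9.0 for the two Claude models and +34.1 and +37.6 for DeepSeek V4 Pro and Grok 4 respectively.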
Such dramatic fluctuations demonstrate the sensitivity of the Smoke evaluation: the 10-question quick test amplifies single-run variance, but also faithfully reflects that the current model iteration cycle has entered a "weekly update" phase, where any alignment or safety patch can cause severe score swings.
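The variance amplification can be made concrete with a back-of-the-envelope sketch: if each question contributes an independent score with a per-question spread of, say, sigma = 30 points (an assumed value for illustration, not measured from the Smoke data), the standard error of the mean shrinks only as 1/√n, so a 10-question run is inherently noisy:

```python
import math

# Illustrative only: sigma is an assumed per-question spread on a 0-100
# scale, and questions are assumed independent and identically distributed.
sigma = 30.0

for n in (10, 100, 1000):
    se = sigma / math.sqrt(n)  # standard error of the mean score
    print(f"n={n:4d} questions -> standard error ~ {se:.1f} points")
```

Under this assumption, a 10-question test carries a standard error of roughly 9.5 points, about the size of the single-day swings described above, while a 100-question run would cut it to about 3 points.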
Industry insight: Material constraints become the new battleground
All models on today's leaderboard scored a perfect 100 on code execution, and the core overall score differences are almost entirely determined by material constraints. This suggests that code capability has plateaued by 2026, and the next competitive focus has shifted to "material constraints"—the model's compliance and consistency under restricted instructions. GPT-5.5 and GPT-o3 remain at 84.03 points, with a material constraints score of only 64.5, trailing the top tier by more than 10 points. OpenAI's lag in this dimension has persisted for two weeks.
When material constraints become the key variable determining rankings, any model attempting to "gain points" through safety policy adjustments will pay the price in quick tests.
Today's drastic reshuffling of the leaderboard signals that the AI model competition in the second half of 2026 will shift from "who runs the fastest" to "who stays strongest under constraints." Claude's short-term decline is likely a necessary calibration for long-term stability, while the breakout of DeepSeek and Grok shows that the open-source and semi-open-source camps can now genuinely compete with the closed-source giants on equal footing.
Data source: YZ Index | Run #119
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article