Today's Smoke Lite evaluation results show Claude Opus 4.7, DeepSeek V4 Pro, and Qwen3 Max tied for first place on the main leaderboard at 88.75 points. All three scored a perfect 100 on code execution and 75 on material constraints. This breaks the previous Claude-dominated pattern: open models are closing in on the top closed-source tier at an accelerating pace.
Why did the Claude duo suddenly collapse?
The most striking anomaly is the collective decline of the Claude series. Claude Sonnet 4.6 dropped from 98.35 points yesterday to 86.05 today, a plunge of 12.3 points, with its material constraints score falling from 96.3 to 69. Claude Opus 4.7 likewise fell from 97.75 to 88.75 on the main leaderboard, a drop of 9 points. That both models lost material constraints points simultaneously in the same 10-question quick test strongly suggests a temporary adjustment to internal system prompts or safety policies, rather than a permanent degradation of model capability.
The logic behind DeepSeek and Grok's comeback
In contrast, DeepSeek V4 Pro surged from 54.65 to 88.75 points, a gain of 34.1 points. Its material constraints score rose from 14.7 to 75, and its integrity rating improved from warn to pass. This indicates that its low score yesterday may have been caused by a single-run anomaly, and today's performance is closer to its true ceiling. Grok 4 also skyrocketed from 48.45 to 86.05 points, a gain of 37.6 points, showing that xAI's rapid iteration on the material constraints module is already bearing fruit.
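As a sanity check on these swings, the day-over-day deltas can be recomputed directly from the before/after scores quoted in this article (the numbers below are copied from the text, not pulled from the YZ Index):

```python
# Day-over-day main-leaderboard deltas for the models discussed above.
# Scores are (yesterday, today) pairs as quoted in this article.
scores = {
    "Claude Sonnet 4.6": (98.35, 86.05),
    "Claude Opus 4.7":   (97.75, 88.75),
    "DeepSeek V4 Pro":   (54.65, 88.75),
    "Grok 4":            (48.45, 86.05),
}

for model, (yesterday, today) in scores.items():
    delta = round(today - yesterday, 2)
    print(f"{model}: {yesterday} -> {today} ({delta:+})")
```

Running the check gives −12.3 and −9.0 for the two Claude models and +34.1 and +37.6 for DeepSeek V4 Pro and Grok 4 respectively.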
Such dramatic fluctuations demonstrate the sensitivity of the Smoke evaluation: the 10-question quick test amplifies single-run variance, but also faithfully reflects that the current model iteration cycle has entered a "weekly update" phase, where any alignment or safety patch can cause severe score swings.
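The variance amplification can be made concrete with a back-of-the-envelope sketch: if each question contributes an independent score with a per-question spread of, say, sigma = 30 points (an assumed value for illustration, not measured from the Smoke data), the standard error of the mean shrinks only as 1/√n, so a 10-question run is inherently noisy:

```python
import math

# Illustrative only: sigma is an assumed per-question spread on a 0-100
# scale, and questions are assumed independent and identically distributed.
sigma = 30.0

for n in (10, 100, 1000):
    se = sigma / math.sqrt(n)  # standard error of the mean score
    print(f"n={n:4d} questions -> standard error ~ {se:.1f} points")
```

Under this assumption, a 10-question test carries a standard error of roughly 9.5 points, about the size of the single-day swings described above, while a 100-question run would cut it to about 3 points.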
Industry insight: Material constraints become the new battleground
All models on today's leaderboard scored a perfect 100 on code execution, and the core overall score differences are almost entirely determined by material constraints. This suggests that code capability has plateaued by 2026, and the next competitive focus has shifted to "material constraints"—the model's compliance and consistency under restricted instructions. GPT-5.5 and GPT-o3 remain at 84.03 points, with a material constraints score of only 64.5, trailing the top tier by more than 10 points. OpenAI's lag in this dimension has persisted for two weeks.
When material constraints become the key variable determining rankings, any model attempting to "gain points" through safety policy adjustments will pay the price in quick tests.
Today's drastic reshuffling of the leaderboard signals that the AI model competition in the second half of 2026 will shift from "who runs the fastest" to "who stays strongest under constraints." Claude's short-term decline is likely a necessary calibration for long-term stability, while the breakout of DeepSeek and Grok shows that the open-source and semi-open-source camps can now genuinely compete with the closed-source giants on equal footing.
Data source: YZ Index | Run #119
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article