Grok 4 Tops with 98.34 Points, Claude Opus Plunges 31.3 Points on Main Leaderboard

Today's results from Smoke, the 10-question quick test, exposed a wide gap in execution capability among models. Grok 4 ranked first with 98.34 points, scoring full marks in code execution and 96.3 points in material constraints. It received only a warn rating and was the most stable performer overall.

Why the Claude Series Collectively Stalled

Claude Opus 4.7 fell 31.3 points from yesterday's high on the main leaderboard, with code execution plunging 59.4 points to 38.1. Sonnet 4.6 also dropped 30.3 points, with its execution score decreasing by 47.5. Both models' integrity ratings changed from pass to warn, indicating issues with answer consistency.

The execution score carries a weight of 0.55 in the main leaderboard score, so any decline in code task performance drags the main score down sharply. Across today's 10 questions, the Claude series made obvious mistakes on problems requiring multi-step reasoning and tool calls, and the raw logs show multiple runs breaking off at intermediate steps.
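The weighting can be made concrete with a minimal sketch. The 0.55 execution weight comes from the text above. The 0.45 weight on material constraints is an assumption, not a published figure: it is chosen because a two-component formula with that weight approximately reproduces Grok 4's reported 98.34 from its component scores of 100 and 96.3. The function names are hypothetical.

```python
# Weighted main-leaderboard score, under an ASSUMED two-component formula.
EXECUTION_WEIGHT = 0.55                    # stated in the article
MATERIAL_WEIGHT = 1.0 - EXECUTION_WEIGHT   # assumption: the remaining weight


def main_score(execution: float, material: float) -> float:
    """Main score as a weighted sum of execution and material constraints."""
    return EXECUTION_WEIGHT * execution + MATERIAL_WEIGHT * material


def execution_contribution(execution_delta: float) -> float:
    """Main-score change attributable to the execution change alone."""
    return EXECUTION_WEIGHT * execution_delta


if __name__ == "__main__":
    # Grok 4: execution 100, material constraints 96.3 (reported main: 98.34)
    print(f"Grok 4, assumed formula: {main_score(100.0, 96.3):.3f}")
    # Claude Opus 4.7: execution fell 59.4 points
    print(f"Opus, execution drag:    {execution_contribution(-59.4):.2f}")
    # Sonnet 4.6: execution fell 47.5 points
    print(f"Sonnet, execution drag:  {execution_contribution(-47.5):.2f}")
```

If the assumed formula holds, the execution drop alone accounts for nearly all of each Claude model's reported main-leaderboard decline (31.3 and 30.3 points). The remainder would come from movement in the other component.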

Comparison: Full-Score Execution Camp vs. Decline Camp

Four models—Grok 4, Qwen3 Max, DeepSeek V4 Pro, and Gemini 2.5 Pro—achieved 100 points in code execution, occupying the top four positions. Among them, Gemini 2.5 Pro rose 15.9 points on the main leaderboard, mainly due to a rebound in execution score, but its material constraints dropped by 14 points, and its integrity rating changed from fail to warn.
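As a rough consistency check on the Gemini 2.5 Pro figures, the sketch below backs out the execution change implied by the reported main-score and material-constraints changes. It assumes the main score is a weighted sum of only two components, execution at 0.55 (stated by the source) and material constraints at 0.45 (an assumption). The function name is hypothetical, and the result is an estimate, not a reported number.

```python
# Back out the implied execution change from main-score and material changes.
EXECUTION_WEIGHT = 0.55   # stated in the article
MATERIAL_WEIGHT = 0.45    # assumption: the remaining weight


def implied_execution_delta(main_delta: float, material_delta: float) -> float:
    """Solve main_delta = 0.55 * exec_delta + 0.45 * material_delta for exec_delta."""
    return (main_delta - MATERIAL_WEIGHT * material_delta) / EXECUTION_WEIGHT


if __name__ == "__main__":
    # Gemini 2.5 Pro: main score +15.9, material constraints -14
    delta = implied_execution_delta(15.9, -14.0)
    print(f"Implied execution change: {delta:+.1f} points")
```

Under these assumptions the execution rebound works out to roughly 40 points, which is large enough to more than offset the material-constraints decline.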

Doubao (豆包) Pro, GPT-5.5, Gemini 3.1 Pro, and GPT-o3 all had execution scores of 66.7, down again from yesterday. ERNIE Bot (文心一言) 4.5 scored only 50 in execution, dropping another 42.5 points from the previous day; although its material constraints score reached 95, its weak execution left it in ninth place.

Real Signals Under Industry Dynamics

The Claude series' execution swings in a quick test like Smoke point to inconsistency on lightweight, limited-material tasks. Grok 4 and Qwen3 Max maintained a high completion rate under the same conditions, which suggests that their task-instruction parsing and code generation paths are more reliable.

When a model's execution score swings sharply for several consecutive days, it is worth checking whether the model has entered a sensitive window for version iteration. Today's data divides models into two clear categories: those that stably output runnable code, and those that repeatedly make errors on the same problems.

Once execution capability collapses, the main leaderboard ranking will be brutally rewritten.

Data source: YZ Index | Run #126