Grok 4 Tops with 97.44 Points, GPT-o3 Plunges 28 Points on Main Leaderboard

This morning's 10-question Smoke quick test put the execution weaknesses of AI models squarely in the spotlight. Grok 4 topped the leaderboard with 97.44 points (Execution: 100, Constraint: 94.3), with Gemini 3.1 Pro just 0.23 points behind, while GPT-o3's main leaderboard score plunged from 94.53 points yesterday to 66.43 points, a drop of 28.1 points.

Execution Weight 0.55, Coding Tasks Become the Lifeline

Under the core formula core_overall = 0.55 × Code Execution + 0.45 × Material Constraint, today's ranking was determined almost entirely by execution scores. Five models (Grok 4, Gemini 3.1 Pro, Gemini 2.5 Pro, DeepSeek V4 Pro, and Doubao Pro) all scored a perfect 100 on execution, while the remaining six all fell to 50. GPT-o3, GPT-5.5, Qwen3 Max, Wenxin Yiyan 4.5, Claude Opus 4.7, and Claude Sonnet 4.6 faltered on code execution at the same time, which suggests that today's test set included questions requiring multi-step reasoning and tool invocation.
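To make the weighting concrete, here is a minimal Python sketch of the formula applied to Grok 4's published sub-scores. The function name `core_overall` mirrors the formula; the use of `Decimal` and half-up rounding to two decimals are assumptions for illustration, since the article does not say how YZ Index rounds.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the article's formula.
EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def core_overall(execution: str, constraint: str) -> Decimal:
    """Weighted score rounded to two decimals (rounding mode is an assumption)."""
    raw = EXECUTION_WEIGHT * Decimal(execution) + CONSTRAINT_WEIGHT * Decimal(constraint)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Grok 4's sub-scores as reported in the article.
grok_4 = core_overall("100", "94.3")
print(grok_4)  # 97.44

# Overall points lost when execution falls from 100 to 50, constraint unchanged.
execution_swing = EXECUTION_WEIGHT * Decimal("50")
print(execution_swing)  # 27.50
```

The second figure shows why execution dominated the day: a halved execution score costs 27.5 overall points by itself, far more than the constraint differences that separate the top models.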

Why Did the Claude and GPT Series Collapse Together?

Claude Opus 4.7 and Sonnet 4.6 saw main leaderboard drops of 22.6 and 22.8 points respectively, with execution scores falling from 100 to 50 while material constraint scores rose slightly. Compared with yesterday's data, both models showed extreme volatility in execution, most likely because they refused to execute or generated incomplete code in response to new instructions or sandbox environment changes. GPT-o3 suffered the largest decline: its execution score halved and its material constraint score also slipped from yesterday's high, which points to simultaneous failure on both the coding and the factual side.
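As a rough check on the GPT-o3 numbers, the 28.1-point drop can be split into an execution part and a constraint part using the same 0.55 / 0.45 weights. The sketch below is illustrative only: `split_drop` is a hypothetical helper, and because the published scores are already rounded, the constraint change it derives is approximate.

```python
from decimal import Decimal

# Weights taken from the article's formula.
EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def split_drop(total_drop: str, execution_before: str, execution_after: str):
    """Split an overall-score drop into execution and constraint contributions."""
    execution_part = EXECUTION_WEIGHT * (Decimal(execution_before) - Decimal(execution_after))
    # Whatever the execution change does not explain is attributed to constraint.
    constraint_part = Decimal(total_drop) - execution_part
    # Convert leftover overall points back into constraint-score points.
    constraint_change = -constraint_part / CONSTRAINT_WEIGHT
    return execution_part, constraint_part, constraint_change


# GPT-o3: overall drop of 28.1 points, execution halved from 100 to 50.
exec_part, cons_part, cons_change = split_drop("28.1", "100", "50")
print(exec_part)              # 27.50
print(cons_part)              # 0.60
print(round(cons_change, 2))  # -1.33
```

Under these assumptions, almost all of GPT-o3's decline (27.5 of 28.1 points) comes from execution, with the constraint score slipping by only about 1.3 points.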

In contrast, Grok 4 and the Gemini series maintained perfect execution scores and kept material constraint above 92, showing that on today's 10 questions they could both write runnable code and stay strictly within material boundaries. DeepSeek V4 Pro ranked fourth, but its constraint score of only 86.2 leaves a clear gap to the top three. To break into the top three, it needs to raise constraint by another 7–8 points.

Anomalous Signals Behind Industry Trends

It is difficult to explain the collective halving of execution scores across six models today as mere "random fluctuation." A more likely explanation is that some vendors pushed safety or alignment updates in mid-May, which often make models more "cautious" but can directly harm code execution continuity. The simultaneous warning-to-pass changes in the integrity ratings of the Claude and GPT series are also consistent with model behavior having been recalibrated.

Doubao Pro's 15.2-point drop in material constraint is more likely due to an extreme penalty on a single question. Continued observation is needed to determine whether it stems from data contamination or from an update to the evaluation question bank.

Execution capability is becoming the true watershed of the 2026 mid-year battle.

Today's Smoke data shows once again that no matter how high the constraint score, if execution drops to 50, the overall score will trail the front-runners by more than 20 points. Grok 4 and Gemini have built a clear moat in code execution. If the Claude and GPT series do not fix execution continuity in their next iteration, they risk being permanently pushed out of the top five.
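The "more than 20 points" figure follows from the weights alone. The sketch below, which assumes scores are capped at 100 and uses a hypothetical helper `best_case_overall`, computes the best overall score a model can reach with execution at 50 and compares it with Grok 4's 97.44.

```python
from decimal import Decimal

# Weights taken from the article's formula.
EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")
MAX_SCORE = Decimal("100")  # assumed upper bound for each sub-score


def best_case_overall(execution: str) -> Decimal:
    """Highest overall score reachable at a given execution score."""
    return EXECUTION_WEIGHT * Decimal(execution) + CONSTRAINT_WEIGHT * MAX_SCORE


ceiling = best_case_overall("50")
print(ceiling)  # 72.50

# Gap to today's leader even with a perfect constraint score.
gap_to_leader = Decimal("97.44") - ceiling
print(gap_to_leader)  # 24.94
```

Even with a perfect constraint score, a model stuck at 50 on execution tops out at 72.5, which is 24.94 points behind today's leader.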


Data source: YZ Index | Run #123