Two Zero-Execution Shocks, Claude Holds at 88.75

Today’s Smoke isn’t just about Claude winning. Nine models scored full marks in code execution, so the real gap lies in material constraint, while ERNIE Bot 4.5 and Grok 4 dropped to zero on execution.

At 3:00 AM on May 15, the YZ Index Smoke lightweight benchmark tested 11 mainstream models with 10 quick questions, focusing on two auditable metrics: code execution and material constraint. The main ranking formula is: 0.55 × code execution + 0.45 × material constraint. The results are striking: Claude Opus 4.7 ranks first with 88.75, scoring 100 in code execution and 75 in material constraint, with an integrity rating of pass; Claude Sonnet 4.6 and Qwen3 Max tie at 86.05, both with full execution marks and 69 in material constraint.
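To make the weighting concrete, here is a minimal sketch of the main ranking computation. The 0.55/0.45 weights and the component scores come straight from today’s table; the Python function itself is our illustration, not YZ Index’s published code.

```python
# Minimal sketch of the Smoke main ranking, assuming the published
# weights apply directly to the two component scores (0-100 scale).
# Illustrative only; not YZ Index's actual implementation.

def main_ranking(code_execution: float, material_constraint: float) -> float:
    """Main ranking = 0.55 * code execution + 0.45 * material constraint."""
    return 0.55 * code_execution + 0.45 * material_constraint

print(main_ranking(100, 75))    # Claude Opus 4.7 -> 88.75
print(main_ranking(100, 69))    # Claude Sonnet 4.6 / Qwen3 Max -> 86.05
print(main_ranking(100, 64.5))  # Doubao Pro and peers -> 84.025 (reported 84.03)
print(main_ranking(0, 64.5))    # ERNIE Bot 4.5 -> 29.025 (reported 29.03)
print(main_ranking(0, 25))      # Grok 4 -> 11.25
```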

First Judgment: Code Execution Is Becoming “Saturated,” Material Constraint Is the Hard Barrier

Today, the top nine models all scored 100 in code execution: Claude, Qwen, Doubao, both Geminis, GPT-5.5, GPT-o3, and DeepSeek V4 Pro all ran the tasks. This points to a trend: in lightweight tasks, mainstream models’ code execution ability has entered a homogenized zone. In other words, the ability to write and run code is no longer scarce.

The real gap appears in material constraint. Opus 4.7 scored 75, the only model above 70 today; Sonnet 4.6 and Qwen3 Max scored 69; Doubao Pro, Gemini 2.5 Pro, Gemini 3.1 Pro, GPT-5.5, GPT-o3, and DeepSeek V4 Pro all stalled at 64.5. This distribution shows that models have not yet fully solved the problem of “sticking to the material, not overstepping, and not hallucinating.”

Today’s ranking is not a competition of execution ability, but a filter of constraint. The model that better follows the material is closer to enterprise readiness.

Claude Wins, but Sonnet Has a Red Flag

Claude Opus 4.7’s victory today is clean: execution 100, material constraint 75, integrity rating pass, main ranking 88.75. It didn’t rely on a single breakout metric; it posted the best combination of full execution and comparatively stronger constraint.

However, Claude Sonnet 4.6, from the same family, looks less stable. Although it still ranks second at 86.05, its material constraint dropped sharply, down 27.5 points from yesterday, one of the most notable anomalies to watch today. For production environments, a sudden drop in material constraint is more troublesome than a single mistake, as it often indicates drift in the model’s citation boundaries, instruction compliance, or context selection. Claude’s brand strength has always been reliable output; if Sonnet continues to decline, it will shift from a “default safe option” to a “high-performance option requiring review.”

Mixed Fortunes for Domestic Models: Qwen Steady, ERNIE Bot Collapses

Qwen3 Max performed strongly today, with a main ranking of 86.05, tied for second with Claude Sonnet 4.6, achieving 100 in execution, 69 in material constraint, and an integrity rating of pass. The value of this result lies in its balance across both core metrics, which places it in the first tier. For domestic enterprises, Qwen3 Max is no longer just an “alternative” but a model that can enter the main candidate pool.

Doubao Pro also saw a significant recovery: its main ranking of 84.03 is up 10.2 from yesterday, driven by a 25-point gain in execution; material constraint, however, dropped 8 points, indicating that today’s improvement came mainly from execution fixes rather than an overall enhancement in constraint capability.
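Because the main ranking is linear, any day-over-day move splits cleanly between the two components. A quick check of Doubao Pro’s swing, assuming the same 0.55/0.45 weights apply to the deltas:

```python
# Decompose a day-over-day change in main ranking into its components,
# assuming the linear 0.55/0.45 weighting (an inference, not published).

def ranking_delta(d_execution: float, d_constraint: float) -> float:
    return 0.55 * d_execution + 0.45 * d_constraint

execution_part = 0.55 * 25      # +13.75 from the execution fix
constraint_part = 0.45 * (-8)   # -3.60 from the constraint slip
print(ranking_delta(25, -8))    # 10.15, matching the reported +10.2 up to rounding
```

The 13.75-point gain from execution dwarfs the 3.6-point loss from constraint, which is why today’s recovery reads as an execution fix rather than a constraint improvement.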

On the other hand, ERNIE Bot 4.5 performed very poorly today: main ranking 29.03, down 44 points from yesterday, with code execution falling from 69 to 0 and material constraint slipping 13.5 to 64.5; its integrity rating is warn. A zero in execution on a quick 10-question test like Smoke is a strong alarm, possibly caused by issues in the runtime pipeline, tool invocation, question-type adaptation, or output format. Whatever the cause, users see only one result: the task was not completed.

Grok 4 and DeepSeek Have Different Problems

Grok 4 had a main ranking of 11.25, with 0 in code execution, 25 in material constraint, and an integrity rating of fail—a drop of 38.2 from yesterday’s main ranking. This is not a small fluctuation but a failure in core ability during this quick test. In particular, its material constraint of only 25 means it not only failed to complete the code tasks but also failed to adhere to the boundary of “answering based on the given material.”

DeepSeek V4 Pro is more nuanced: execution 100, material constraint 64.5, yet its integrity rating dropped from pass to fail and its main ranking landed at 74, well below the 84.03 its components would yield under the headline formula, which suggests the fail rating carries a direct penalty. The key point here is not whether it can write code, but whether it can be safely included in the same procurement pool. Integrity rating is a threshold, not a bonus; once it reaches fail, enterprises should not just focus on the full execution score but first investigate whether there is a risk of untrustworthy outputs.
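If integrity is a threshold, the natural enterprise move is to gate on it before comparing scores at all. Below is a minimal sketch of such a procurement filter; the model data mirrors today’s table, but the gating policy (pass enters the pool, warn goes to review, fail is excluded) is our assumption, not anything YZ Index prescribes.

```python
# Hypothetical procurement filter: treat the integrity rating as a hard
# gate applied before any score comparison. The policy below is an
# assumption for illustration; YZ Index does not publish one.

candidates = [
    ("Claude Opus 4.7", 88.75, "pass"),
    ("Qwen3 Max",       86.05, "pass"),
    ("DeepSeek V4 Pro", 74.00, "fail"),
    ("ERNIE Bot 4.5",   29.03, "warn"),
    ("Grok 4",          11.25, "fail"),
]

pool   = [(name, score) for name, score, rating in candidates if rating == "pass"]
review = [name for name, _, rating in candidates if rating == "warn"]

print(pool)    # [('Claude Opus 4.7', 88.75), ('Qwen3 Max', 86.05)]
print(review)  # ['ERNIE Bot 4.5']
```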

Tier Snapshot

  • First Tier: Claude Opus 4.7, the only model with a main ranking near 90 and a material constraint of 75.
  • Chasers: Claude Sonnet 4.6 and Qwen3 Max, tied on score, though Sonnet shows a sharp drop in constraint.
  • Crowded Middle: Doubao, the Geminis, GPT-5.5, and GPT-o3, all at 84.03, differentiated mainly by integrity rating and future fluctuations.
  • Risk Zone: ERNIE Bot 4.5 and Grok 4; zero execution should not be downplayed.

My conclusion is clear: in 2026, model competition has shifted from “who answers better” to “who goes off the rails less.” Today’s Smoke ranking signals that code execution is becoming an infrastructure capability, while material constraint is the moat for high-end models.

The next phase: enterprises buy models not for the most articulate one, but for the one least likely to overstep.

Data source: YZ Index | Run #117