GPT-5.5 Leads Smoke Benchmark at 86.95 on a Perfect Execution Score, Exposing Constraint Weakness

In the Smoke lightweight benchmark run on July 3, 2026, GPT-5.5 ranked first on the main leaderboard with a score of 86.95, which combines a perfect code execution score of 100 with a material constraint score of 71.

Structural Differences in Execution and Constraint

The scoring formula core_overall = 0.55 × code execution + 0.45 × material constraint weights execution more heavily, which makes GPT-5.5's perfect execution score the key to its win. Claude Sonnet 4.6 scored 99.3 in execution and 70 in constraint for a main score of 86.12, likewise leaning on its execution advantage. Claude Opus 4.7 also scored a perfect 100 in execution but only 67.4 in constraint, for a main score of 85.33. Because the two models tie on execution, its 1.62-point gap to GPT-5.5 comes entirely from constraint (0.45 × 3.6).
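The arithmetic behind these main scores can be checked with a minimal Python sketch. The weights and inputs come from the article; the function name, the use of Decimal, and half-up rounding to two places are assumptions made for illustration, not the benchmark's documented implementation.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights from the article's formula; names are illustrative.
EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def core_overall(execution: str, constraint: str) -> Decimal:
    """Weighted main score, rounded half-up to two places (rounding mode is an assumption)."""
    raw = EXECUTION_WEIGHT * Decimal(execution) + CONSTRAINT_WEIGHT * Decimal(constraint)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (execution, constraint) pairs as reported in the article
reported = {
    "GPT-5.5": ("100", "71"),
    "Claude Sonnet 4.6": ("99.3", "70"),
    "Claude Opus 4.7": ("100", "67.4"),
}

for model, (execution, constraint) in reported.items():
    print(model, core_overall(execution, constraint))

# Gap between the two models with perfect execution scores
gap = core_overall("100", "71") - core_overall("100", "67.4")
print("GPT-5.5 lead over Claude Opus 4.7:", gap)
```

Decimal is used instead of float so that borderline values such as 86.115 round the same way on every run.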

Qwen3 Max scored 96.3 in execution and 71 in constraint for a main score of 84.92, placing fourth, 0.41 points behind Claude Opus 4.7. Grok 4 scored 92.1 in execution and 63.3 in constraint for a main score of 79.14; its constraint score, the lowest among the top five, dragged down its overall result.

Models Stronger in Constraint than Execution

Doubao (豆包) Pro scored 75 in execution and 81.7 in constraint for a main score of 78.02, making it the only model among the top six whose constraint score exceeds its execution score. Gemini 2.5 Pro scored 74.3 in execution and 75 in constraint for a main score of 74.62, with the two dimensions nearly level. Gemini 3.1 Pro matched the 81.7 constraint score but scored only 50 in execution, for a main score of 64.27. That is 13.75 points below Doubao Pro, which shows how little a strong constraint score can offset weak execution in the ranking.

DeepSeek V4 Pro scored 50 in execution and 70 in constraint, for a main score of 59. ERNIE Bot (文心一言) 4.5 scored 0 in both execution and constraint, for a main score of 0 and an integrity rating of fail, so it is excluded from the effective ranking.

Inferred Model Characteristics

GPT-5.5 and Claude Opus 4.7, both with perfect execution scores, have hit the ceiling in code execution, but neither exceeds 71 in constraint, which suggests that material constraint remains a common weakness among current models. Doubao (豆包) Pro's constraint score of 81.7, shared with Gemini 3.1 Pro, is the highest among the models listed, indicating a relative advantage in material constraint tasks.

Overall, the top five models all have execution scores of 92.1 or above, while the bottom five all score 75 or below in execution, underscoring the decisive role of the execution dimension in the main leaderboard ranking.
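The split described above can be reproduced from the figures in this article. The sketch below is illustrative: the dictionary layout and variable names are assumptions, and it ranks all ten listed models by the published formula, including the entry with a fail integrity rating.

```python
from decimal import Decimal

# (execution, constraint) pairs as reported in the article
REPORTED = {
    "GPT-5.5": ("100", "71"),
    "Claude Sonnet 4.6": ("99.3", "70"),
    "Claude Opus 4.7": ("100", "67.4"),
    "Qwen3 Max": ("96.3", "71"),
    "Grok 4": ("92.1", "63.3"),
    "豆包 Pro": ("75", "81.7"),      # Doubao Pro
    "Gemini 2.5 Pro": ("74.3", "75"),
    "Gemini 3.1 Pro": ("50", "81.7"),
    "DeepSeek V4 Pro": ("50", "70"),
    "文心一言 4.5": ("0", "0"),       # ERNIE Bot 4.5, integrity rating: fail
}


def core_overall(execution: str, constraint: str) -> Decimal:
    """Unrounded weighted score from the article's formula."""
    return Decimal("0.55") * Decimal(execution) + Decimal("0.45") * Decimal(constraint)


# Rank models from highest to lowest main score
ranking = sorted(REPORTED, key=lambda m: core_overall(*REPORTED[m]), reverse=True)
top_five, bottom_five = ranking[:5], ranking[5:]

# Lowest execution score in the top half, highest in the bottom half
min_top_execution = min(Decimal(REPORTED[m][0]) for m in top_five)
max_bottom_execution = max(Decimal(REPORTED[m][0]) for m in bottom_five)

print("Top five:", top_five)
print("Lowest execution score in top five:", min_top_execution)
print("Highest execution score in bottom five:", max_bottom_execution)
```

The gap between the two printed execution values is what separates the upper and lower halves of the leaderboard.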

With execution carrying 55% of the weight and the leaders already at or near 100 in that dimension, constraint scores are what now cap the top of the leaderboard.

Data source: YZ Index | Run #210