GPT-5.5 Tops Smoke Chart with Material Constraint Score of 71; All Models Get Full Code Score, but Gap Widens in Lower Half

The clearest finding from today's Smoke lightweight benchmark is that code execution no longer differentiates the top seven models. All seven scored 100 on code execution, so the rankings were determined entirely by material constraint scores.

The True Ranking Logic Under Full Code Scores

The scoring formula weights code execution at 0.55 and material constraint at 0.45. Because the top seven models all earned full marks on code execution, the spread in material constraint scores, from 71 (GPT-5.5) down to 55 (DeepSeek V4 Pro), accounts for the entire gap on the main leaderboard. GPT-5.5 reached an overall score of 86.95 on a constraint score of 71, while second-place GPT-o3 scored only 66.8 on constraint and trails by nearly 2 points overall.
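The arithmetic behind these figures can be checked with a minimal sketch. It assumes the overall score is a plain weighted sum of the two sub-scores, which is consistent with the 86.95 reported for GPT-5.5. The function name, the dictionary, and the rounding to two decimals are illustrative choices, not part of the benchmark's published tooling.

```python
# Weights as stated in the article; everything else here is illustrative.
CODE_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def overall_score(code_score: float, constraint_score: float) -> float:
    """Weighted sum of the two sub-scores, rounded to two decimals (assumed)."""
    return round(CODE_WEIGHT * code_score + CONSTRAINT_WEIGHT * constraint_score, 2)


# Constraint scores quoted in the article; each model scored 100 on code execution.
constraint_scores = {"GPT-5.5": 71.0, "GPT-o3": 66.8, "DeepSeek V4 Pro": 55.0}

overall = {name: overall_score(100.0, c) for name, c in constraint_scores.items()}

for name, score in overall.items():
    print(f"{name}: {score:.2f}")
```

Under these assumptions GPT-5.5 comes to 86.95 and GPT-o3 to 85.06, a difference of 1.89. With code execution fixed at 100, each constraint point is worth 0.45 overall points, so the 4.2-point constraint gap between the top two models shrinks to just under 2 points on the main leaderboard.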

This suggests that mainstream models in 2026 have broadly reached a high level on code execution tasks, and that competition has shifted to how strictly a model follows user instructions and context.

Hard Hits for Models in the Lower Half

Claude Opus 4.7,


Data source: YZ Index | Run #143