Smoke Daily: GPT-5.5 leads with 92.58 points, 19-point material constraint gap decides the outcome

As soon as Smoke released the data early this morning, the most direct conclusion was clear: code execution is no longer the dividing line, and material constraints have become the real battlefield.

Real gap hidden behind perfect execution scores

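The top nine models all scored 100 points in code execution, meaning each produced runnable code on the 10 quick-test questions. Among them, the ranking is decided by the material constraint metric alone. Across the full leaderboard the spread is wider: GPT-5.5 scored 83.5 points on constraints, while Wenxin Yiyan 4.5 scored only 64.3, a gap of 19.2 points. At the metric's 0.45 weight, that gap accounts for about 8.6 of the more than 36 points separating the two on the main leaderboard. The remainder comes from Wenxin Yiyan 4.5's execution score of 50, assuming execution carries the remaining 0.55 weight, which is consistent with GPT-5.5's 92.58 total.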
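The score arithmetic can be checked with a minimal sketch. It assumes the main-leaderboard total is a plain weighted sum: the 0.45 constraint weight is the one quoted here, while the 0.55 execution weight is an inference (it is the complement of 0.45 and reproduces GPT-5.5's 92.58), not a published figure. Wenxin Yiyan 4.5's execution score of 50 is the value reported later in this article. The function and variable names are illustrative only.

```python
# Assumption: total = 0.55 * execution + 0.45 * constraint.
# Only the 0.45 weight is stated in the article; 0.55 is inferred.
W_EXECUTION = 0.55
W_CONSTRAINT = 0.45


def total_score(execution: float, constraint: float) -> float:
    """Weighted main-leaderboard score under the assumed weights."""
    return W_EXECUTION * execution + W_CONSTRAINT * constraint


gpt_total = total_score(execution=100, constraint=83.5)
wenxin_total = total_score(execution=50, constraint=64.3)

# Split the total gap into the part from each metric.
constraint_part = W_CONSTRAINT * (83.5 - 64.3)
execution_part = W_EXECUTION * (100 - 50)

print(f"GPT-5.5 total:        {gpt_total:.2f}")
print(f"Wenxin Yiyan 4.5:     {wenxin_total:.2f}")
print(f"Total gap:            {gpt_total - wenxin_total:.2f}")
print(f"  from constraints:   {constraint_part:.2f}")
print(f"  from execution:     {execution_part:.2f}")
```

Under these assumptions the two parts sum to the full gap, which makes it easy to see how much of a ranking difference each metric explains.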

This pattern is no accident. Over the past six months, mainstream models have rapidly converged in code ability, making execution questions merely "passing questions." Now the competition is about whether the model fabricates when citing external materials, whether it ignores constraint conditions, and whether it writes information that should not be exposed into code comments.

Top five almost tied

The total scores of GPT-5.5, Doubao Pro, Claude Opus 4.7, Gemini 3.1 Pro, and Claude Sonnet 4.6 fall within 2.5 points of one another. Doubao Pro ranks second on the strength of its 82.3-point constraint score, which suggests an advantage in processing Chinese-language material. Claude Opus 4.7's constraint score of 81 is slightly lower, yet it holds third place, suggesting that its accumulated engineering judgment (a side-leaderboard metric scored by AI-assisted evaluation) remains effective.

In contrast, GPT-o3 and Wenxin Yiyan 4.5 fell to 50 points in execution, which means some of their answers to the quick-test code questions failed to run. Both models stay barely above the passing line only because of their material constraint scores.

Industry signal: constraint ability is being priced

Based on today's data, each 1-point increase in constraint score adds 0.45 points to the main leaderboard total. Execution, meanwhile, is approaching its ceiling, so further investment in execution ability yields far lower marginal returns than investment in constraint handling. Over the next three months, labs are expected to shift more RLHF resources toward "material usage compliance" and away from "writing code faster."
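The marginal-return argument can be illustrated with a rough sketch of how much main-leaderboard headroom each metric still offers the current leader. It assumes scores are capped at 100 and that execution carries a 0.55 weight; only the 0.45 constraint weight is given in the data, so the other figures and all names here are assumptions.

```python
# Assumptions: scores are capped at 100, and execution weight is 0.55
# (inferred as the complement of the stated 0.45 constraint weight).
W_EXECUTION = 0.55
W_CONSTRAINT = 0.45
CEILING = 100.0


def remaining_gain(score: float, weight: float) -> float:
    """Main-leaderboard points still available from one metric."""
    return weight * (CEILING - score)


# GPT-5.5's reported scores: 100 in execution, 83.5 in constraints.
execution_headroom = remaining_gain(100.0, W_EXECUTION)
constraint_headroom = remaining_gain(83.5, W_CONSTRAINT)

print(f"Execution headroom:  {execution_headroom:.2f} points")
print(f"Constraint headroom: {constraint_headroom:.2f} points")
```

For a model already at 100 in execution, every remaining main-leaderboard point has to come from the constraint metric.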

The absence of abnormal fluctuations points the same way: the current distribution of model capabilities is relatively settled, and a dark horse is unlikely to overturn the top five in the short term.

For every point lost in material constraints, the model incurs one more point of "unusable" risk in real-world deployment scenarios.

Data source: YZ Index | Run #155