Smoke Quick Test: 文心一言4.5 and Grok 4 Tie at 99.24, GPT-5.5's Execution Score Only 50

Today's Smoke quick test shows that the code execution dimension is nearly saturated: ten of the eleven models scored 100. GPT-5.5 was the exception at 50, which dragged its main leaderboard score down to 59.99.

The Real Gap Behind the Tie for First

文心一言4.5 and Grok 4 tied for first at 99.24, both with material constraint scores of 98.3. Each missed only one of the 10 quick-test questions, and the two missed different questions, which suggests that differences in constraint capability among the top models are now very small.

Claude Opus 4.7 followed closely at 98.88, with a constraint score of 97.5. 豆包Pro and GPT-o3 both scored 98.65, each with a perfect execution score and a constraint score of 97. Constraint scores for the top six models span only 96.3 to 98.3, so the practical difference between them is very limited.

GPT-5.5's Abnormally Low Score

GPT-5.5 is the only model that failed the execution dimension. An execution score of 50 means it answered at least half of the code execution questions incorrectly. This is consistent with yesterday's quick test and points to a persistent weakness in lightweight code tasks.

All other models achieved perfect execution scores, which indicates that Smoke's current 10-question code task set no longer differentiates mainstream models effectively. Creating a meaningful gap in future quick tests may require more complex tasks or multi-step reasoning chains.

Material Constraint Becomes the Only Variable

Today's main leaderboard ranking is almost entirely determined by material constraint. Gemini 3.1 Pro scored 87.3 in constraint, Qwen3 Max 83.3, DeepSeek V4 Pro 82, and Gemini 2.5 Pro 73.3. Among models with perfect execution scores, every 5-point drop in constraint score lowers the main leaderboard score by approximately 2.2–2.5 points, which reflects how heavily this dimension is weighted.
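The 2.2–2.5 figure can be sanity-checked against the scores quoted in this article. The sketch below fits a straight line through the three (constraint, main score) pairs reported for models with perfect execution scores. It assumes the main score is roughly linear in the constraint score; the index's actual weighting formula is not given here, and the helper name `least_squares_slope` is purely illustrative.

```python
# (constraint score, main leaderboard score) pairs quoted in this article.
# All of these models have a perfect execution score of 100.
points = [
    (98.3, 99.24),  # 文心一言4.5 and Grok 4
    (97.5, 98.88),  # Claude Opus 4.7
    (97.0, 98.65),  # 豆包Pro and GPT-o3
]


def least_squares_slope(pairs):
    """Ordinary least-squares slope of main score against constraint score."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    return cov / var


# Assumption: a linear relationship, so the slope acts as an implied weight.
slope = least_squares_slope(points)
drop_per_5 = 5 * slope

print(f"implied constraint weight: {slope:.3f}")
print(f"main-score change per 5 constraint points: {drop_per_5:.2f}")
```

On these three data points the fitted slope comes out near 0.45, or about 2.3 main-score points per 5 constraint points, which falls inside the stated range. With only three points this is a rough consistency check, not a reconstruction of the scoring formula.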

All models received a "pass" integrity rating with no abnormal signals, and stability remained steady. From an industry perspective, Chinese domestic models now compete directly with overseas closed-source models on material constraint, and the performance of 文心一言4.5 and 豆包Pro demonstrates this most directly.

When nearly every model achieves a perfect execution score, each small improvement in material constraint becomes a decisive factor in the main leaderboard ranking.

Data source: YZ Index | Run #147