Smoke Quick Test: ERNIE Bot 4.5 and Grok 4 Tie at 99.24, GPT-5.5's Execution Score Only 50

Jun 4, 2026 | Winzheng Index

ERNIE Bot | Material Constraints | Smoke Test | Main Leaderboard Ranking | Code Execution

Today's Smoke quick test shows that the code execution dimension is nearly saturated: ten of the eleven models scored 100. GPT-5.5 was the exception at 50, which dragged its main leaderboard score down to 59.99.

The Real Gap Behind the Tie for First

ERNIE Bot 4.5 and Grok 4 tied for first at 99.24, each with a material constraint score of 98.3. Each missed only one of the 10 quick-test questions, and they missed different ones, which suggests that constraint capability among the leaders now differs only at the margin.

Claude Opus 4.7 followed closely at 98.88 with a constraint score of 97.5. Doubao Pro and GPT-o3 both scored 98.65, each with a perfect execution score and a constraint score of 97. Constraint scores for the top six models span only 96.3 to 98.3, so the real separation among them is small.

GPT-5.5's Abnormally Low Score

GPT-5.5 is the only model that failed the execution dimension. An execution score of 50 implies it lost about half of the available credit on the code execution questions. This is consistent with yesterday's quick test and points to a persistent weakness on lightweight code tasks.
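The size of that penalty can be estimated with a back-of-envelope sketch. It assumes, hypothetically, that the main score is a weighted sum of execution and constraint with weights 0.55 and 0.45. The source does not publish its formula; these weights are used only because they reproduce the reported main scores of three perfect-execution models to within rounding.

```python
# ASSUMPTION: hypothetical weights, not published by the index.
# They are chosen because they match the reported top scores to within rounding.
W_EXECUTION = 0.55
W_CONSTRAINT = 0.45


def main_score(execution, constraint):
    """Weighted sum of the two dimensions under the assumed weights."""
    return W_EXECUTION * execution + W_CONSTRAINT * constraint


# (execution, constraint, reported main score) as given in the article.
reported = {
    "ERNIE Bot 4.5": (100, 98.3, 99.24),
    "Claude Opus 4.7": (100, 97.5, 98.88),
    "Doubao Pro": (100, 97.0, 98.65),
}

# Largest gap between the assumed formula and the reported main scores.
max_error = max(abs(main_score(e, c) - m) for e, c, m in reported.values())

# Points GPT-5.5 loses from execution alone (100 -> 50) under the assumption.
execution_penalty = W_EXECUTION * (100 - 50)

# Constraint score that would be needed to land on the reported 59.99.
# This is an estimate derived from the assumed weights, not a reported figure.
implied_constraint = (59.99 - W_EXECUTION * 50) / W_CONSTRAINT

print(f"max fit error on top models: {max_error:.3f}")
print(f"execution penalty for GPT-5.5: {execution_penalty:.1f} points")
print(f"implied GPT-5.5 constraint score: {implied_constraint:.1f}")
```

Under that assumption, the execution shortfall alone accounts for 27.5 points of the gap, and the reported 59.99 would correspond to a constraint score near 72. That figure is an estimate only and does not appear in the source data.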

Every other model achieved a perfect execution score, which indicates that Smoke's current 10-question code set no longer differentiates mainstream models. Creating a meaningful gap in future quick tests will likely require more complex tasks or multi-step reasoning chains.
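One simple way to quantify that saturation is to look at how many models sit at the ceiling and how much spread remains once the single outlier is set aside. The sketch below uses only the execution scores reported here; the helper names are illustrative and not part of any index tooling.

```python
from statistics import pstdev

# Execution scores as reported: ten models at 100, GPT-5.5 at 50.
execution_scores = [100.0] * 10 + [50.0]


def ceiling_share(scores, ceiling=100.0):
    """Fraction of models sitting exactly at the maximum score."""
    return sum(1 for s in scores if s == ceiling) / len(scores)


def spread_without_lowest(scores):
    """Population standard deviation after dropping the single lowest score."""
    trimmed = sorted(scores)[1:]
    return pstdev(trimmed)


share = ceiling_share(execution_scores)
spread = spread_without_lowest(execution_scores)

print(f"share of models at the ceiling: {share:.1%}")
print(f"spread among the remaining models: {spread}")
```

With ten of eleven models at the ceiling and zero spread among them, the execution dimension separates GPT-5.5 from the field but carries no ranking information for anyone else.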

Material Constraint Becomes the Only Variable

Today's main leaderboard ranking is almost entirely determined by material constraint. Gemini 3.1 Pro scored 87.3 on constraint, Qwen3 Max 83.3, DeepSeek V4 Pro 82, and Gemini 2.5 Pro 73.3. For every 5-point drop in constraint score, the main leaderboard score drops by approximately 2.2–2.5 points, a clear sign of how heavily this dimension is weighted.
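The per-5-point figure can be cross-checked against the scores quoted earlier in the article. The sketch below fits a least-squares line through the three (constraint, main) pairs reported for models with perfect execution. It is an illustrative check on the published numbers, not the index's own scoring method.

```python
# (constraint score, main leaderboard score) pairs reported in the article,
# all for models with a perfect execution score of 100.
reported = {
    "ERNIE Bot 4.5": (98.3, 99.24),
    "Claude Opus 4.7": (97.5, 98.88),
    "Doubao Pro": (97.0, 98.65),
}


def least_squares_slope(points):
    """Ordinary least-squares slope of main score against constraint score."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return numerator / denominator


slope = least_squares_slope(list(reported.values()))
drop_per_5_points = 5 * slope

print(f"main-score change per constraint point: {slope:.3f}")
print(f"main-score drop per 5-point constraint drop: {drop_per_5_points:.2f}")
```

The fitted drop lands inside the 2.2–2.5 range stated above, so the quoted top scores are consistent with the article's rule of thumb.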

All models received a "pass" integrity rating with no abnormal signals, and stability held steady. From an industry perspective, domestic Chinese models now compete directly with overseas closed-source models on material constraint, as the results for ERNIE Bot 4.5 and Doubao Pro show most clearly.

When nearly every model achieves a perfect execution score, each small improvement in material constraint becomes decisive for the main leaderboard ranking.

Data source: YZ Index | Run #147
