Smoke Quick Test: Doubao Pro Scores 100 in Execution, 9 Models Plunge Over 30 Points on Main Leaderboard

May 22, 2026 | Winzheng Index

Doubao Pro | Code Execution | Main Leaderboard Plunge | Evaluation Anomaly | Model Trends

Today's Smoke lightweight evaluation upended expectations that mainstream models would hold steady. Doubao Pro took first place with 91.23 points, scoring a perfect 100 in code execution and 80.5 in material constraint and passing the integrity rating. That made it the only model with zero errors in the code section of the 10-question quick test.
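The article does not say how the main score is derived from the individual dimensions. As a rough sanity check, the sketch below assumes (hypothetically) that the main score is a weighted average of just the execution and constraint scores, and solves for the execution weight that Doubao Pro's figures would imply. The function name and the two-dimension formula are illustrative assumptions, not the index's documented method.

```python
# Figures reported in the article for Doubao Pro.
main_score = 91.23
execution = 100.0
constraint = 80.5


def implied_execution_weight(main, execution, constraint):
    """Solve main = w * execution + (1 - w) * constraint for w.

    Assumption: the main score is a weighted average of exactly these
    two dimensions. The source does not confirm this formula.
    """
    if execution == constraint:
        raise ValueError("weight is undetermined when both scores are equal")
    return (main - constraint) / (execution - constraint)


w = implied_execution_weight(main_score, execution, constraint)
print(f"implied execution weight: {w:.3f}")
```

Under that assumption the implied weight comes out at roughly 0.55, so the reported figures are at least consistent with a 55/45 split between execution and constraint. Other scoring schemes could produce the same number.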

Execution Dimension Collectively Sluggish, Test Difficulty May Have Escalated

The remaining models performed poorly on code execution: Gemini 3.1 Pro had the second-highest execution score at 57.2, while others, including Claude Sonnet 4.6, Grok 4, Qwen3 Max, and GPT-5.5, all stalled at 50 points, and Gemini 2.5 Pro and ERNIE Bot 4.5 scored 0 outright. This is not a simple ranking shift but a cliff-edge collapse in execution capability.

Compared to yesterday's data, Gemini 2.5 Pro plunged 54.3 points on the main leaderboard, DeepSeek V4 Pro dropped 36.2 points, ERNIE Bot 4.5 fell 36.7 points, and Grok 4 and Qwen3 Max also declined by 34.7 and 34.3 points respectively. The collective halving or zeroing of execution scores points to a significant increase in today's 10-question code task difficulty, rather than a sudden failure of the models themselves.
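The day-over-day figures above can be tabulated in a few lines. This is a minimal sketch that uses only the five drops quoted in this paragraph; the headline counts nine models, but the remaining ones are not itemized here, so they are left out rather than guessed. The 30-point threshold mirrors the headline's cutoff.

```python
from statistics import median

# Day-over-day main-leaderboard drops named in the article (points).
drops = {
    "Gemini 2.5 Pro": 54.3,
    "DeepSeek V4 Pro": 36.2,
    "ERNIE Bot 4.5": 36.7,
    "Grok 4": 34.7,
    "Qwen3 Max": 34.3,
}

THRESHOLD = 30.0  # cutoff used in the headline

# Models whose drop exceeds the threshold, largest drop first.
over_threshold = sorted(
    (name for name, drop in drops.items() if drop > THRESHOLD),
    key=lambda name: drops[name],
    reverse=True,
)
median_drop = median(drops.values())

print(over_threshold)
print(f"median drop among named models: {median_drop:.1f}")
```

All five named models clear the threshold, and their drops cluster in the mid-30s apart from Gemini 2.5 Pro. A uniform shift of that kind is what the paragraph reads as a change in the test rather than in any single model.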

Material Constraints Relatively Stable, but Anomalies Emerge

On the material constraint dimension, most models remained in the 70–81 point range, with Claude Sonnet 4.6 leading at 81 points, followed closely by Gemini 3.1 Pro and Claude Opus 4.7. However, Claude Opus 4.7's constraint score plunged 17.6 points in a single day, indicating noticeable fluctuations even for models with strong constraint capabilities.

In terms of integrity ratings, Gemini 2.5 Pro, ERNIE Bot 4.5, and Qwen3 Max shifted from warn or fail to pass, suggesting some improvement in compliant output. However, this did not offset the massive losses in the execution dimension.

Industry Insight: Code Ability Becomes a New Dividing Line

Doubao Pro's perfect execution score validates its continuous optimization on engineering tasks. The lack of consistency among other leading models in complex code scenarios exposes the limitations of current training and alignment strategies. Today's evaluation felt more like a stress test, revealing the vulnerability of most models under real-world engineering constraints.

Overall, Doubao Pro has established a clear generational advantage. To catch up, other models must make targeted breakthroughs in the robustness of code execution; otherwise the gap on the main leaderboard will continue to widen.

Code execution is no longer a bonus; it is the battlefield that determines survival.

Data source: YZ Index | Run #127

