GPT-5.5 Main Ranking Plunges 23.5 Points, Doubao Pro 97.75 Tops Smoke

May 18, 2026 | Winzheng Index

Doubao Pro, GPT-5.5, Smoke Test, Main Ranking Volatility, Material Constraints

Today's Smoke lightweight evaluation puts Doubao Pro at the top of the main ranking with 97.75 points (Execution 100, Constraint 95), the only one of the 11 mainstream models tested to break 97 points. GPT-o3 follows with 94.51 points and Claude Sonnet 4.6 with 93.7 points. The highly anticipated GPT-5.5 scored only 60.58 points, down 23.5 points from yesterday.

Execution Score Halving Exposes Core Issues

GPT-5.5 scored only 50 points in the execution dimension today, a drop of about 50 points from the previous day, and this directly dragged down its core_overall score. The scoring formula weights execution at 0.55, so a collapse in this single dimension has a large effect on the total. Yesterday's data shows GPT-5.5's execution score holding at around 100 points. In today's 10-question quick test it likely failed several code-execution consistency checks, which would also produce a larger standard deviation in its scores.
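The weighting described above can be checked against the published totals. The following is a minimal sketch, assuming core_overall is a two-term weighted sum: the 0.55 execution weight comes from the article, while the 0.45 constraint weight is an assumption (the remainder to 1), adopted because it reproduces the reported scores for Doubao Pro and GPT-o3. The function names are illustrative only.

```python
EXECUTION_WEIGHT = 0.55   # stated in the article
CONSTRAINT_WEIGHT = 0.45  # assumed: the article does not state this weight


def core_overall(execution: float, constraint: float) -> float:
    """Main-ranking score under the assumed two-term weighted formula."""
    return round(EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraint, 2)


def execution_contribution(execution_change: float) -> float:
    """Points the main-ranking score moves for a given change in execution."""
    return round(EXECUTION_WEIGHT * execution_change, 2)


print(core_overall(100, 95))        # Doubao Pro: 97.75
print(core_overall(100, 87.8))      # GPT-o3: 94.51
print(execution_contribution(-50))  # effect of a 50-point execution loss: -27.5
```

Under this assumed formula, a 50-point execution loss alone would cost 27.5 points. The reported drop is 23.5 points, which would be consistent with GPT-5.5's constraint score having risen somewhat at the same time, if its execution score was exactly 100 yesterday.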

Material Constraint Becomes Today's Watershed

The leaderboard shows the top 7 models all scoring 100 points in the execution dimension, so what actually separated them was material constraint. Doubao Pro scored 95 points in constraint, GPT-o3 87.8 points, and Gemini 2.5 Pro only 80.3 points. Qwen3 Max and Gemini 3.1 Pro saw their constraint scores fall by 6.3 and 6 points respectively, which suggests that today's questions placed higher demands on "material constraint" ability. ERNIE Bot 4.5 scored 74.5 points in constraint and failed the integrity test outright, further confirming its weakness in factual anchoring.
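When execution scores are tied, the whole main-ranking gap between two models comes from the constraint dimension. A small sketch of that arithmetic follows, again assuming a constraint weight of 0.45 (not stated in the article; it is the remainder after the 0.55 execution weight). The helper name `overall_gap` is hypothetical.

```python
CONSTRAINT_WEIGHT = 0.45  # assumed: 1 minus the 0.55 execution weight

# Constraint scores reported in the article; both models scored 100 in execution.
constraint_scores = {"Doubao Pro": 95.0, "GPT-o3": 87.8}


def overall_gap(constraint_a: float, constraint_b: float) -> float:
    """Main-ranking gap between two models whose execution scores are equal."""
    return round(CONSTRAINT_WEIGHT * (constraint_a - constraint_b), 2)


gap = overall_gap(constraint_scores["Doubao Pro"], constraint_scores["GPT-o3"])
print(gap)  # 3.24, matching the reported 97.75 - 94.51
```

The 7.2-point difference in constraint accounts for the full 3.24-point gap between the two leaders.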

Possible Reasons for Concurrent Decline of Multiple Models

Four models fell by more than 10 points on the main ranking today: GPT-5.5 (-23.5), ERNIE Bot 4.5 (-12.1), Gemini 3.1 Pro (-11.1), and Qwen3 Max (-10.9). A decline this concentrated is unlikely to come from major version updates to the models themselves. More likely, the difficulty or distribution of the material constraint questions in today's 10-question Smoke test shifted significantly. Most models still scored high in execution, which indicates that basic code generation has not regressed; the problem is concentrated in "accuracy and consistency under given materials."
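The reasoning above (many models dropping at once points to the test rather than the models) can be expressed as a simple check. This is only an illustrative sketch: the day-over-day changes are those quoted in the article, while the function names and the `min_models` cutoff are hypothetical choices, not part of the index's methodology.

```python
# Day-over-day main-ranking changes quoted in the article (points).
main_ranking_change = {
    "GPT-5.5": -23.5,
    "ERNIE Bot 4.5": -12.1,
    "Gemini 3.1 Pro": -11.1,
    "Qwen3 Max": -10.9,
}


def large_drops(changes: dict[str, float], threshold: float = 10.0) -> list[str]:
    """Return models that fell by more than `threshold` points, worst first."""
    dropped = {model: delta for model, delta in changes.items() if delta < -threshold}
    return sorted(dropped, key=dropped.get)


def looks_like_test_shift(changes: dict[str, float], min_models: int = 3) -> bool:
    """Hypothetical heuristic: several simultaneous large drops suggest the test changed."""
    return len(large_drops(changes)) >= min_models


flagged = large_drops(main_ranking_change)
print(flagged)
print(looks_like_test_shift(main_ranking_change))
```

With four of the 11 models flagged on the same day, the heuristic favors a shift in the test set over independent model regressions.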

Doubao Pro's constraint score rose 26 points from yesterday, showing stronger adaptability to material-dependent tasks in a lightweight evaluation environment. This is directly related to ByteDance's recent continuous investment in multimodal alignment and fact verification.

Industry Signals and Judgment

Material constraint capability has become a key indicator separating top-tier models from the second tier, while the execution dimension has reached a stage where "passing means full marks." Future evaluation weights may tilt further toward constraint. GPT-5.5's performance today suggests that it may have sacrificed some stability during rapid iteration; whether its scores keep declining over the next two consecutive days of evaluation needs to be monitored.

Material constraint determines the ceiling; a perfect execution score is just the entry ticket.

Data Source: YZ Index | Run #121
