In today's Smoke lightweight evaluation, Doubao Pro topped the chart with 97.75 points (Execution 100, Constraint 95), the only one of 11 mainstream models to break 97 points on the main ranking. GPT-o3 followed with 94.51 points and Claude Sonnet 4.6 with 93.7 points. The highly anticipated GPT-5.5 scored only 60.58 points, down 23.5 points from yesterday.
Halved Execution Score Exposes Core Issues
GPT-5.5 scored only 50 points in the execution dimension today, a drop of about 50 points from the roughly 100 points it had held through yesterday. This directly dragged down its core_overall score: the formula gives the execution dimension a weight of 0.55, so a collapse in that single dimension has a large effect on the total. It is likely that in today's 10-question quick test GPT-5.5 failed several questions on code execution consistency, which would also widen the standard deviation of its results.
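The effect of the 0.55 execution weight can be checked with a short sketch. Only the 0.55 weight is stated in the article; the 0.45 constraint weight is an assumption, inferred from the fact that it reproduces the reported totals for Doubao Pro and GPT-o3. The function names are hypothetical and not part of YZ Index.

```python
# Sketch of a two-term weighted score. EXECUTION_WEIGHT comes from the article;
# CONSTRAINT_WEIGHT is an assumption inferred from the published totals.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45  # assumed, not stated in the source


def core_overall(execution: float, constraint: float) -> float:
    """Weighted total of the two dimensions, rounded to two decimals."""
    total = EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraint
    return round(total, 2)


def execution_drop_impact(drop: float) -> float:
    """Points removed from the total by a drop in the execution score alone."""
    return round(EXECUTION_WEIGHT * drop, 2)


if __name__ == "__main__":
    print(core_overall(100, 95))      # Doubao Pro: 97.75
    print(core_overall(100, 87.8))    # GPT-o3: 94.51
    print(execution_drop_impact(50))  # 27.5
```

Under these assumed weights, a 50-point fall in execution removes 27.5 points from the total on its own, which is why one dimension can dominate a day-over-day change.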
Material Constraint Becomes Today's Watershed
The top 7 models on the leaderboard all scored 100 points in the execution dimension, so what separated them was material constraint. Doubao Pro scored 95 points in constraint, GPT-o3 87.8 points, and Gemini 2.5 Pro only 80.3 points. The constraint scores of Qwen3 Max and Gemini 3.1 Pro also fell, by 6.3 and 6 points respectively, indicating that today's questions placed higher demands on the models' material constraint ability. ERNIE Bot 4.5 scored 74.5 points in constraint and failed the integrity test outright, further confirming its weakness in factual anchoring.
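To see why constraint alone decides the ranking once execution is maxed out, the weighted score can be inverted to give the constraint score needed for a target total. This is a rough sketch: the 0.55 execution weight is from the article, the 0.45 constraint weight is an inferred assumption, and `constraint_needed` is a hypothetical helper.

```python
# Invert an assumed two-term weighted score to find the constraint score
# required to reach a target total at a given execution score.
EXECUTION_WEIGHT = 0.55   # stated in the article
CONSTRAINT_WEIGHT = 0.45  # assumed, inferred from the reported totals


def constraint_needed(target_total: float, execution: float = 100.0) -> float:
    """Constraint score required for target_total, rounded to two decimals."""
    remaining = target_total - EXECUTION_WEIGHT * execution
    return round(remaining / CONSTRAINT_WEIGHT, 2)


if __name__ == "__main__":
    # With full marks in execution, breaking 97 points needs roughly 93.33
    # in constraint.
    print(constraint_needed(97.0))  # 93.33
```

Under these assumptions, Doubao Pro's 95 clears that threshold while GPT-o3's 87.8 does not, which is consistent with Doubao Pro being the only model above 97 points.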
Possible Reasons for Concurrent Decline of Multiple Models
Four models fell by more than 10 points on the main ranking today: GPT-5.5 (-23.5), ERNIE Bot 4.5 (-12.1), Gemini 3.1 Pro (-11.1), and Qwen3 Max (-10.9). A decline this concentrated is unlikely to come from major version updates to the models themselves. It is more likely that the difficulty or distribution of the material constraint section shifted significantly in today's 10-question Smoke test. Most models still scored high in the execution dimension, indicating that basic code generation has not regressed and that the problem is concentrated in "accuracy and consistency under given materials."
Doubao Pro's constraint score rose 26 points from yesterday, showing stronger adaptability to material-dependent tasks in a lightweight evaluation setting. This is likely related to ByteDance's recent sustained investment in multimodal alignment and fact verification.
Industry Signals and Judgment
Material constraint capability has become a key indicator separating top-tier models from the second tier, while the execution dimension has reached the point where "passing means full marks." Future evaluation weights may tilt further toward constraint. GPT-5.5's performance today suggests it may have sacrificed some stability during rapid iteration, so it is worth watching whether its scores keep declining over the next two days of evaluation.
Material constraint determines the ceiling; full marks in execution are just the entry ticket.
Data Source: YZ Index | Run #121
© 2026 Winzheng.com 赢政天下 | Reprints must credit the source and include a link to the original article