In the Smoke lightweight evaluation on 2026-06-22, GPT-5.5 and GPT-o3 tied for first place with perfect scores: each scored 100 on the main leaderboard, 100 on Execution, and 100 on Constraint.
Structural Characteristics of Perfect Score Models
GPT-5.5 and GPT-o3 both scored 100 on each of the two dimensions, code execution and material constraint, which yields a perfect 100 under the core_overall formula: 0.55×Execution + 0.45×Constraint. Claude Opus 4.7 scored 99.01 on the main leaderboard, with 100 on Execution and 97.8 on Constraint. Its 2.2-point gap on the constraint side costs 0.99 points after the 0.45 weight is applied.
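The formula can be checked directly against the quoted scores. The following is a minimal sketch, not the index's own implementation: the function name and the two-decimal rounding are assumptions, and only the weights come from the text.

```python
# Weights as stated in the article for core_overall.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def core_overall(execution: float, constraint: float) -> float:
    """Weighted main-leaderboard score from the two dimension scores (0 to 100)."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraint


# Dimension scores quoted above. Rounding to two decimals is an assumption
# about how the leaderboard displays its values.
print("GPT-5.5 / GPT-o3:", round(core_overall(100, 100), 2))
print("Claude Opus 4.7: ", round(core_overall(100, 97.8), 2))
```

Under these assumptions the two calls reproduce the 100 and 99.01 reported on the main leaderboard.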
Differences in Strong/Weak Alignment of Execution and Constraint
The models ranked 4th to 7th (Claude Sonnet 4.6, 豆包Pro, Gemini 3.1 Pro, and Grok 4) all scored 98.83 on the main leaderboard, with 100 on Execution and 97.4 on Constraint. DeepSeek V4 Pro scored 97.8 on the main leaderboard, with 100 on Execution and 95.1 on Constraint. Its 4.9-point constraint gap pulls the overall score down by about 2.2 points under the 0.45 weight.
Qwen3 Max scored 85.96 on the main leaderboard, with 100 on Execution and 68.8 on Constraint, so its constraint score sits far below that of every model ranked above it. Gemini 2.5 Pro scored 71.33 on the main leaderboard, with only 50 on Execution and 97.4 on Constraint, making execution its main weakness. 文心一言4.5 scored 47.98 on the main leaderboard, with 50 on Execution and 45.5 on Constraint, so both dimensions are low.
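One way to see which dimension costs each model more is to weight its distance from 100 on each side. This is an illustrative sketch only: it assumes 100 is the maximum on both dimensions, and the helper name weighted_losses is hypothetical rather than part of the index.

```python
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45
MAX_SCORE = 100  # assumption: both dimensions are capped at 100


def weighted_losses(execution: float, constraint: float) -> tuple[float, float]:
    """Return main-leaderboard points lost to Execution and to Constraint."""
    return (
        EXECUTION_WEIGHT * (MAX_SCORE - execution),
        CONSTRAINT_WEIGHT * (MAX_SCORE - constraint),
    )


# (Execution, Constraint) pairs quoted in the paragraph above.
scores = {
    "Qwen3 Max": (100, 68.8),
    "Gemini 2.5 Pro": (50, 97.4),
    "文心一言4.5": (50, 45.5),
}

for model, (execution, constraint) in scores.items():
    exec_loss, cons_loss = weighted_losses(execution, constraint)
    weaker = "Execution" if exec_loss > cons_loss else "Constraint"
    print(f"{model}: {exec_loss:.2f} lost on Execution, "
          f"{cons_loss:.2f} lost on Constraint, larger loss: {weaker}")
```

Because the two losses sum to 100 minus core_overall, the split shows directly how much of each model's shortfall comes from each side.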
Abnormal Fluctuations Compared to Yesterday
文心一言4.5's main leaderboard score dropped 40.3 points compared to yesterday, with Execution down 31.3 points and Constraint down 51.3 points. Gemini 2.5 Pro's main leaderboard score dropped 28 points, with Execution down 50 points. Qwen3 Max's main leaderboard score rose 5.1 points: Execution rose 31.2 points while Constraint dropped 26.7 points.
Claude Sonnet 4.6's main leaderboard score rose 2.3 points, with Constraint up 5.2 points. 豆包Pro's main leaderboard score rose 2.2 points. Nine of the eleven models discussed here scored 100 on Execution today, while Constraint scores ranged from 100 down to 45.5.
Structural Interpretation of Abnormal Signals
Although Qwen3 Max's material constraint score plunged 26.7 points, its main leaderboard score still held at 85.96, which shows how much a 100 on Execution supports the overall score. Gemini 2.5 Pro's Execution fell by 50 points to 50, implying a score of 100 yesterday, and that drop alone accounts for 27.5 of the 28 points lost on the main leaderboard. 文心一言4.5's Execution and Constraint both fell sharply, so both weighted terms of core_overall (0.55 and 0.45) shrank at once, producing the largest decline among the models discussed here.
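Because core_overall is a linear combination, a day-over-day change in the main score splits exactly into a weighted Execution part and a weighted Constraint part. The sketch below applies that split to the deltas quoted in this report; the function name is hypothetical, and the inputs are the rounded figures from the text, so totals can differ from the published changes in the last decimal.

```python
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def delta_contributions(delta_execution: float, delta_constraint: float):
    """Split a change in core_overall into its two weighted components."""
    exec_part = EXECUTION_WEIGHT * delta_execution
    cons_part = CONSTRAINT_WEIGHT * delta_constraint
    return exec_part, cons_part, exec_part + cons_part


# (Execution change, Constraint change) versus yesterday, as quoted in the report.
deltas = {
    "文心一言4.5": (-31.3, -51.3),
    "Qwen3 Max": (31.2, -26.7),
}

for model, (d_exec, d_cons) in deltas.items():
    exec_part, cons_part, total = delta_contributions(d_exec, d_cons)
    print(f"{model}: Execution {exec_part:+.2f}, "
          f"Constraint {cons_part:+.2f}, total {total:+.2f}")
```

For Qwen3 Max the positive Execution term outweighs the negative Constraint term, which is why its main score rose despite the constraint drop.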
These fluctuations reflect only that day's 10-question quick test. How each model's Execution and Constraint scores combine determines its current ranking in the Smoke evaluation.
With Execution at 50 points and Constraint at 45.5 points, 文心一言4.5 lands at 47.98 points on today's main leaderboard.
Data source: 赢政指数 (YZ Index) | Run #191
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.