Qwen3 Max Plunges 19.2 Points on Main Leaderboard; Four Models Score Perfect in Execution and Constraint

On 2026-06-21, the Smoke Lightweight Evaluation showed four models (DeepSeek V4 Pro, Gemini 3.1 Pro, GPT-o3, and Grok 4) scoring 100 points on the main leaderboard, in code execution, and in material constraint, a perfect result on both dimensions.

Structural Characteristics of Perfect-Score Models

The four perfect-score models each scored 100 points in both code execution and material constraint, so they reach 100 points under the core_overall formula: 0.55 × execution + 0.45 × constraint. Claude Opus 4.7 and Gemini 2.5 Pro follow closely at 99.28 points on the main leaderboard. Both scored 100 in execution but 98.4 in constraint, so material constraint is their only source of lost points.

GPT-5.5 scores 97.98 points on the main leaderboard, with 100 in execution and 95.5 in constraint. 豆包Pro scores 96.63 points, with 100 in execution and 92.5 in constraint. Claude Sonnet 4.6 scores 96.49 points, with 100 in execution and 92.2 in constraint. All three combine a perfect execution score with a noticeably lower constraint score (92.2 to 95.5), a structural pairing of "strong execution, weak constraint."
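The main-leaderboard figures above can be reproduced from the two dimension scores. The following is a minimal Python sketch, not the evaluation's actual scoring code: the function name `core_overall` simply mirrors the formula's name, the dictionary layout is illustrative, and rounding half up to two decimals is an assumption (the article does not state the rounding rule, though it is consistent with 豆包Pro's 96.625 being published as 96.63).

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the core_overall formula quoted in the article.
EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def core_overall(execution: str, constraint: str) -> Decimal:
    """Weighted main-leaderboard score, rounded half up to two decimals.

    The rounding mode is an assumption; the article does not specify it.
    """
    raw = EXECUTION_WEIGHT * Decimal(execution) + CONSTRAINT_WEIGHT * Decimal(constraint)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (execution, constraint) pairs as reported for 2026-06-21.
reported = {
    "DeepSeek V4 Pro": ("100", "100"),
    "Claude Opus 4.7": ("100", "98.4"),
    "GPT-5.5": ("100", "95.5"),
    "豆包Pro": ("100", "92.5"),
    "Claude Sonnet 4.6": ("100", "92.2"),
}

scores = {name: core_overall(e, c) for name, (e, c) in reported.items()}
for name, score in scores.items():
    print(f"{name}: {score}")
```

Decimal arithmetic is used here instead of floats so that borderline values such as 97.975 round predictably.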

Anomaly Signals Concentrated in Execution Dimension

Qwen3 Max scored 80.82 points on the main leaderboard, with 68.8 in execution and 95.5 in constraint. Compared with the previous day, its execution score dropped by 31.2 points. At the 0.55 execution weight, that decline alone accounts for about 17.2 of the 19.2-point drop on the main leaderboard. The constraint dimension stood at 95.5 points without a significant decline, indicating that the plunge stemmed mainly from less stable performance on code execution tasks.

文心一言4.5 scored 88.28 points on the main leaderboard, with 81.3 in execution and 96.8 in constraint. Compared with the previous day, execution rose by 31.3 points and the main leaderboard score rose by 17.3 points. Constraint remained high, giving the reverse pattern: constraint stronger than execution.

Weight Impact of Execution and Constraint

Because core_overall weights code execution (0.55) more heavily than material constraint (0.45), fluctuations in the execution dimension have a greater impact on the main leaderboard. Once Qwen3 Max's execution fell to 68.8 points, a constraint score of 95.5 points could not rescue its main leaderboard ranking. Likewise, 文心一言4.5's high constraint score of 96.8 points could not offset its execution score of 81.3 points, and it ranked tenth.

Yesterday, both Gemini 3.1 Pro and Gemini 2.5 Pro had execution scores of 50 points. Today, both rose to 100 points, lifting their main leaderboard scores by 29 points and 28.3 points respectively, showing that the rapid recovery of the execution dimension directly changed the day's rankings.
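To make the weight effect concrete, each day-over-day change can be split into the part explained by execution (0.55 × execution change) and whatever remains. The sketch below is illustrative only: `split_change` is a hypothetical helper, the Gemini execution change of 50 is derived from the reported move from 50 to 100, and the remainder is simply what the formula would have to attribute to constraint changes or to rounding in the published figures.

```python
from typing import Tuple

EXECUTION_WEIGHT = 0.55  # from the core_overall formula


def split_change(execution_delta: float, main_delta: float) -> Tuple[float, float]:
    """Split a main-leaderboard change into its execution share and the remainder.

    The remainder is not measured directly; it is the part of the reported
    change that the execution weight does not explain.
    """
    from_execution = EXECUTION_WEIGHT * execution_delta
    remainder = main_delta - from_execution
    return from_execution, remainder


# (execution change, main-leaderboard change) versus the previous day, as reported.
changes = {
    "Qwen3 Max": (-31.2, -19.2),
    "文心一言4.5": (31.3, 17.3),
    "Gemini 3.1 Pro": (50.0, 29.0),
    "Gemini 2.5 Pro": (50.0, 28.3),
}

breakdown = {name: split_change(*deltas) for name, deltas in changes.items()}
for name, (from_execution, remainder) in breakdown.items():
    print(f"{name}: execution {from_execution:+.2f}, remainder {remainder:+.2f}")
```

In every case the execution term dominates: it explains 27.5 points of each Gemini gain and about 17.2 points of Qwen3 Max's drop, leaving remainders of roughly two points or less.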

Large fluctuations in the execution dimension are becoming the core variable determining rankings in the Smoke Lightweight Evaluation.

In today's evaluation, models with perfect execution scores occupy the top nine positions. Only the tenth- and eleventh-ranked models, 文心一言4.5 and Qwen3 Max, have execution scores below 82 points. On material constraint, every model reported here scores 92.2 points or higher, including Qwen3 Max (95.5) and 文心一言4.5 (96.8). Overall, constraint scores are tightly clustered (92.2 to 100), while execution scores are far more dispersed (68.8 to 100).
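The difference in dispersion between the two dimensions can be quantified from the scores reported above. This is a rough sketch under the assumption that the eleven models named in this report make up the full field for the run; `spread` is a hypothetical helper, and population standard deviation is just one reasonable choice of dispersion measure.

```python
from statistics import pstdev

# (execution, constraint) for the eleven models reported on 2026-06-21.
scores = {
    "DeepSeek V4 Pro": (100.0, 100.0),
    "Gemini 3.1 Pro": (100.0, 100.0),
    "GPT-o3": (100.0, 100.0),
    "Grok 4": (100.0, 100.0),
    "Claude Opus 4.7": (100.0, 98.4),
    "Gemini 2.5 Pro": (100.0, 98.4),
    "GPT-5.5": (100.0, 95.5),
    "豆包Pro": (100.0, 92.5),
    "Claude Sonnet 4.6": (100.0, 92.2),
    "文心一言4.5": (81.3, 96.8),
    "Qwen3 Max": (68.8, 95.5),
}

execution = [e for e, _ in scores.values()]
constraint = [c for _, c in scores.values()]


def spread(values):
    """Return (min, max, range, population standard deviation)."""
    low, high = min(values), max(values)
    return low, high, high - low, pstdev(values)


exec_spread = spread(execution)
cons_spread = spread(constraint)

print("execution  min/max/range/std: %.1f / %.1f / %.1f / %.2f" % exec_spread)
print("constraint min/max/range/std: %.1f / %.1f / %.1f / %.2f" % cons_spread)
```

On these figures the execution range (31.2 points) is four times the constraint range (7.8 points), which is why a single execution swing can reorder the leaderboard.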


Data source: YZ Index | Run #190