In the Smoke Lite Evaluation on June 15, 2026, Grok 4 ranked first among 11 models with a main score of 82.59 (Execution 100, Material Constraint 61.3, rated pass), but its Material Constraint score fell 21.7 points from the previous day.
Constraint Shortcomings of Models with Perfect Execution Scores
The top eight models all scored a perfect 100 on execution, but their Material Constraint scores clustered between 51.3 and 61.3. Grok 4, 豆包 Pro, GPT-5.5, and Qwen3 Max posted constraint scores of 61.3, 60.8, 60.8, and 60.3, for main leaderboard scores of 82.59, 82.36, 82.36, and 82.14 respectively. Claude Opus 4.7 scored 100 on execution and 59.3 on constraint for a main score of 81.69, with an integrity rating of warn.
The bottom three models scored only 50 on execution: Gemini 2.5 Pro had a main score of 53.38 (constraint 57.5), Gemini 3.1 Pro 53.06 (constraint 56.8), and 文心一言 4.5 50.59 (constraint 51.3). Because execution splits the field so sharply while constraint scores stay within a narrow band, the bottom three trail the leaders by roughly 30 points on the main leaderboard (32.0 points between Grok 4 and 文心一言 4.5).
Sharp Fluctuations Compared to Yesterday
Compared with yesterday's data, Gemini 3.1 Pro's main score dropped by 39.4 points, execution by 47.5 points, and constraint by 29.5 points. Qwen3 Max's main score rose by 29.3 points, but its constraint dropped by 30.7 points. 豆包 Pro's main score rose by 23.1 points, with constraint dropping by 24 points. DeepSeek V4 Pro's main score rose by 16.2 points, with constraint dropping by 39.2 points. Gemini 2.5 Pro's main score dropped by 17.2 points.
Several models lost more than 30 points in the Material Constraint dimension: DeepSeek V4 Pro fell by 39.2 points, Claude Sonnet 4.6 by 38.7, Claude Opus 4.7 by 38, 文心一言 4.5 by 32.5, and Qwen3 Max by 30.7. For models whose execution score held at 100, the decline in constraint fed directly into a lower main leaderboard score.
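As a rough illustration of how the day-over-day figures can be read, the sketch below backs out yesterday's implied scores for Gemini 3.1 Pro from today's values and the reported changes, then cross-checks them against the article's 0.55 / 0.45 core_overall weighting. The helper name previous_value and the dictionary layout are hypothetical, and because the reported changes are rounded to one decimal, the implied values are approximate.

```python
from decimal import Decimal


def previous_value(today, change):
    """Yesterday's score implied by today's score and the signed change."""
    return Decimal(today) - Decimal(change)


# Today's values and signed day-over-day changes for Gemini 3.1 Pro,
# as quoted in the article (a drop is written as a negative change).
today = {"main": "53.06", "execution": "50", "constraint": "56.8"}
change = {"main": "-39.4", "execution": "-47.5", "constraint": "-29.5"}

yesterday = {key: previous_value(today[key], change[key]) for key in today}

# Cross-check: recompute yesterday's main score from the implied
# execution and constraint scores using the 0.55 / 0.45 weighting.
recomputed_main = (
    Decimal("0.55") * yesterday["execution"]
    + Decimal("0.45") * yesterday["constraint"]
)

print("implied yesterday:", yesterday)
print("recomputed main:", recomputed_main)
```

Under these assumptions the implied execution and constraint scores reproduce the implied main score, which suggests the three reported changes for this model are mutually consistent.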
Direct Impact of Score Structure Differences
The core_overall formula is 0.55 × execution + 0.45 × constraint. A 1-point drop in constraint therefore lowers the main score by 0.45 points regardless of the execution score; for models with an execution score of 50, constraint simply makes up a larger share of the total. Today, the median constraint score is about 57 for the top eight models and 56.8 for the bottom three, while execution differs by 50 points between the two groups, which is worth 27.5 points on the main leaderboard (0.55 × 50). The main leaderboard gap between the two groups therefore comes mainly from the execution dimension.
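The weighting can be checked against the scores quoted earlier. The following is a minimal sketch, not the leaderboard's actual implementation: the function names, the REPORTED table layout, and the round-half-up rule to two decimals are all assumptions made for illustration. It recomputes each quoted main score and splits the gap between two models into an execution part and a constraint part.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights from the article's stated core_overall formula.
W_EXEC = Decimal("0.55")
W_CONS = Decimal("0.45")


def core_overall(execution, constraint):
    """Unrounded weighted score: 0.55 * execution + 0.45 * constraint."""
    return W_EXEC * Decimal(execution) + W_CONS * Decimal(constraint)


def rounded(score):
    # Assumption: the leaderboard rounds half up to two decimals.
    return score.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (execution, constraint, reported main score), as quoted in the article.
REPORTED = {
    "Grok 4": ("100", "61.3", "82.59"),
    "豆包 Pro": ("100", "60.8", "82.36"),
    "GPT-5.5": ("100", "60.8", "82.36"),
    "Qwen3 Max": ("100", "60.3", "82.14"),
    "Claude Opus 4.7": ("100", "59.3", "81.69"),
    "Gemini 2.5 Pro": ("50", "57.5", "53.38"),
    "Gemini 3.1 Pro": ("50", "56.8", "53.06"),
    "文心一言 4.5": ("50", "51.3", "50.59"),
}

# Models whose recomputed score differs from the reported one.
mismatches = [
    name
    for name, (execution, constraint, main) in REPORTED.items()
    if rounded(core_overall(execution, constraint)) != Decimal(main)
]


def gap_breakdown(a, b):
    """Split the main-score gap between a and b by dimension."""
    exec_a, cons_a, _ = REPORTED[a]
    exec_b, cons_b, _ = REPORTED[b]
    exec_part = W_EXEC * (Decimal(exec_a) - Decimal(exec_b))
    cons_part = W_CONS * (Decimal(cons_a) - Decimal(cons_b))
    return exec_part, cons_part


exec_part, cons_part = gap_breakdown("Grok 4", "Gemini 2.5 Pro")
print("mismatches:", mismatches)
print("execution part:", exec_part, "constraint part:", cons_part)
```

With the quoted figures, every recomputed score matches the reported one under the assumed rounding rule, and the gap between Grok 4 and Gemini 2.5 Pro splits into 27.5 points from execution and 1.71 points from constraint. Decimal is used instead of float so that values such as 82.585 round predictably.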
Anomalous signals were concentrated in Material Constraint: all 11 models declined in this dimension, 10 of them by more than 20 points. Grok 4 still ranks first, but its constraint score of 61.3 is already close to the passing line.
With a perfect execution score now standard among the leading models, Material Constraint is becoming the key variable that determines their ranking.
Today's Smoke data reflects only a single-day, 10-question quick test, and the stability dimension was not included in this Lite Evaluation. Subsequent observation will focus on how quickly each model's constraint score recovers.
Data source: YZ Index | Run #176
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article