In the Smoke lightweight evaluation on June 20, 2026, GPT-5.5's main score dropped from 93 to 72.5 compared with the previous day. Its execution score fell from 100 to 50, and its constraint score declined from 115.5 to 100.
Clear Divergence in Execution and Constraint Structure
The top seven models all have execution scores of at least 98.4. Claude Opus 4.7 and Qwen3 Max achieved perfect scores in both execution and constraint. Claude Sonnet 4.6, DeepSeek V4 Pro, 豆包Pro, and GPT-o3 all share an execution score of 100 and a constraint score of 96.7, forming a stable pattern of "perfect execution + slight concession in constraint".
The bottom four models exhibit the opposite pattern: GPT-5.5, 文心一言4.5, Gemini 2.5 Pro, and Gemini 3.1 Pro all saw their execution scores drop to 50, while their constraint scores stayed between 96.7 and 100. Under the core_overall formula, execution carries a weight of 0.55, so these four models' main scores are dragged down significantly.
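The main scores quoted in this report can be approximated with a minimal sketch of a weighted sum. Only the execution weight of 0.55 is stated here. The constraint weight of 0.45 and the two-term form are assumptions, chosen because they match the reported main scores of GPT-5.5 (72.5) and Grok 4 (97.64). The function name simply borrows the formula's name, and the actual YZ Index implementation may differ.

```python
# Hypothetical reconstruction of the core_overall formula.
EXECUTION_WEIGHT = 0.55   # stated in the report
CONSTRAINT_WEIGHT = 0.45  # assumption: the remaining weight, not stated in the report


def core_overall(execution: float, constraint: float) -> float:
    """Return a weighted main score from the two dimension scores."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraint


if __name__ == "__main__":
    # GPT-5.5 today: execution 50, constraint 100 (reported main score: 72.5).
    print(f"GPT-5.5: {core_overall(50, 100):.3f}")
    # Grok 4 today: execution 98.4, constraint 96.7 (reported main score: 97.64).
    print(f"Grok 4:  {core_overall(98.4, 96.7):.3f}")
```

Under these assumed weights, halving the execution score costs far more than a small slip in constraint, which is the asymmetry the bottom four models illustrate.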
Structural Reasons for the Abnormal Declines of Four Models
Gemini 3.1 Pro's main score fell by 28.3 points, with its execution score dropping 50 points; Gemini 2.5 Pro dropped 25 points, also with a 50-point decline in execution score and a slight 5.5-point drop in constraint score. 文心一言4.5's execution score fell by 44.1 points, resulting in a 22.2-point decline in its main score. GPT-5.5's execution score dropped by 50 points, leading to a 20.5-point decline in its main score.
These declines are concentrated in the execution dimension, with limited or no change in the constraint dimension. Because the Smoke evaluation has only 10 questions, a small number of failed execution-type questions can move the execution score sharply, and the 0.55 execution weight carries that drop into the main score, causing daily scores to plunge by more than 20 points.
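To make the amplification concrete, the sketch below splits a main-score change into the part contributed by each dimension. It assumes a linear two-term formula with a constraint weight of 0.45, which this report does not state. The function and parameter names are illustrative only.

```python
EXECUTION_WEIGHT = 0.55   # stated in the report
CONSTRAINT_WEIGHT = 0.45  # assumption, not stated in the report


def main_score_change(execution_delta: float, constraint_delta: float = 0.0) -> dict:
    """Split a main-score change into per-dimension contributions."""
    from_execution = EXECUTION_WEIGHT * execution_delta
    from_constraint = CONSTRAINT_WEIGHT * constraint_delta
    return {
        "from_execution": from_execution,
        "from_constraint": from_constraint,
        "total": from_execution + from_constraint,
    }


if __name__ == "__main__":
    # A 50-point execution drop with an unchanged constraint score.
    change = main_score_change(execution_delta=-50)
    print(f"execution contribution: {change['from_execution']:.1f}")  # -27.5
```

Under these assumptions, a 50-point execution drop alone accounts for about 27.5 main-score points. Where a reported decline differs from that figure, the difference would come from movement in the constraint score or from terms not described in this report.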
Balanced Characteristics of High-Scoring Models
Claude Opus 4.7 and Qwen3 Max tied for first place with a perfect score of 100, both showing no weaknesses in code execution and material constraint. Grok 4, with an execution score of 98.4 and a constraint score of 96.7, achieved a main score of 97.64, ranking seventh, still maintaining a close balance between execution and constraint.
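As a rough check on this ordering, the sketch below ranks the seven models whose execution and constraint scores are both given in this report. The 0.45 constraint weight is an assumption, the helper names are hypothetical, and the order among tied models is arbitrary.

```python
# (execution, constraint) pairs as reported for today's run.
SCORES = {
    "Claude Opus 4.7": (100.0, 100.0),
    "Qwen3 Max": (100.0, 100.0),
    "Claude Sonnet 4.6": (100.0, 96.7),
    "DeepSeek V4 Pro": (100.0, 96.7),
    "豆包Pro": (100.0, 96.7),
    "GPT-o3": (100.0, 96.7),
    "Grok 4": (98.4, 96.7),
}


def main_score(execution, constraint, w_exec=0.55, w_con=0.45):
    """Weighted main score; w_con=0.45 is an assumed weight."""
    return w_exec * execution + w_con * constraint


def rank(scores):
    """Return model names sorted by main score, highest first."""
    return sorted(scores, key=lambda name: main_score(*scores[name]), reverse=True)


if __name__ == "__main__":
    for position, name in enumerate(rank(SCORES), start=1):
        print(position, name, f"{main_score(*SCORES[name]):.3f}")
```

With these inputs, the six models with an execution score of 100 sort ahead of Grok 4, consistent with its reported seventh place.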
Today's data shows that models with an execution score of 100 occupy all of the top six positions on the main leaderboard, and a constraint score of 96.7 has become the current floor. No model has fallen below this constraint score yet.
The four models with an execution score of 50 still maintain high constraint scores, indicating that material constraint ability has not collapsed in tandem, and the issue is concentrated on the stability of code execution paths.
Execution-score drops of roughly 44 to 50 points have pushed the four models from the top six directly to the bottom four on the main leaderboard, with the weight coefficient of 0.55 amplifying this structural gap.
Today's Smoke evaluation only reflects the results of a single day's 10 questions, and the significant fluctuations in execution scores require subsequent multi-day data to verify their persistence.
Data source: YZ Index | Run #188
© 2026 Winzheng.com 赢政天下 | When republishing, please credit the source and include a link to the original article