GPT-5.5 Execution Score Plummets to 50; Gemini 3.1 Pro Main Score Drops 28.3 Points

Jun 20, 2026 | Source: Winzheng Index

GPT-5.5 Code Execution Smoke Test: Execution-Constraint Imbalance Sends the Main Score Plunging

In the Smoke lightweight evaluation of June 20, 2026, GPT-5.5's main score fell from 93 the previous day to 72.5. Its execution score dropped from 100 to 50, and its constraint score declined from 115.5 to 100.

Clear Divergence in Execution and Constraint Structure

The top seven models all have execution scores of 98.4 or higher. Claude Opus 4.7 and Qwen3 Max achieved perfect scores in both execution and constraint. Claude Sonnet 4.6, DeepSeek V4 Pro, Doubao Pro (豆包Pro), and GPT-o3 each scored 100 in execution and 96.7 in constraint, forming a stable pattern of "perfect execution + slight concession in constraint".

The bottom four models show the opposite pattern: GPT-5.5, ERNIE Bot 4.5 (文心一言4.5), Gemini 2.5 Pro, and Gemini 3.1 Pro all saw their execution scores drop to 50, while their constraint scores held at 96.7 to 100. Because the core_overall formula weights execution at 0.55, the halved execution scores pull these four models' main scores down sharply.
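To make the weighting concrete, here is a minimal sketch of a two-term weighted sum. Only the 0.55 execution weight comes from the article. The 0.45 constraint weight and the function name `core_overall_sketch` are assumptions, chosen because they reproduce GPT-5.5's reported main score of 72.5. The actual core_overall formula may include terms not shown here.

```python
# Sketch only: a two-term weighted sum, not the confirmed core_overall formula.
EXECUTION_WEIGHT = 0.55   # stated in the article
CONSTRAINT_WEIGHT = 0.45  # assumption: inferred so that the weights sum to 1


def core_overall_sketch(execution: float, constraint: float) -> float:
    """Hypothetical helper: combine the two dimension scores into a main score."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraint


# Execution score of 50 with the two constraint scores reported for the bottom four.
for constraint in (100.0, 96.7):
    score = core_overall_sketch(50.0, constraint)
    print(f"execution=50, constraint={constraint}: {score:.3f}")
# prints 72.500 for constraint=100.0 and 71.015 for constraint=96.7
```

Under these assumptions, a model with an execution score of 50 lands between roughly 71.0 and 72.5 regardless of how well it holds its constraint score.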

Structural Reasons for the Abnormal Declines of Four Models

Gemini 3.1 Pro's main score fell by 28.3 points as its execution score dropped 50 points. Gemini 2.5 Pro's main score fell by 25 points, also with a 50-point decline in execution and a smaller 5.5-point drop in constraint. ERNIE Bot 4.5 (文心一言4.5) lost 44.1 points in execution, resulting in a 22.2-point decline in its main score. GPT-5.5's execution score dropped by 50 points, leading to a 20.5-point decline in its main score.

These declines are concentrated in the execution dimension, with limited or no change in the constraint dimension. In the 10-question Smoke evaluation, results on execution-type questions feed the dimension weighted at 0.55, so failures there are enough to move a model's daily main score by more than 20 points.

Balanced Characteristics of High-Scoring Models

Claude Opus 4.7 and Qwen3 Max tied for first place with a perfect score of 100, both showing no weaknesses in code execution and material constraint. Grok 4, with an execution score of 98.4 and a constraint score of 96.7, achieved a main score of 97.64, ranking seventh, still maintaining a close balance between execution and constraint.
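As a check on Grok 4's figures, the arithmetic can be worked through under the assumption that the main score is a two-term weighted sum with 0.55 on execution (stated in the article) and 0.45 on constraint (inferred, not stated). The symbols E and C are introduced here only for illustration.

```latex
% Assumption: main score = 0.55 * E + 0.45 * C, with E = execution, C = constraint
\[
0.55 \times 98.4 + 0.45 \times 96.7 = 54.12 + 43.515 = 97.635 \approx 97.64
\]
```

The result agrees with the reported main score of 97.64, which suggests the two dimensions account for the full main score on this day's run.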

Today's data shows that the six models with an execution score of 100 occupy all of the top six positions on the main leaderboard, and a constraint score of 96.7 has become the current floor. No model has fallen below this constraint score yet.

The four models with an execution score of 50 still maintain high constraint scores, indicating that material constraint ability has not collapsed in tandem, and the issue is concentrated on the stability of code execution paths.

A collective drop of 50 points in the execution dimension has pushed the four models from the top six directly to the bottom four on the main leaderboard, with the weight coefficient of 0.55 amplifying this structural rift.

Today's Smoke evaluation only reflects the results of a single day's 10 questions, and the significant fluctuations in execution scores require subsequent multi-day data to verify their persistence.

Data source: YZ Index | Run #188
