8 of 11 Models Plunge on Code Execution, GPT-5.5 Leads Smoke Lightweight List with 95.24 Points

In the YZ Index Smoke lightweight evaluation of June 14, 2026, GPT-5.5 ranked first on the main list with a score of 95.24 (Code Execution 96, Material Constraint 94.3 [pass]). It kept both its execution and constraint scores above 90, giving it the most balanced high-score profile on the list.

Execution and Constraint Synergy Determines Ranking

Gemini 3.1 Pro ranked second on the main list with 92.46 points. Its Code Execution score of 97.5 is higher than GPT-5.5's 96, but its Material Constraint score of only 86.3 left it 2.78 points behind overall. GPT-o3 also scored 97.5 in execution, with 84 in constraint, for a main list score of 91.43, close behind in third. The top three are separated by just 1.5 points in execution, so constraint is the deciding factor in their ranking.

Claude Opus 4.7 scored only 47.5 in execution and 97.3 in constraint, yielding a main list score of 69.91; Claude Sonnet 4.6 scored 50 in execution and 93 in constraint, with a main list score of 69.35. Both Claude models post strong constraint scores (Opus 4.7's 97.3 is the highest cited in this report) but score 50 or lower in execution, revealing a clear weakness in code tasks.
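The report does not state how the main list score is derived from the two dimensions. The published figures are, however, consistent with a weighted average of 55% Code Execution and 45% Material Constraint, rounded half-up to two decimals. The sketch below checks that reading against the scores quoted above; the weights and the function name main_list_score are assumptions inferred from the numbers, not a documented YZ Index formula.

```python
from decimal import Decimal, ROUND_HALF_UP

# ASSUMPTION: weights inferred from the published scores, not documented by YZ Index.
EXEC_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def main_list_score(execution: str, constraint: str) -> Decimal:
    """Weighted average of the two dimensions, rounded half-up to two decimals."""
    raw = EXEC_WEIGHT * Decimal(execution) + CONSTRAINT_WEIGHT * Decimal(constraint)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (execution, constraint, published main list score) as quoted in the report.
REPORTED = {
    "GPT-5.5": ("96", "94.3", "95.24"),
    "Gemini 3.1 Pro": ("97.5", "86.3", "92.46"),
    "GPT-o3": ("97.5", "84", "91.43"),
    "Claude Opus 4.7": ("47.5", "97.3", "69.91"),
    "Claude Sonnet 4.6": ("50", "93", "69.35"),
}

for model, (execution, constraint, published) in REPORTED.items():
    computed = main_list_score(execution, constraint)
    status = "matches" if computed == Decimal(published) else "differs from"
    print(f"{model}: computed {computed} {status} published {published}")
```

Decimal arithmetic is used instead of floats so that borderline values such as 95.235 round the same way the published figures appear to.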

Multiple Models See Execution Scores Drop Sharply

Compared to yesterday, Doudou Pro’s main list score dropped 31.1 points to 59.28, as its Code Execution score plunged 61.6 points to 38.4. Qwen3 Max’s main list score also dropped 31.1 points, to 52.89, with execution falling 78.3 points to 21.7. DeepSeek V4 Pro’s main list score dropped 25.5 points, with execution down 61.6 points. All three posted gains of varying size in constraint, but the execution declines far outweighed those gains and dragged down the main list scores.
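To see how an execution drop can outweigh a constraint gain, the day-over-day change in the main list score can be split into its two contributions. The sketch below assumes the main list score is a 55/45 weighted average of execution and constraint; that weighting is inferred from the published scores and is not stated by the source. The helper implied_constraint_change is a hypothetical name, and because the reported changes are rounded to one decimal, the results are approximate.

```python
# ASSUMPTION: 55/45 weighting inferred from the published scores, not documented.
EXEC_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def implied_constraint_change(main_change: float, exec_change: float) -> float:
    """Solve main_change = w_exec * exec_change + w_con * constraint_change."""
    return (main_change - EXEC_WEIGHT * exec_change) / CONSTRAINT_WEIGHT


# (main list change, execution change) versus yesterday, as quoted in the report.
REPORTED_CHANGES = {
    "Doudou Pro": (-31.1, -61.6),
    "Qwen3 Max": (-31.1, -78.3),
    "DeepSeek V4 Pro": (-25.5, -61.6),
}

for model, (main_change, exec_change) in REPORTED_CHANGES.items():
    exec_part = EXEC_WEIGHT * exec_change
    constraint_change = implied_constraint_change(main_change, exec_change)
    print(
        f"{model}: execution contributes {exec_part:+.1f} to the main list, "
        f"implied constraint change {constraint_change:+.1f}"
    )
```

Under this assumed weighting, the implied constraint change comes out positive for all three models, which agrees with the report's statement that constraint scores rose while the main list scores fell.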

Grok 4’s execution score fell 19.1 points today, leaving it with a main list score of 81.85 and dropping it to fourth place. Gemini 2.5 Pro’s execution score dropped 45 points, for a main list score of 70.53; Claude Opus 4.7’s execution score dropped 52.5 points, for a main list score of 69.91. In each case the decline is concentrated in the Code Execution dimension, while the Material Constraint score actually rose.

Possible Causes of Anomalous Signals

Today, 8 out of 11 models saw double-digit drops in their main list scores, with the losses concentrated in execution. Most constraint scores increased, which suggests that the test materials themselves did not become harder; the more likely explanation is that the code execution tasks became more difficult, or that models adapted unevenly to new test cases. Qwen3 Max’s and Doudou Pro’s execution scores have fallen to the 20–40 point range (21.7 and 38.4), a stark contrast to yesterday’s high scores.

Ernie Bot 4.5 scored 49.65 on the main list, with execution at 21.7 and constraint at 83.8, remaining at the bottom. DeepSeek V4 Pro’s constraint score of 90.5 is high for the lower half of the list, but its execution score of 38.4 limits its overall ranking.

The structural difference between execution and constraint scores reveals the true boundaries of model capability more effectively than a single total score.
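One simple way to make that structural difference visible is to compute the gap between the two dimensions for each model. The sketch below uses only the models for which this report quotes both scores; the function name dimension_gap and the choice of a plain difference as the measure are illustrative assumptions, not part of the YZ Index methodology.

```python
# (Code Execution, Material Constraint) as quoted in the report.
SCORES = {
    "GPT-5.5": (96.0, 94.3),
    "Gemini 3.1 Pro": (97.5, 86.3),
    "GPT-o3": (97.5, 84.0),
    "Claude Opus 4.7": (47.5, 97.3),
    "Claude Sonnet 4.6": (50.0, 93.0),
    "DeepSeek V4 Pro": (38.4, 90.5),
    "Ernie Bot 4.5": (21.7, 83.8),
}


def dimension_gap(execution: float, constraint: float) -> float:
    """Execution minus constraint; negative means execution is the weaker side."""
    return round(execution - constraint, 1)


# Sort from most balanced (smallest absolute gap) to least balanced.
ranked = sorted(SCORES.items(), key=lambda item: abs(dimension_gap(*item[1])))

for model, (execution, constraint) in ranked:
    print(f"{model}: gap {dimension_gap(execution, constraint):+.1f}")
```

A gap near zero with both scores high corresponds to a balanced profile; a large negative gap marks models whose total is held up by constraint while execution lags.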

Today’s Smoke data indicates that Code Execution has become the core variable distinguishing model tiers. GPT-5.5, with high scores in both execution and constraint, looks likely to hold its leading position in the near term.


Data source: YZ Index (Winzheng Index) | Run #170