GPT-4o Code Execution Plummets 23.7 Points: Version Update Triggers Performance Avalanche

Mar 31, 2026 | Source: winzheng.com

Tags: GPT-4o, Code Execution, Performance Decline, Model Evaluation, Version Update

The latest YZ Index evaluation data shows that the v5 version of GPT-4o has run into a major performance crisis in code execution. On the index's 100-point scale, the model's code execution score plummeted from 78.0 to 62.8 points, marking the largest decline in recent records.

Complete Collapse: Six Out of Seven Dimensions Plunge

The problems exposed by this evaluation extend far beyond code execution. Counting code execution itself, six of the seven evaluation dimensions declined significantly. The other five are:

Cost-effectiveness dimension: Dropped from 79.0 to 24.9, a decline of 54.1 points
Stability dimension: Dropped from 80.0 to 27.8, a decline of 52.2 points
Knowledge synthesis dimension: Dropped from 79.0 to 47.2, a decline of 31.8 points
Material constraints dimension: Dropped from 80.1 to 49.1, a decline of 31.0 points
Usability dimension: Dropped from 100.0 to 79.0, a decline of 21.0 points

The overall score plummeted from 81.1 to 49.3 points, a drop of 31.8 points, or roughly 39%.
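
The declines above are simple differences between the earlier and later scores. As a quick check, the following minimal Python sketch recomputes each point drop and the matching percentage decline from the figures reported in this article. The dictionary layout and the function names (`summarize`, `build_report`) are illustrative assumptions and are not part of any YZ Index tooling. Note that the code execution scores quoted in the opening paragraph (78.0 and 62.8) work out to a 15.2-point difference, not the 23.7 points in the headline; the source gives no breakdown of that figure.

```python
# Scores as reported in the article: (before, after) on a 100-point scale.
SCORES = {
    "Code execution": (78.0, 62.8),
    "Cost-effectiveness": (79.0, 24.9),
    "Stability": (80.0, 27.8),
    "Knowledge synthesis": (79.0, 47.2),
    "Material constraints": (80.1, 49.1),
    "Usability": (100.0, 79.0),
    "Overall": (81.1, 49.3),
}


def summarize(before: float, after: float) -> tuple[float, float]:
    """Return (point drop, percent drop relative to the earlier score)."""
    drop = round(before - after, 1)
    pct = round(100.0 * (before - after) / before, 1)
    return drop, pct


def build_report(scores: dict) -> dict:
    """Apply summarize() to every dimension in the score table."""
    return {name: summarize(b, a) for name, (b, a) in scores.items()}


if __name__ == "__main__":
    for name, (drop, pct) in build_report(SCORES).items():
        print(f"{name:<22} -{drop:>5.1f} pts  ({pct:.1f}% lower)")
```

Reporting the percentage alongside the raw point drop makes it easier to compare dimensions that started from different baselines, such as usability (100.0) and code execution (78.0).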

Stability Crisis: Severe Deterioration in Response Consistency

The 52.2-point plunge in the stability dimension is particularly concerning. YZ Index's stability score is calculated based on the consistency of model responses, derived by analyzing the standard deviation of multiple answers to the same question. The low score of 27.8 points indicates that GPT-4o v5 exhibits severe inconsistency when handling identical questions.

In practice, this instability means that users who ask the same programming question several times may receive drastically different code implementations, or even logically contradictory answers. For production environments that require reliability, such behavior is catastrophic.
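
The article does not give YZ Index's exact stability formula, only that it is derived from the standard deviation of multiple answers to the same question. One plausible way to turn that idea into a 0-100 score is sketched below in Python, purely as an assumption: each repeated answer is assumed to have already been graded on a 0-100 scale, and the function name `stability_score` and the `max_std` normalizer are hypothetical.

```python
from statistics import mean, pstdev


def stability_score(runs_per_question, max_std=50.0):
    """Map the average per-question standard deviation onto a 0-100 scale.

    runs_per_question: list of lists; each inner list holds the grades
    (0-100) that repeated answers to one question received.
    max_std: assumed normalizer. 50 is the largest population standard
    deviation possible for values bounded in [0, 100].
    """
    if not runs_per_question:
        raise ValueError("need at least one question")
    # Spread of the repeated answers for each question.
    spreads = [pstdev(runs) for runs in runs_per_question]
    # Low average spread -> high stability; clamp at 0 for safety.
    return max(0.0, 100.0 * (1.0 - mean(spreads) / max_std))


if __name__ == "__main__":
    consistent = [[80, 82, 81], [75, 75, 76]]
    erratic = [[95, 40, 70], [20, 88, 55]]
    print(f"consistent model: {stability_score(consistent):.1f}")
    print(f"erratic model:    {stability_score(erratic):.1f}")
```

Under this mapping, a model that gives identically graded answers on every repeat scores 100, while one whose answers swing between the extremes of the scale scores 0. The real index may weight or normalize the deviation differently.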

Version Update: Performance Enhancement or Regression?

A version increment such as v4 to v5 typically implies feature enhancements or performance optimizations, but GPT-4o's update shows the opposite trend. The simultaneous decline across multiple dimensions suggests that this is not the degradation of a single function but a systematic problem, possibly in the model's overall architecture or training strategy.

Possible causes include:

Overfitting due to excessive optimization: Sacrificing the model's generalization capability to improve performance in certain specific scenarios
Computational resource compression: Reducing computational resource allocation during model inference to lower operational costs
Training data contamination: The new version's training may have introduced lower-quality datasets
Architectural adjustment errors: Introducing inadequately tested changes during model structure optimization

Industry Impact: Trust Crisis and Choice Dilemma

The drop in cost-effectiveness from 79.0 to 24.9 points means that users paying the same cost receive less than one-third of the previous value (24.9 / 79.0 ≈ 0.32). This sharp deterioration not only affects individual developers' choices but may also shake enterprise users' confidence in OpenAI's product roadmap.

In the increasingly competitive large-model market, this performance avalanche opens a window of opportunity for competitors. It highlights the relative advantages of alternatives such as Claude 3.5 and Gemini, and users are likely to reassess their migration costs.

GPT-4o v5's performance is a reminder to the industry that version-to-version quality stability matters as much as iteration speed. If frequent updates bring dramatic performance fluctuations, what ultimately suffers is user trust, the most valuable asset.

Data source: YZ Index
