GPT-4o Code Execution Plummets 15.2 Points: Version Update Triggers Performance Avalanche

The latest YZ Index evaluation data shows a sharp regression in the code execution (v5) version of GPT-4o. On the index's 100-point scale, the model's code execution score fell from 78.0 to 62.8, a decline of 15.2 points and the largest in recent records.

Complete Collapse: Six Out of Seven Dimensions Plunge

The problems exposed by this evaluation extend well beyond code execution. Counting code execution itself, six of the seven evaluation dimensions declined significantly. The other five are:

  • Cost-effectiveness dimension: Dropped from 79.0 to 24.9, a decline of 54.1 points
  • Stability dimension: Dropped from 80.0 to 27.8, a decline of 52.2 points
  • Knowledge synthesis dimension: Dropped from 79.0 to 47.2, a decline of 31.8 points
  • Material constraints dimension: Dropped from 80.1 to 49.1, a decline of 31.0 points
  • Usability dimension: Dropped from 100.0 to 79.0, a decline of 21.0 points

The overall score fell from 81.1 to 49.3 points, a decline of 31.8 points, or roughly 39%.
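The declines quoted above can be rechecked with a few lines of Python. This is only a sketch: the scores are copied from the figures in this article, while the `decline` helper and the relative-percentage column are illustrative additions, not part of the YZ Index methodology.

```python
# Before/after scores as quoted in this article (100-point scale).
scores = {
    "code execution": (78.0, 62.8),
    "cost-effectiveness": (79.0, 24.9),
    "stability": (80.0, 27.8),
    "knowledge synthesis": (79.0, 47.2),
    "material constraints": (80.1, 49.1),
    "usability": (100.0, 79.0),
    "overall": (81.1, 49.3),
}


def decline(before, after):
    """Return (absolute drop in points, relative drop in percent)."""
    drop = before - after
    return round(drop, 1), round(100 * drop / before, 1)


# Hypothetical helper output: one (points, percent) pair per dimension.
report = {name: decline(b, a) for name, (b, a) in scores.items()}

for name, (points, pct) in report.items():
    print(f"{name:22s} -{points:5.1f} points ({pct:.1f}% lower)")
```

Reporting the relative drop alongside the point drop shows how far each score fell in proportion to its starting value.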

Stability Crisis: Severe Deterioration in Response Consistency

The 52.2-point plunge in the stability dimension is particularly concerning. YZ Index's stability score is calculated based on the consistency of model responses, derived by analyzing the standard deviation of multiple answers to the same question. The low score of 27.8 points indicates that GPT-4o v5 exhibits severe inconsistency when handling identical questions.
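To illustrate how a consistency score of this kind could work, the sketch below maps the standard deviation of repeated runs onto a 0-100 scale. The linear mapping, the `max_spread` parameter, and the sample numbers are all assumptions made for illustration. YZ Index's actual formula is not given in the source.

```python
from statistics import pstdev


def stability_score(run_scores, max_spread=50.0):
    """Hypothetical mapping from spread to a 0-100 stability score.

    Zero spread across runs gives 100; a standard deviation of
    max_spread or more gives 0. This is NOT the YZ Index formula.
    """
    spread = pstdev(run_scores)  # population standard deviation
    return max(0.0, 100.0 * (1.0 - spread / max_spread))


# Made-up per-run quality scores for the same question asked five times.
consistent = [80, 81, 79, 80, 80]
erratic = [95, 40, 70, 20, 85]

print(round(stability_score(consistent), 1))
print(round(stability_score(erratic), 1))
```

Under this assumed mapping, a model whose answers vary widely from run to run scores far lower than one whose answers cluster tightly, even if the two have similar average quality.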

In practical applications, this instability manifests as: users asking the same programming question multiple times may receive drastically different code implementations, or even logically contradictory answers. For production environments requiring reliability, such performance is undoubtedly catastrophic.

Version Update: Performance Enhancement or Regression?

The version increment from v4 to v5 typically implies feature enhancements or performance optimizations, but GPT-4o's update shows the opposite trend. The simultaneous decline across multiple dimensions suggests this is not a degradation of a single function, but a systematic problem with the model's overall architecture or training strategy.

The evaluation data does not identify a cause. Possible explanations include:

  • Overfitting due to excessive optimization: Sacrificing the model's generalization capability to improve performance in certain specific scenarios
  • Computational resource compression: Reducing computational resource allocation during model inference to lower operational costs
  • Training data contamination: The new version's training may have introduced lower-quality datasets
  • Architectural adjustment errors: Introducing inadequately tested changes during model structure optimization

Industry Impact: Trust Crisis and Choice Dilemma

The cost-effectiveness score dropping from 79.0 to 24.9 points means that, if the score tracks value delivered per unit of cost, users are paying the same price for less than one-third of the value (24.9 / 79.0 ≈ 0.32). This sharp deterioration not only affects individual developers' choices but may also shake enterprise users' confidence in OpenAI's product roadmap.

In the increasingly competitive large model market, this performance avalanche gives competitors a window of opportunity. It highlights the relative advantages of alternatives such as Claude 3.5 and Gemini, and users may reassess whether the cost of migrating is now worth paying.

GPT-4o v5's performance serves as a reminder to the industry: while pursuing model iteration speed, keeping quality stable from one version to the next is equally important. If frequent updates are accompanied by dramatic performance fluctuations, what will ultimately be damaged is user trust, the most valuable asset.


Data source: YZ Index