GPT-o3 Code Execution Plummets 42.5 Points, Main Score Drops 18 Points in a Day

May 24, 2026 | Winzheng Index

In today's Smoke evaluation, GPT-o3's code execution score fell from 90.00 to 47.50, and its main score dropped 18 points to 58.08. A swing of this size calls for an explanation.

The data speaks for itself

The Smoke evaluation includes only 10 questions per day, 2 per dimension, so day-to-day variance is inherently large. Even so, code execution dropped 42.5 points in a single day, and engineering judgment fell from 50.00 to 10.00 at the same time. The combined effect was a net loss of 18 points on the main leaderboard. By contrast, material constraint rose by 12 points and task expression stayed flat, which suggests the problems are concentrated in tasks that require precise reasoning and multi-step execution.
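The arithmetic behind these figures can be checked with a short sketch. The dictionary keys and function name are illustrative, not part of the index's tooling, and the previous main score is inferred from the reported drop (which may itself be rounded) rather than stated in the data.

```python
# Scores as reported in the article: (previous day, today).
reported = {
    "code_execution": (90.00, 47.50),
    "engineering_judgment": (50.00, 10.00),
}

def day_over_day_change(scores):
    """Return today's score minus the previous day's score for each dimension."""
    return {name: round(today - prev, 2) for name, (prev, today) in scores.items()}

changes = day_over_day_change(reported)

# The article gives today's main score and the size of the drop;
# the previous main score is inferred here, not reported.
main_today = 58.08
main_drop = 18.0
main_previous_inferred = round(main_today + main_drop, 2)

for name, delta in changes.items():
    print(f"{name}: {delta:+.2f}")
print(f"main score (inferred previous): {main_previous_inferred:.2f}")
```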

Sampling fluctuation or true degradation

Looking at a single day in isolation, question-difficulty sampling remains the most likely explanation. However, two dimensions each dropping by roughly 40 points at the same time is harder to fit within normal daily fluctuation. More critically, engineering judgment (a side leaderboard scored with AI-assisted evaluation) collapsed simultaneously, which typically indicates a marked decline in output consistency in scenarios that involve implicit constraints and trade-offs.
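How large a swing sampling alone can produce depends on the per-question score spread, which the article does not report. The sketch below is a rough way to frame the question under two loud assumptions: question scores are independent draws with a common standard deviation, and each day's dimension score is the mean of its questions. The function names and the example spread values are hypothetical.

```python
import math

def day_to_day_sd(per_question_sd: float, questions_per_day: int) -> float:
    """SD of the difference between two daily means of independent question scores.

    Each daily mean has variance sd^2 / n, so the difference of two
    independent daily means has variance 2 * sd^2 / n.
    """
    return per_question_sd * math.sqrt(2.0 / questions_per_day)

def drop_in_sd_units(drop: float, per_question_sd: float, questions_per_day: int = 2) -> float:
    """Express an observed drop as a multiple of the day-to-day SD."""
    return drop / day_to_day_sd(per_question_sd, questions_per_day)

# Hypothetical per-question spreads; none of these come from the source data.
for assumed_sd in (10.0, 25.0, 40.0):
    units = drop_in_sd_units(42.5, assumed_sd)
    print(f"assumed per-question SD {assumed_sd:.0f}: drop is {units:.2f} SDs")
```

With only 2 questions per dimension, the day-to-day SD equals the per-question SD, so the verdict hinges almost entirely on a quantity that would have to be estimated from historical runs.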

OpenAI has recently been iterating rapidly on the o-series models. If o3 has entered an internal fine-tuning or distillation phase, compressed reasoning paths would most likely affect the robustness of code execution first. That would be consistent with the "cliff-like drop in execution accuracy" observed in this evaluation.

Should we pay special attention?

Yes. The Smoke evaluation is only a snapshot, but when a core capability dimension falls by more than 40 points in a day and engineering judgment deteriorates alongside it, luck alone is no longer a sufficient explanation. We recommend tracking the same model over the next 3-5 evaluation days. If the code execution dimension does not return above 75 points, the drop can reasonably be treated as a real capability regression rather than sampling noise.
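One possible reading of this tracking rule, sketched as code. The function name, labels, and the choice to wait for the full 5 days (the upper end of the 3-5 day range) before calling a regression are assumptions made for illustration; the article does not specify how the window should be applied.

```python
def classify_followup(scores, threshold=75.0, window_days=5):
    """Classify a drop from the code execution scores of the days that follow it.

    scores: daily code execution scores after the drop, oldest first.
    threshold: recovery level from the article (must be exceeded, not just met).
    window_days: assumed tracking window, taken from the upper end of "3-5 days".
    """
    window = scores[:window_days]
    if any(score > threshold for score in window):
        return "recovered"          # consistent with sampling noise
    if len(window) < window_days:
        return "keep tracking"      # not enough days observed yet
    return "likely regression"      # never climbed back above the threshold

# Hypothetical follow-up sequences, not real evaluation data.
print(classify_followup([60.0, 80.0]))
print(classify_followup([50.0, 60.0]))
print(classify_followup([50.0, 60.0, 70.0, 65.0, 75.0]))
```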

GPT-o3's integrity rating currently remains at "pass," indicating no obvious hallucinations or boundary-crossing issues. That rating, however, offers no protection for execution capability: once execution capability degrades, safety alignment cannot repair it quickly.

42.5 points is not luck; it's a signal.

Data source: YZ Index | Run #129
