In today's Smoke evaluation, GPT-o3's code execution score collapsed from 90.00 to 47.50, and its main score fell 18 points to 58.08. A drop of that size calls for an explanation.
The data speaks for itself
The Smoke evaluation includes only 10 questions per day, 2 per dimension, so day-to-day variance is inherently large. Even so, code execution dropped 42.5 points in a single day, and engineering judgment fell from 50.00 to 10.00 at the same time. The combined effect was a net loss of 18 points on the main leaderboard. In contrast, material constraint rose by 12 points and task expression remained flat, which suggests the problems are concentrated in tasks that require precise reasoning and multi-step execution.
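The single-day changes quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two dimensions whose before and after scores are both stated in the text; the dictionary keys and the function name are illustrative, not part of any YZ Index tooling.

```python
# Before/after scores as quoted in the article. Only dimensions with both
# values stated are included; the key names are illustrative.
scores = {
    "code_execution": (90.00, 47.50),
    "engineering_judgment": (50.00, 10.00),
}

def deltas(table):
    """Return the day-over-day change (after minus before) per dimension."""
    return {name: round(after - before, 2) for name, (before, after) in table.items()}

changes = deltas(scores)
for name, change in changes.items():
    print(f"{name}: {change:+.2f}")
```

Both changes come out at 40 points or more, which is the basis for the comparison with material constraint and task expression.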
Sampling fluctuation or true degradation
Judged on a single day alone, the difficulty of the sampled questions remains the most likely explanation. However, two dimensions each dropping by around 40 points at once is hard to explain as normal daily fluctuation. More critically, engineering judgment (side leaderboard, AI-assisted evaluation) collapsed at the same time, which typically indicates a significant decline in output consistency in scenarios that involve implicit constraints and trade-offs.
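How unusual a drop of this size is depends on how much individual questions vary, which the article does not report. The sketch below is a rough illustration under stated assumptions: the two days are independent, each dimension score is the mean of 2 question scores, and a normal approximation applies (a crude assumption with so few questions). The per-question standard deviations passed in are hypothetical values, not YZ Index measurements.

```python
import math

def drop_tail_probability(drop, per_question_sd, questions_per_day=2):
    """One-sided probability of seeing a drop at least this large by chance.

    Assumes independent days and a normal approximation; per_question_sd
    is a hypothetical input, since the real spread is not published.
    """
    # Standard error of one day's dimension score (mean of n questions).
    se_day = per_question_sd / math.sqrt(questions_per_day)
    # Standard deviation of the difference between two independent days.
    sd_diff = math.sqrt(2) * se_day
    z = drop / sd_diff
    # Upper-tail probability of a standard normal at z.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical per-question spreads, for illustration only.
for sd in (15.0, 25.0, 40.0):
    p = drop_tail_probability(42.5, sd)
    print(f"assumed per-question SD {sd:>4}: P(drop >= 42.5) ~ {p:.4f}")
```

With 2 questions per dimension, the standard deviation of the day-over-day difference equals the per-question standard deviation, so the conclusion is very sensitive to that unknown: the larger the assumed spread, the easier it is to attribute the drop to sampling.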
OpenAI has recently been in a rapid iteration window for the o-series models. If o3 has entered an internal fine-tuning or distillation phase, the compression of reasoning paths would likely affect the robustness of code execution first. That would be consistent with the cliff-like drop in execution accuracy observed in this evaluation.
Should we pay special attention?
Yes. Although the Smoke evaluation is a snapshot, when a core capability dimension falls by more than 40 points in a single day, and engineering judgment deteriorates at the same time, the result can no longer be attributed simply to luck. We recommend tracking the same model over the next 3-5 evaluation days. If the code execution dimension does not return to above 75 points, the drop can reasonably be treated as a real capability regression rather than sampling noise.
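The tracking rule can be written down as a small decision function. This is a sketch of one reading of the rule: any follow-up score above 75 counts as a recovery, and a regression is only called once at least 3 follow-up days have been observed. The function name, labels, and sample scores are hypothetical.

```python
def classify_followup(scores, threshold=75.0, min_days=3):
    """Classify follow-up code execution scores against the 75-point rule.

    scores: daily code execution scores after the drop, oldest first.
    """
    # A single day back above the threshold is treated as a recovery.
    if any(score > threshold for score in scores):
        return "likely sampling noise"
    # Too few follow-up days to decide either way.
    if len(scores) < min_days:
        return "keep tracking"
    # No recovery within the observation window.
    return "likely real regression"

# Hypothetical follow-up scores, for illustration only.
print(classify_followup([52.5, 60.0]))
print(classify_followup([52.5, 60.0, 82.5]))
print(classify_followup([52.5, 60.0, 57.5, 65.0]))
```

A stricter variant could require the score to stay above the threshold for several consecutive days before ruling out a regression; the text does not specify which reading is intended.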
GPT-o3's integrity rating currently remains at "pass," indicating no obvious hallucinations or boundary-crossing issues. However, that rating offers no protection for execution capability. Once execution capability degrades, safety alignment is unlikely to repair it in the short term.
42.5 points is not luck; it's a signal.
Data source: YZ Index | Run #129
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original when reposting