GPT-o3 Performance Plummets: Technical Concerns Behind 12.1-Point Drop in Knowledge Work Capabilities

This week, GPT-o3 exhibited severe performance degradation in the knowledge work dimension, with scores plummeting from 82.4 to 70.3 points—a decline of 12.1 points. This anomalous performance was concentrated in two core capabilities: logical reasoning and language comprehension, raising deep concerns about model stability.

Severe Degradation in Logical Reasoning

The most typical case is the "scheduling conflict" problem, where GPT-o3's score dropped directly from a perfect 100 to 10 points. This problem required logical reasoning based on given constraints to determine the scheduling arrangement for five employees. GPT-o3's answer was:

Monday: E
Tuesday: A
Wednesday: C
Thursday: B
Friday: D

This answer exposed serious deficiencies in the model's handling of constraint satisfaction problems. Correct logical reasoning requires simultaneous consideration of multiple constraints and systematic elimination, but GPT-o3 appeared to perform only simple sequential assignment, completely ignoring the conflict verification requirements in the problem.

Significant Decline in Language Understanding Precision

In translation tasks, GPT-o3 similarly performed poorly. The "Legal Terms English-to-Chinese Translation" score dropped from 100 to 75 points. While the translation results were basically accurate, there were detailed deviations in handling professional terminology:

Limitation of Liability: Neither party shall be liable for any indirect, incidental, consequential, special, or punitive damages arising out of or related to this agreement...

More notably, the "Colloquial English-to-Chinese Translation" score dropped from 85.7 to 71.4. While GPT-o3's translation retained the colloquial features of the original text, there were obvious deficiencies in tone grasp and cultural adaptation, lacking flexible handling of idioms like "game-changer."

Concerning System Performance Stability

Data shows that GPT-o3's stability score dropped from 51.3 to 43.1, a decline of 8.2 points. This fluctuation not only manifested in single dimensions but showed systematic characteristics: programming ability slightly decreased by 0.9 points, long context processing capability dropped by 1.8 points, and usability also declined by 1.1 points.

Analysis of Possible Technical Causes

Based on evaluation data, this performance decline may stem from the following technical factors:

1. Improper Model Weight Adjustment: OpenAI may have inadvertently weakened the weight allocation for logical reasoning modules while optimizing other capabilities, leading to poor performance on tasks requiring rigorous reasoning.

2. Training Data Contamination: Recent incremental training may have introduced low-quality data, particularly in logical reasoning and professional translation domains, causing model performance degradation.

3. Side Effects of Inference Optimization: To improve response speed (cost-effectiveness score also dropped by 1.9 points), more aggressive inference optimization strategies may have been adopted, sacrificing some accuracy.

4. Context Window Management Issues: The 1.8-point decline in long context scores indicates that the model's attention mechanism when processing complex information may have deteriorated.

Industry Impact and Outlook

As a mainstream large model, GPT-o3's substantial decline in knowledge work capabilities poses a direct threat to enterprise applications relying on this model. Particularly in scenarios requiring precise logical reasoning, such as project management and legal document processing, such performance fluctuations could lead to serious business risks.

From a technical evolution perspective, this incident once again highlights the engineering challenges faced by large models in pursuing multi-objective optimization. How to maintain the stability of existing capabilities while improving new ones remains a core problem that the entire industry needs to solve. It is recommended that relevant enterprises establish comprehensive model performance monitoring mechanisms when deploying critical business applications and prepare necessary degradation plans.


Data source: YZ Index | Raw Data | YZ Index Home