OpenAI o1 Model Math Capability Controversy: Hallucination Issues Challenge AI Benchmark Validity

OpenAI's o1-preview model has sparked controversy over frequent "hallucinations" on complex math problems despite impressive benchmark scores. The incident has drawn over a million interactions on X and prompted deep reflection on the effectiveness of traditional AI benchmarks.

News Lead: OpenAI's o1-preview model has recently garnered significant attention for its impressive performance on mathematical and reasoning tasks, but controversy has escalated just as quickly. User testing reveals that the model frequently "hallucinates" on complex math problems, generating incorrect yet confidently presented answers. Multiple AI experts have publicly questioned its true capabilities, while OpenAI CEO Sam Altman responded that the model is still in an iterative phase. The discussion has surpassed one million interactions on X, triggering deep industry reflection on the validity of traditional AI benchmarks.

Background Introduction

OpenAI officially released the o1 series in September 2024, positioning the o1-preview and o1-mini versions as "reasoning models" designed to improve performance in mathematics, programming, and scientific reasoning through a strengthened "chain-of-thought" mechanism. The models scored highly on benchmarks such as the American Invitational Mathematics Examination (AIME), the qualifying test for the USA Mathematical Olympiad (USAMO), reaching 83% accuracy on AIME 2024 and far exceeding GPT-4o's 13%. The achievement was initially viewed as a milestone for AI advancing toward "human-level" reasoning.
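
OpenAI has not published o1's internal mechanism, but the basic chain-of-thought idea can be illustrated at the prompting level: ask the model to emit intermediate steps before committing to an answer. The sketch below is a minimal illustration only; the prompt wording and the `generate` stub are assumptions, not OpenAI's implementation.

```python
# Illustrative sketch only: the prompt wording and generate() stub are
# assumptions, not OpenAI's actual (undisclosed) o1 mechanism.

def generate(prompt: str) -> str:
    """Stand-in for any LLM completion call (e.g., an HTTP API client)."""
    raise NotImplementedError("plug in a real model client here")

def chain_of_thought(question: str) -> str:
    # Ask the model to externalize intermediate steps before answering,
    # rather than jumping straight to a final answer.
    prompt = (
        "Solve the following problem. Reason step by step, "
        "then state the final answer on its own line.\n\n"
        f"Problem: {question}"
    )
    return generate(prompt)
```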

However, the honeymoon was short-lived. Soon after release, users in the AI community began sharing test results: on non-standard, open-ended, or multi-step math problems, o1 would generate lengthy reasoning chains yet often reach incorrect conclusions, presented with high confidence. The "hallucination" phenomenon itself is not new, but its severity in o1 drew widespread attention. On X, a post challenging o1's mathematical abilities quickly went viral, accumulating over a million reposts and likes and becoming the week's hottest AI topic.

Core of the Controversy: Hallucination Issues Exposed

The controversy centers on real-world testing by users and researchers. Notable AI blogger @yoheinakajima posted a video on X showing o1 working through a high school geometry proof: it derived the first half correctly but "confidently" introduced incorrect assumptions at crucial steps, ultimately arriving at an absurd conclusion. Similar cases abound: when solving higher-order differential equations, the model fabricates non-existent theorems; on probability problems, it ignores boundary conditions, producing answers that are off by tens of percentage points.

Quantitative data tells a similar story. o1 achieves 74.4% accuracy on official benchmarks such as GPQA (a graduate-level, "Google-proof" question set), but in blind user testing on independent platforms like LMSYS Arena, its win rate on math prompts is only about 1.2 times that of earlier models, and its error rate soars above 30% on long-chain reasoning. Critics point out that while o1's "reasoning tokens" can simulate human thought processes, the underlying process remains probability-based language generation, susceptible to training-data biases that lead to rampant hallucinations.
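
For context on the arena numbers: in blind pairwise testing, two anonymized models answer the same prompt and users vote for the better response; a model's win rate is its wins over total votes. The sketch below uses entirely hypothetical vote counts (not real LMSYS data) to show how a "1.2x" ratio would be derived.

```python
# Hypothetical vote counts for illustration only; not real LMSYS Arena data.
votes = {
    "o1-preview": {"wins": 600, "losses": 400},
    "gpt-4o":     {"wins": 500, "losses": 500},
}

def win_rate(record: dict) -> float:
    # Fraction of pairwise matchups the model won.
    total = record["wins"] + record["losses"]
    return record["wins"] / total

r_new = win_rate(votes["o1-preview"])  # 0.60
r_old = win_rate(votes["gpt-4o"])      # 0.50
print(f"win-rate ratio: {r_new / r_old:.2f}x")  # prints: win-rate ratio: 1.20x
```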

Clashing Viewpoints

Skeptics' Camp: Several AI experts have spoken bluntly. Former Anthropic research director Amanda Askell posted on X: "o1's benchmark scores are impressive, but real-world testing exposes its brittleness. Traditional benchmarks like AIME are too standardized to capture the complexity of open-ended problems." Former OpenAI researcher Suchir Balaji went further: "In high-dimensional reasoning, the model's 'thinking' is just an extension of hallucination, lacking true understanding."

Chinese AI scholars from Fei-Fei Li's lab also joined the discussion, with one anonymous researcher stating: "o1 performs even worse on Chinese-language math problems, where cultural biases amplify hallucination risks." Meanwhile, the independent evaluation firm Scale AI reports that o1's accuracy on its custom math datasets is only 56%, far below the advertised figures.

"o1 isn't a reasoning revolution, but a winner of the benchmark game." — AI critic Timnit Gebru

Support and Response: Sam Altman responded on X: "o1 is our first reasoning model and is still iterating rapidly. Hallucination is a challenge for all LLMs; we are optimizing through more training data and safety mechanisms." OpenAI's official blog emphasizes that o1-preview is, as the name suggests, a preview, with the full version due within weeks and promised "refusal to answer" mechanisms to reduce confidently wrong answers.
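
OpenAI has not disclosed how such a refusal mechanism would work. One approach from the research literature, sketched below purely as an assumption, is self-consistency sampling: draw several independent answers and refuse when they disagree too much, since low agreement often correlates with confidently wrong output. The `sample_answer` stub, the sample count, and the agreement threshold are all hypothetical.

```python
from collections import Counter

def sample_answer(question: str) -> str:
    """Stand-in for one sampled model completion (temperature > 0)."""
    raise NotImplementedError("plug in a real model client here")

def answer_or_refuse(question: str, n: int = 5, min_agreement: float = 0.6) -> str:
    # Sample n independent answers and keep the most common one.
    answers = [sample_answer(question) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    # Refuse when the samples disagree: low self-consistency is a cheap
    # proxy for the model not actually knowing the answer.
    if count / n < min_agreement:
        return "I am not confident enough to answer this."
    return best
```

The obvious trade-off is cost: each question now requires n model calls, so the sample count becomes a dial between reliability and compute.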

Some developers remain optimistic. Hugging Face CEO Clément Delangue stated: "Despite its flaws, o1's reasoning chain far exceeds GPT-4o's, an important step toward AGI."

Impact Analysis: Crisis in AI Evaluation Standards

This controversy extends beyond o1 itself and shakes the foundations of AI evaluation. Traditional benchmarks such as GLUE and SuperGLUE are effectively saturated, while math tests like MATH and GSM8K have long been criticized for data leakage and overfitting. The o1 incident highlights the "benchmark-reality gap": models shine in closed testing but collapse in dynamic, noisy environments.
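
Data-leakage audits typically check whether benchmark items appear near-verbatim in the training corpus. The sketch below shows one rough heuristic, long n-gram overlap, under the simplifying assumption that the corpus fits in memory as a single string; real contamination studies are considerably more elaborate.

```python
def ngrams(text: str, n: int = 13) -> set:
    # 13-token windows are a commonly used heuristic size for
    # contamination checks; shorter windows over-flag common phrases.
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(benchmark_item: str, training_corpus: str) -> bool:
    # Any shared long n-gram suggests the item may have been seen
    # during training and should be audited by hand.
    return bool(ngrams(benchmark_item) & ngrams(training_corpus))
```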

The industry is calling for new paradigms: dynamic evaluation (such as the HLEval framework), human expert review, and multimodal testing. Google DeepMind researcher Jack Rae suggests: "Future benchmarks should simulate real scenarios, including time pressure and uncertainty." On the regulatory front, the incident may also accelerate enforcement of the EU AI Act's transparency requirements for high-risk models.
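
"Time pressure" can be added to an evaluation harness as simply as a wall-clock budget per problem. The following is an assumed sketch of what such a harness might look like (the `solve` stub is hypothetical); attempts that exceed the budget score as wrong.

```python
import concurrent.futures

def solve(problem: str) -> str:
    """Stand-in for a model call that may be slow."""
    raise NotImplementedError("plug in a real model client here")

def timed_eval(problems, answers, budget_s: float = 30.0) -> float:
    correct = 0
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for problem, answer in zip(problems, answers):
            future = pool.submit(solve, problem)
            try:
                # Count the attempt only if it finishes within budget.
                correct += future.result(timeout=budget_s) == answer
            except concurrent.futures.TimeoutError:
                pass  # over budget counts as wrong
    return correct / len(problems)
```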

For OpenAI, the reputational damage is real, but so is the momentum to iterate. Competitors such as Anthropic's Claude 3.5 Sonnet and Google's Gemini 2.0 are seizing the opportunity, touting lower hallucination rates. The ecosystem faces a reshuffle, with investor attention turning to more reliable evaluation tools.

Conclusion: AI's Future in Iteration

The OpenAI o1 math capability controversy serves as a mirror, reflecting AI's painful transition from "showmanship" to "reliability." Hallucination is a thorny problem, but it is also driving the community toward more scientific evaluation standards. As Sam Altman says, model iteration never ends. Looking forward, only models that bridge benchmarks and reality can truly empower human intelligence, and the AI community must learn from this episode to build a trustworthy reasoning era together.