Math Benchmarks (2 articles)

OpenAI o1 Model Math Capability Controversy: Hallucination Issues Challenge AI Benchmark Validity

OpenAI's o1-preview model has sparked controversy: despite impressive benchmark scores, it frequently "hallucinates" on complex math problems. The incident has drawn over a million interactions on X and prompted scrutiny of how well traditional AI benchmarks reflect real capability.

OpenAI o1 Model Achieves Mathematical Reasoning Breakthrough: 83% on ARC-AGI, Ushering in the AI Reasoning Era

OpenAI's newly released o1-preview model has posted strong results on multiple mathematical and coding benchmarks, notably a reported 83% on ARC-AGI, far above GPT-4o's level. The gain is attributed to its "chain of thought" mechanism, which lets the model work through a complex problem step by step, much as a person would.