When your AI assistant tells you "please wait 632 milliseconds before retrying" five times in a row, you know something's seriously wrong. This isn't science fiction; it is GPT-o3's actual performance in this week's long-context evaluation.
The Most Embarrassing Test Failure in History
The latest AI model evaluation results from Winzheng are jaw-dropping: GPT-o3's long-context score fell from 62.3 to 28.8, a drop of 33.5 points. Even more absurd, all 5 core questions failed for the exact same reason: API rate limiting.
Let's look at these failed questions: Root Cause Analysis and Evidence Boundaries, Breaking Changes List, Customer Migration Risk Assessment, Cost Change Calculation, and High-Quality Growth Analysis. Each was a crucial test of the model's ability to handle complex long texts, yet all returned this error message:
Rate limit reached for gpt-4o in organization org-5kL87cAHHWwzzzRXfZoA5jZm on tokens per min (TPM): Limit 30000, Used 29516, Requested 800.
Note this detail: a 30,000-token limit, 29,516 tokens used, 800 requested. The window had only 484 tokens of headroom left, so even an 800-token request was turned away.
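The numbers in that error message can be checked directly. The following is a minimal Python sketch, assuming the message keeps the "Limit ..., Used ..., Requested ..." wording quoted above; the function name `parse_rate_limit` and the returned field names are hypothetical, not part of any official client.

```python
import re

# Error text exactly as quoted in the evaluation log above.
ERROR = (
    "Rate limit reached for gpt-4o in organization org-5kL87cAHHWwzzzRXfZoA5jZm "
    "on tokens per min (TPM): Limit 30000, Used 29516, Requested 800."
)

# Assumption: the three numbers always appear in this order and wording.
PATTERN = re.compile(r"Limit (\d+), Used (\d+), Requested (\d+)")


def parse_rate_limit(message):
    """Extract limit/used/requested and derive headroom, overshoot, utilization."""
    match = PATTERN.search(message)
    if match is None:
        return None  # message did not match the assumed format
    limit, used, requested = (int(group) for group in match.groups())
    headroom = limit - used
    return {
        "limit": limit,
        "used": used,
        "requested": requested,
        "headroom": headroom,                         # tokens still available
        "overshoot": max(0, requested - headroom),    # tokens over the limit
        "utilization": used / limit,                  # fraction of the window used
    }


info = parse_rate_limit(ERROR)
print(info)
```

For the quoted message this gives 484 tokens of headroom, a 316-token overshoot, and utilization of about 98.4%.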
This Is More Than Just a Technical Glitch
On the surface, this appears to be a simple API rate limiting issue. But analyzing the raw logs reveals more serious problems:
- 5 failures occurred within an extremely short timeframe, with the shortest interval being just 140 milliseconds
- Each request's token count was between 600 and 800, well within the normal range
- Post-limit retry times varied randomly from 408 milliseconds to 1.126 seconds
This exposes three fatal flaws in OpenAI's infrastructure:
First, the token calculation mechanism has serious bugs. When usage approaches the limit (98.4%), the system cannot accurately predict remaining capacity, causing normal requests to be rejected.
Second, the rate limiting strategy is overly aggressive. In enterprise-grade API services, there should be buffer mechanisms when usage approaches limits, not outright service denial.
Third, error recovery mechanisms are virtually non-existent. The randomness of retry times indicates the system lacks any reasonable queuing mechanism.
The Truth About Long Context Capabilities
More ironically, just last week OpenAI was heavily promoting GPT-o3's long context processing capabilities. Now it seems when you actually need to process long texts, it might not even let you through the door.
This incident reveals a harsh truth: No matter how capable the model, if the infrastructure can't keep up, everything is meaningless. Especially in scenarios requiring large token processing for long contexts, API stability is more critical than the model's capabilities themselves.
According to the evaluation data, GPT-o3's stability score dropped from 53.0 to 28.0, and availability fell from 100% to 69%. In practical use, that means roughly 1 in 3 calls might fail. For any serious business application, such availability is completely unacceptable.
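To put the 69% figure in perspective, here is a small Python sketch of the arithmetic. It assumes each call fails independently with the same probability, which is a simplification: the five failures described above arrived in a burst, so real failures are likely correlated. The function name `all_attempts_fail` is hypothetical.

```python
def all_attempts_fail(availability, attempts):
    """Probability that every one of `attempts` independent calls fails."""
    return (1 - availability) ** attempts


availability = 0.69  # figure reported in the evaluation data
failure_rate = 1 - availability

print(f"failure rate: {failure_rate:.0%}")
print(f"about 1 failure in every {1 / failure_rate:.1f} calls")
for attempts in (1, 2, 3):
    p = all_attempts_fail(availability, attempts)
    print(f"{attempts} attempt(s): {p:.1%} chance that all fail")
```

Under the independence assumption, retries help quickly; under bursty rate limiting like the incident above, immediate retries may all land in the same exhausted window and help much less.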
OpenAI's Infrastructure Debt
This incident wasn't accidental. Over the past few months, OpenAI's API service has frequently experienced various issues: response delays, service interruptions, and rate limiting anomalies. Each time it's been patched up without truly addressing the root problems.
The reason is simple: OpenAI has invested too many resources in model training while neglecting service infrastructure development. When user volume grows exponentially, that technical debt comes due all at once.
Interestingly, in programming capability tests, GPT-o3's score actually improved by 23.2 points. This shows the model's capabilities aren't the issue; the problem lies in delivery. It's like buying a Ferrari only to find that the car key frequently malfunctions.
A Wake-up Call for Developers
For developers using or planning to use GPT-o3, this incident provides several important lessons:
- Don't over-rely on a single API for critical business processes
- Implement comprehensive fallback and retry mechanisms
- When processing long texts, consider chunked processing instead of single submissions
- Monitor API usage and proactively control it before hitting limits
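Several of these lessons can be sketched on the client side. The Python below is an illustrative sketch, not a production implementation: `RateLimitError` stands in for whatever exception your API client raises, `TokenBudget`, `chunk_words`, and `call_with_backoff` are hypothetical names, the 90% safety factor is an arbitrary choice, and the four-tokens-per-three-words ratio is a crude approximation rather than a real tokenizer.

```python
import random
import time
from collections import deque


class RateLimitError(Exception):
    """Hypothetical stand-in for the rate-limit exception your client raises."""


class TokenBudget:
    """Client-side count of tokens sent during the last `window` seconds."""

    def __init__(self, limit_per_window, window=60.0, safety=0.9, clock=time.monotonic):
        self.capacity = int(limit_per_window * safety)  # stay under the hard limit
        self.window = window
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) pairs, oldest first

    def _used(self):
        """Drop events older than the window and sum what remains."""
        cutoff = self.clock() - self.window
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        return sum(tokens for _, tokens in self.events)

    def wait_time(self, tokens):
        """Seconds to wait until `tokens` fits in the budget (0.0 if it fits now)."""
        if tokens > self.capacity:
            raise ValueError("request exceeds the whole budget; chunk it first")
        used = self._used()
        if used + tokens <= self.capacity:
            return 0.0
        now = self.clock()
        for timestamp, spent in self.events:
            used -= spent  # this event expires at timestamp + window
            if used + tokens <= self.capacity:
                return max(0.0, timestamp + self.window - now)
        return 0.0  # not reached: an empty window always fits the request

    def record(self, tokens):
        """Register tokens that were just sent."""
        self.events.append((self.clock(), tokens))


def chunk_words(text, max_tokens):
    """Split text into word chunks, assuming roughly 4 tokens per 3 words."""
    words = text.split()
    per_chunk = max(1, max_tokens * 3 // 4)
    return [" ".join(words[i:i + per_chunk]) for i in range(0, len(words), per_chunk)]


def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call `send()`, retrying on RateLimitError with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; the caller can switch to a fallback provider
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids synchronized retries
```

A caller would typically split a long document with `chunk_words`, sleep for `budget.wait_time(n)` before each request, call `budget.record(n)` after sending, and wrap the request itself in `call_with_backoff`. The sketch is not thread-safe, and a local counter cannot see tokens consumed by other clients sharing the same organization limit.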
When AI giants can't even guarantee basic API stability, should we rethink our definition of "advanced AI"? While pursuing larger parameter counts and stronger capabilities, shouldn't we first get the infrastructure fundamentals right?
Next time OpenAI releases a new model, I'll care less about parameter counts and benchmark scores than about one question: how long can it run stably without errors? After all, an intermittently brilliant genius is worth less than a consistently reliable ordinary worker.
Data source: YZ Index | Run #37
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original article when reposting.