GPT-4o Crashes: 5 Failed Tests Expose OpenAI's Infrastructure Crisis

When I first saw GPT-4o's latest benchmark data, my initial reaction was that the testing system had a bug. But after carefully examining the raw logs, I realized this was a problem far more serious than model capability degradation—OpenAI's infrastructure is on the brink of collapse.

This isn't alarmist talk. In the long-context test, GPT-4o returned the exact same error for all 5 questions: "Rate limit reached for gpt-4o in organization org-5kL87cAHHWwzzzRXfZoA5jZm on tokens per min (TPM): Limit 30000". What does this mean? It means that an account which is likely OpenAI's own, or an important partner's, could not complete a standard long-text analysis task.

More Than Just Scores Collapsing

The data is brutal: long-context scores plummeted from 62.3 to 40.4, a drop of 35.2%. Even worse is the stability metric, falling from 52.8% to 32.2%, while availability crashed from 100% to 65%. This can no longer be explained away as "performance fluctuation"—this is a systemic collapse.
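The percentages here are plain relative-change arithmetic. A minimal Python check, using only the before/after figures quoted above (the function and key names are mine, chosen for illustration):

```python
def relative_drop(before: float, after: float) -> float:
    """Return the fall from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100


# (before, after) pairs as quoted in this article.
metrics = {
    "long_context_score": (62.3, 40.4),
    "stability_pct": (52.8, 32.2),
    "availability_pct": (100.0, 65.0),
}

# Round to one decimal place, matching the precision used in the text.
drops = {name: round(relative_drop(b, a), 1) for name, (b, a) in metrics.items()}

for name, drop in drops.items():
    print(f"{name}: -{drop}%")
```

The same formula puts the relative fall in stability at about 39.0% and in availability at 35.0%, so all three metrics lost more than a third of their earlier value.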

Looking at the 5 questions that completely failed: root cause analysis, Breaking Changes list, cost calculation, growth analysis, board meeting topics—all high-value tasks requiring deep understanding of long texts. And GPT-4o's performance? It was throttled by its own rate limiting system before it could even finish reading the questions.

The most ironic part? The error message says "Please try again in 824ms". The account was less than a second away from having enough quota, and the request was still rejected. What level of resource scarcity is this?
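For readers who want to see what acting on that hint looks like, here is a minimal Python sketch of a client that reads the retry delay out of the error text and waits before trying again. Everything in it is an assumption for illustration: the sample string is stitched together from the two fragments quoted in this article (the real message may carry more fields), `RateLimited` is a hypothetical stand-in for whatever exception a real client raises, and `send` is any function that performs one request.

```python
import re
import time
from typing import Callable, Optional

# Assembled from the two fragments quoted in this article; the exact
# layout of the full message is an assumption.
SAMPLE_ERROR = (
    "Rate limit reached for gpt-4o in organization org-5kL87cAHHWwzzzRXfZoA5jZm "
    "on tokens per min (TPM): Limit 30000. Please try again in 824ms."
)


class RateLimited(Exception):
    """Hypothetical stand-in for the error a real client library raises."""


def parse_retry_delay_seconds(message: str) -> Optional[float]:
    """Return the 'try again in ...' hint in seconds, or None if absent."""
    match = re.search(r"try again in (\d+(?:\.\d+)?)(ms|s)\b", message)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    return value / 1000 if unit == "ms" else value


def call_with_retry(
    send: Callable[[], str],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `send`, waiting for the hinted delay after each rate-limit error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except RateLimited as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            delay = parse_retry_delay_seconds(str(err))
            # No hint in the message: fall back to 1s, 2s, 4s, ...
            sleep(delay if delay is not None else float(2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")
```

Whether the benchmark harness retried at all is not visible in the logs I have; the sketch only shows that the hint in the message is machine-readable and short.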

30000 TPM: An Embarrassingly Low Number

Let's do the math. What does 30000 tokens per minute mean? Using GPT-4's tokenizer as a yardstick, at roughly 1.5 tokens per Chinese character, it works out to about 20,000 Chinese characters per minute. For a model that claims to revolutionize knowledge work, this limit is simply a joke.

A standard corporate annual report easily exceeds 100,000 characters, and a software project's codebase can surpass millions of tokens. If even basic document analysis gets rate-limited, what's the point of GPT-4o's "long-context capability"?
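To make the arithmetic concrete, here is a small Python sketch. The 1.5 tokens-per-character ratio is an assumption: it is simply what the figures in this article imply (30000 tokens for about 20,000 characters), and a real tokenizer will give different ratios for different texts. The function names are mine.

```python
TPM_LIMIT = 30_000        # tokens per minute, from the error message
TOKENS_PER_CHAR = 1.5     # assumption implied by 30000 tokens ~ 20,000 characters


def chars_per_minute(tpm_limit: int, tokens_per_char: float) -> float:
    """How many characters fit into one minute of token budget."""
    return tpm_limit / tokens_per_char


def minutes_to_process(num_chars: int, tpm_limit: int, tokens_per_char: float) -> float:
    """Minutes of token budget needed just to send `num_chars` of text."""
    tokens = num_chars * tokens_per_char
    return tokens / tpm_limit


budget_chars = chars_per_minute(TPM_LIMIT, TOKENS_PER_CHAR)

# The 100,000-character annual report mentioned in the text.
report_minutes = minutes_to_process(100_000, TPM_LIMIT, TOKENS_PER_CHAR)

print(f"characters per minute: {budget_chars:.0f}")
print(f"minutes of budget for a 100,000-character report: {report_minutes:.1f}")
```

Under that assumption, the annual report alone would consume five full minutes of the account's token budget before a single word of output is counted.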

Even more absurd is that each of these failed requests only needed 500-800 tokens—not even 1K. This indicates the system is already operating at its limits, where any minor request could be the straw that breaks the camel's back.
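A rough way to picture the "last straw" effect is a sliding one-minute token window. The Python sketch below is a generic illustration, not a description of how OpenAI actually meters usage, which the logs do not reveal. The class name, the 29,500 tokens of earlier traffic, and the timestamps are all hypothetical.

```python
from collections import deque


class TokenWindow:
    """Track tokens spent in the trailing window and reject overshoots."""

    def __init__(self, limit: int, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, tokens) pairs, oldest first
        self._used = 0

    def _expire(self, now: float) -> None:
        # Drop spending that has aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, tokens = self._events.popleft()
            self._used -= tokens

    def try_consume(self, tokens: int, now: float) -> bool:
        """Return True and record the spend if it fits, else False."""
        self._expire(now)
        if self._used + tokens > self.limit:
            return False
        self._events.append((now, tokens))
        self._used += tokens
        return True


window = TokenWindow(limit=30_000)

# Hypothetical earlier traffic that nearly fills the window.
earlier_ok = window.try_consume(29_500, now=0.0)

# An 800-token request ten seconds later no longer fits.
small_request_ok = window.try_consume(800, now=10.0)

# Once the earlier spend ages out, the same request goes through.
later_request_ok = window.try_consume(800, now=60.0)

print(earlier_ok, small_request_ok, later_request_ok)
```

In this toy setup the 800-token request is rejected not because it is large, but because the window was already almost full when it arrived.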

OpenAI's Computing Power Dilemma

This incident doesn't expose GPT-4o's capability issues, but rather OpenAI's deeper dilemma:

  • Imbalance between user growth and infrastructure: ChatGPT weekly active users have exceeded 200 million, but backend resources clearly can't keep up
  • The cost control dilemma: Either throttle and sacrifice user experience, or burn money on expansion and wreck finances
  • Technical debt coming due: Infrastructure debt from rapid iteration is now erupting all at once

Interestingly, during this collapse, programming capability actually improved by 29.2 points. What does this tell us? It suggests OpenAI might be reallocating resources, prioritizing short-text, high-frequency scenarios at the expense of long-text processing capabilities.

This Is Just the Beginning

If you think this is just an isolated technical glitch, you're being naive. The organization ID in the error message (org-5kL87cAHHWwzzzRXfZoA5jZm) identifies the account that sent the requests, and I suspect it is an OpenAI internal or important partner test account. If OpenAI can't even guarantee service at that level, imagine the experience for regular users.

The deeper issue is: when model capability improvements outpace infrastructure expansion, collapse is inevitable. OpenAI has not published GPT-4o's parameter count, but if its computational demands have grown relative to GPT-4, as I suspect, then OpenAI's GPU cluster expansion clearly hasn't kept pace.

This reminds me of 2022 when ChatGPT first went viral, and OpenAI CEO Sam Altman apologized on Twitter: "We're working to add more capacity." Two years later, the capacity problem hasn't been solved—it's gotten worse.

A Warning to the Industry

This incident sounds the alarm for all AI companies:

  • Don't obsess over "bigger and stronger"—without infrastructure to support it, even the most powerful model is a castle in the air
  • Long context is AI's litmus test—fail to handle it well and you're just selling dreams
  • Stability and availability are the foundation of commercialization—fancy tricks can't save a product

When the tide goes out, we don't just see who's swimming naked—we see whose pool has run dry. GPT-4o's collapse exposes the AI industry's biggest lie: we're much further from truly usable AI than we imagine.


Data source: YZ Index | Run #37