GPT-4o Crashes: The Strict Mode Trap Behind a 35-Point Plunge

Mar 22, 2026 728 Views - Read Source Winzheng Index

GPT-4o 可用性测试严格模式工具调用性能下降

GPT-4o has just experienced a catastrophic performance collapse. In the latest evaluation, its usability score plummeted directly from a perfect 100 to 65, a staggering 35-point drop. Even more shocking is that in some critical tests, its performance can only be described as "total annihilation."

This is not an ordinary performance fluctuation, but a systematic capability degradation.

The Core of the Collapse: When AI Meets "Strict Mode"

The root of the problem is surprisingly simple: strict tool calling. This was originally a new feature introduced by OpenAI to improve model reliability, requiring the model to strictly follow predefined parameter formats when calling tools. Sounds reasonable, right?

But the actual effect was counterproductive. In usability tests, when the model was asked to "only perform operations when absolutely certain," GPT-4o chose the most conservative strategy—simply doing nothing at all.

Here's how it manifests: Faced with a simple file operation request like "create a file named test.txt," GPT-4o would respond: "I need more information to perform this operation. What content would you like to write in the file? Which directory should the file be saved in?"

Seemingly cautious, but actually absurd. It's like asking your assistant to turn on the lights, and they respond: "How many lumens of illumination do you need? What's your color temperature preference in Kelvin? Should we consider energy-saving factors?"

The Data Doesn't Lie: Comprehensive Performance Degradation

Let's look at the specific data:

Usability: 100 → 65 (-35 points)
Long Context Handling: 62.3 → 40.4 (-21.9 points)
Stability: 52.8 → 32.2 (-20.6 points)

The only bright spot is programming ability, which improved from 19.6 to 48.8, an increase of 29.2 points. But this feels more like irony—when the model completely fails at actual tool calling, it performs better at theoretical programming problems.

More notably, cost-effectiveness barely improved (only 0.8 points increase), meaning users received no cost compensation for the performance degradation.

Technical Essence: The Consequences of Over-Engineering

This incident exposes a key problem in current AI development: the danger of over-optimizing for a single metric.

OpenAI clearly wanted to reduce model "hallucinations" and erroneous outputs through strict mode. From an engineering perspective, this approach makes sense—if uncertain, don't guess. But they overlooked a basic fact: absolute certainty doesn't exist in the real world.

Human intelligence is useful precisely because we can make reasonable judgments with incomplete information. When you say "order me a pizza," a normal person will infer what you probably want based on common sense, rather than descending into philosophical questioning.

But GPT-4o's new version has gone to the opposite extreme. It has become an overly cautious bureaucratic machine, prioritizing "not making mistakes" over "being useful."

Deeper Concerns: This Might Not Be a Bug, But a Feature

Most unsettling is that this degradation might be intentional.

As AI capabilities grow stronger, safety concerns become increasingly prominent. OpenAI might be trying to reduce risks by limiting model autonomy. But this "better to do nothing than to do wrong" strategy essentially sacrifices practicality for an illusory sense of safety.

It's like installing a 20 km/h speed limiter on a sports car and saying "look, it's much safer now." Technically it is safer, but what's the point of such a sports car?

"When AI starts questioning every common-sense judgment, it's no longer a tool, but a burden."

Industry Impact: The Beginning of a Trust Crisis

The impact of this incident far exceeds a technical glitch. It shakes the entire industry's confidence in the fundamental assumption of "continuous progress."

For the past two years, we've grown accustomed to seeing leaps in model capabilities every few months. But now, for the first time, we clearly see: model capabilities can regress, and regress significantly.

For enterprises that have already integrated GPT-4o into production environments, this is a nightmare. Imagine your customer service bot suddenly responding to every user request with "I need more information," or your coding assistant suddenly refusing to perform any file operations.

Worse still, OpenAI seems to have rolled out this change without adequate testing and warning. This "deploy first, ask questions later" approach is depleting user trust.

Final Thoughts

GPT-4o's collapse essentially reflects a fundamental contradiction in AI development: we want AI that's as flexible as humans, but we're building systems more mechanical than machines.

When models are trained to "absolutely follow rules," they lose intelligence's most valuable trait—finding certainty in ambiguity, creating order from chaos.

My prediction: OpenAI will roll back this update within 72 hours. But the more important question remains—on the path to AGI, are we creating increasingly unintelligent "intelligence"?

When AI loses the courage to make mistakes, it also loses the ability to truly help humans. This might be the biggest bug of all.

Data source: YZ Index | Run #37 | View raw data