11 AIs Answer the Same Question, 7 Fail: Who's Pretending to Be Smart?

Mar 21, 2026 - Source: Winzheng Index

DeepSeek, Claude, Security Incident Response, Engineering Judgment, Model Evaluation

If your website is leaking user data, would you first call a meeting or pull the plug? This seemingly simple multiple-choice question tripped up 7 top AI models.

We tested 11 mainstream AI models with a real engineering scenario: "After login, users are seeing other people's order data. Customer service has confirmed it's reproducible. What should be your first step?" The results were shocking: 7 of the 11 models (about 64%) either chose to "report first, then handle" or gave reporting equal weight, rather than immediately stopping the bleeding.
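The headline figure can be checked with a quick tally of the per-model scores reported in the sections that follow. This is a minimal sketch: it assumes "perfect marks" means 100 points and that only a perfect score counts as a pass, neither of which the source states explicitly.

```python
# Scores as reported in this article.
# Assumption: "perfect marks" means 100 points.
scores = {
    "Doubao Pro": 100, "DeepSeek V3": 100, "DeepSeek R1": 100, "Grok 3": 100,
    "GPT-4o": 80, "GPT-o3": 80, "Claude Sonnet": 80,
    "ERNIE Bot 4.0": 0, "Gemini 2.5 Pro": 0, "Claude Opus": 0, "Qwen Max": 0,
}

PASS_MARK = 100  # assumption: only a perfect score counts as a pass

# A model "fails" if it did not put containment unambiguously first.
failed = [model for model, score in scores.items() if score < PASS_MARK]
fail_rate = len(failed) / len(scores)

print(f"{len(failed)} of {len(scores)} failed ({fail_rate:.0%})")
# prints "7 of 11 failed (64%)"
```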

Chinese Models Lead the Perfect Scores

Most surprisingly, the best performers in this test came largely from the Chinese model camp. Doubao Pro, DeepSeek V3, and DeepSeek R1 all scored perfect marks, as did Grok 3 from US-based xAI, with remarkably consistent answers: immediately take the service offline, isolate the affected systems, and block the leak.

"The first step is to immediately take offline the relevant functional services involving user authentication and order queries, blocking the path for further exploitation of the vulnerability."—Doubao Pro's answer was textbook perfect.

In contrast, the performance of top Western models such as Gemini 2.5 Pro and Claude Opus was disappointing. Both chose to "first report to the technical team and security officer," as if, in the face of a real data leak, process mattered more than stopping the bleeding.
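The "contain first, then notify" ordering that the perfect-score answers describe can be sketched as follows. Everything here is hypothetical and for illustration only: the `KillSwitch` class, the service names `user_auth` and `order_query`, and the in-memory flag store are assumptions, not part of the test or of any model's answer.

```python
from dataclasses import dataclass, field


@dataclass
class KillSwitch:
    """Hypothetical in-memory feature flags; a real system would use a shared store."""
    disabled: set = field(default_factory=set)

    def disable(self, service: str) -> None:
        self.disabled.add(service)

    def is_disabled(self, service: str) -> bool:
        return service in self.disabled


def handle_order_query(switch: KillSwitch, user_id: str) -> tuple:
    """Return (status_code, body); refuse to serve while the service is switched off."""
    if switch.is_disabled("order_query"):
        return 503, "Service temporarily unavailable"
    return 200, f"orders for {user_id}"


def respond_to_leak(switch: KillSwitch, timeline: list) -> None:
    # Step 1: contain. Stop serving the leaking paths before anything else.
    for service in ("user_auth", "order_query"):
        switch.disable(service)
        timeline.append(f"disabled {service}")
    # Step 2: only then notify the people who will investigate and fix.
    timeline.append("notified technical team and security officer")


switch = KillSwitch()
timeline = []
respond_to_leak(switch, timeline)
print(handle_order_query(switch, "alice"))
print(timeline)
```

The point of the sketch is the ordering inside `respond_to_leak`: notification still happens, but only after the leaking paths have stopped serving requests.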

The 80-Point Middle Ground: Understanding Without Fully Getting It

GPT-4o, GPT-o3, and Claude Sonnet scored 80 points, mentioning "system suspension" but giving "team notification" equal priority. This "have your cake and eat it too" response reveals a fuzzy sense of emergency priorities.

In real security incidents, every second could mean more user data being leaked. Spend 5 minutes writing an email report, or 5 seconds shutting down the service? This isn't a choice that needs deliberation.

Common Features of Zero-Score Answers

ERNIE Bot 4.0, Gemini 2.5 Pro, Claude Opus, and Qwen Max all scored zero, with three fatal flaws in their responses:

Wrong priorities: Putting "notification" and "reporting" first
Lack of urgency: Using delay-prone phrases like "ensure notification" and "require them to"
Responsibility shifting: Pushing decision-making to "technical teams" or "security officers"
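The three score tiers described in this article (100, 80, and 0) can be expressed as a small rubric over what an answer puts in its first step. This is a hedged sketch only: the source does not describe how YZ Index actually graded answers, and the action labels and the `score_first_step` function are hypothetical.

```python
# Hypothetical action labels; the real grading criteria are not given in the source.
CONTAIN = {"take_offline", "isolate", "block"}
NOTIFY = {"notify", "report", "escalate"}


def score_first_step(first_step: set) -> int:
    """Score an answer by the actions it places in its first step."""
    contains = bool(first_step & CONTAIN)
    notifies = bool(first_step & NOTIFY)
    if contains and not notifies:
        return 100  # containment comes first, on its own
    if contains and notifies:
        return 80   # containment mentioned, but given equal weight with notification
    return 0        # no containment in the first step


print(score_first_step({"take_offline", "isolate"}))  # 100
print(score_first_step({"take_offline", "notify"}))   # 80
print(score_first_step({"notify", "report"}))         # 0
```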

Qwen Max's answer was particularly absurd: "Immediately notify the technical team and require them to urgently fix this security vulnerability"—completely putting the cart before the horse. Do we need AI to tell us whether to stop the bleeding first or find a doctor?

Why Did Western AI Collectively Drop the Ball?

Three deeper reasons might explain this phenomenon:

1. Training data bias: Western AI may have been exposed more to standardized process documents from large corporations, emphasizing "compliance" over "emergency response." Chinese AI training data might contain more real-world cases.

2. Cultural differences: Western corporate culture emphasizes procedural justice, while Chinese internet companies stress rapid response. This difference may be deeply embedded in AI's "DNA."

3. Understanding of responsibility: Chinese AI seems to better understand the concept of "first responder"—solve the problem first, not find someone to blame.

This Is More Than Just a Test Question

The results reveal a dangerous trend: as AI models increasingly participate in critical decision-making, their judgment biases could have catastrophic consequences.

Imagine if a security team relying on AI-assisted decision-making adopted those zero-score suggestions during a real data breach. The consequences would be unthinkable. Every additional minute of data leakage could mean privacy exposure for tens of thousands of users, millions in fines, and irreparable reputation damage.

More ironically, several of the models with the strongest benchmark results, the most parameters, and the highest valuations failed this test of basic engineering judgment. It is one more sign that parameter count isn't intelligence, and memorizing facts isn't the same as knowing how to act.

If they can't even decide between "put out the fire or call the fire department first," companies entrusting their fate to AI might need to reconsider. After all, in critical moments, you don't need a consultant who can hold meetings—you need an engineer who can pull the plug.

Data source: YZ Index | Run #33