Doubao Pro Scores Zero on a Question It Once Aced: Why AI Models Collectively Fall Silent During Real Security Incidents

Mar 21, 2026 | Source: Winzheng Index

Doubao Pro | Engineering Judgment | Security Incident Response | AI Evaluation | Technical Decision-Making

On a security response question that previously earned it perfect marks, Doubao Pro scored zero this time. More disturbing, the model's response looked professional on the surface, yet it exposed a fatal flaw in how AI handles critical decision-making moments.

From 100 to 0: A Textbook Case of Misjudgment

First, the background to the question: you're an operations engineer at an e-commerce company. At 3 AM, you receive an alert about abnormal CPU spikes on your servers and discover unknown processes consuming massive resources. This is a typical security incident scenario, and it tests whether an AI can make the correct emergency response decisions.

Doubao Pro's answer looks professional at first glance: "First, do not arbitrarily terminate abnormal processes or restart servers to avoid destroying the intrusion scene or interrupting core business. Immediately collect and preserve complete scene evidence in read-only mode, including process lists, network connections, system logs, abnormal process memory images..."

So why did this answer receive zero points? Because it made a fatal error of principle.

Preserving the Scene vs. Stopping the Loss First: AI Chose the Wrong Side

In real security incidents, there's an iron rule: damage control always takes priority over evidence collection. When servers are already compromised and abnormal processes might be stealing data, planting backdoors, or using the system as a springboard to attack others, every second of delay could cause irreversible damage.

Doubao Pro's answer violated exactly this principle. It prioritized "scene preservation," advising the engineer not to "arbitrarily terminate abnormal processes" and to collect evidence "in read-only mode." This kind of academic thinking is catastrophic in real scenarios.
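The priority rule described above can be sketched as a simple ordering problem. The snippet below is a minimal illustration in Python, not a real incident-response tool: the `Phase` and `Action` types, the `order_actions` helper, and the specific action names are all hypothetical, chosen only to show containment steps being scheduled ahead of evidence collection regardless of the order in which they were proposed.

```python
from dataclasses import dataclass
from enum import IntEnum


class Phase(IntEnum):
    """Hypothetical response phases; a lower value means it runs earlier."""
    CONTAIN = 1   # stop the loss
    PRESERVE = 2  # collect and preserve evidence


@dataclass(frozen=True)
class Action:
    name: str
    phase: Phase


def order_actions(actions):
    """Sort actions so containment always precedes evidence collection.

    sorted() is stable, so actions within the same phase keep the
    order in which they were proposed.
    """
    return sorted(actions, key=lambda action: action.phase)


# Illustrative actions only, listed evidence-first as in the zero-score answer.
proposed = [
    Action("capture process list and network connections", Phase.PRESERVE),
    Action("dump memory of the unknown process", Phase.PRESERVE),
    Action("isolate the host from the network", Phase.CONTAIN),
    Action("terminate the unknown process", Phase.CONTAIN),
]

for step, action in enumerate(order_actions(proposed), start=1):
    print(step, action.phase.name, action.name)
```

Under these assumptions, the evidence-gathering steps are not discarded; they are simply pushed behind the steps that stop the damage.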

A senior security expert commented: "If someone on my team were still debating whether to preserve the scene after discovering an intrusion, I'd immediately remove them from frontline duties. This isn't a CSI crime scene - this is an ongoing cyberattack."

AI's "Curse of Knowledge": Theoretically Correct but Detached from Reality

The logic of Doubao Pro's response shows that it clearly possesses extensive theoretical knowledge of security response: evidence preservation, process analysis, log collection, memory imaging... These are indeed standard procedures for handling security incidents. The problem is that the AI didn't understand the priorities of these procedures or the scenarios in which they apply.

The deeper issue is that AI may have over-learned "idealized" security response procedures during training, while lacking understanding of real-world complexity:

What does a 3 AM alert mean? The attacker chose the timing deliberately.
What are an e-commerce company's core assets? User data and payment information, where a breach is unacceptable.
What does an already-running unknown process mean? Defenses have been breached; it's time for damage control.

Not Just Doubao Pro: Systemic Flaws in AI Decision-Making

This incident reflects a widespread problem with current AI in critical decision scenarios. According to the latest evaluation data, Doubao Pro improved in other dimensions: programming ability rose by 2 points and knowledge work ability surged by 7.9 points, but it stumbled precisely in a scenario requiring situational judgment.

This isn't an isolated case. The trend we observe is: AI grows increasingly strong at tasks with standard answers (programming, knowledge Q&A), but remains fragile in scenarios requiring trade-offs and rapid decisions. This "high scores, low capability" phenomenon should alarm the entire industry.

Engineering Judgment: AI's Final Weakness?

Doubao Pro's failure offers an important insight: engineering judgment might be the dividing line between AI tools and AI assistants. A qualified AI assistant needs not just knowledge, but the ability to make correct decisions under pressure.

According to the evaluation data, Doubao Pro's stability score is only 48.2, the lowest among all dimensions. This indicates that the model's performance fluctuates greatly when facing non-standardized, high-pressure scenarios. Today it's security incident response, tomorrow it might be production incident handling, and the day after it might be business decisions - none of these scenarios call for rote knowledge; they all call for living judgment.
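One way to make "fluctuates greatly" concrete is to look at the spread of scores on the same question across runs. The article does not say how the index computes its stability score, so the sketch below is only an assumption for illustration: it uses the population standard deviation, and `score_spread` is a hypothetical helper, not the index's formula. The two scores are the 100 and 0 reported above for the security response question.

```python
from statistics import mean, pstdev


def score_spread(scores):
    """Population standard deviation of one question's scores across runs.

    Assumption for illustration only; this is not the index's stability formula.
    """
    return pstdev(scores)


# The two results reported for the security response question.
runs = [100, 0]

print("mean score:", mean(runs))
print("spread:", score_spread(runs))
```

A model that scored the same on every run would have a spread of zero under this measure; swinging between a perfect score and zero is the largest spread two runs on a 0 to 100 scale can produce.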

Worryingly, if AI continues optimizing along the path of "exam-oriented education," we might get a batch of "high-scoring but incompetent" models: they can perfectly answer standard questions but frequently fail in real-world complex decisions.

When AI faced its moment of truth, it chose to protect evidence rather than protect the system - this mistake might be more common and more dangerous than we imagine.

Data source: YZ Index | Run #33