11 AIs Answer the Same Question, 10 Are Playing Dumb: Why Did Doubao Get a Perfect Score?

Mar 21, 2026 | Winzheng Index

Doubao, DeepSeek, Engineering Thinking, Model Evaluation, Knowledge Work

When I saw the responses from these 11 AI models, my first reaction was shock—faced with a real workplace scenario involving contradictory information, 10 models actually chose to "muddle through." What's even more disturbing is that among these models giving perfunctory answers were industry leaders like GPT-4o, Claude, and DeepSeek V3.

One Question Reveals the Truth: Most AIs Only Speak "Correct Nonsense"

The question was simple: two documents show different memory configurations for the same server—one says 32GB, the other says 64GB. You need to confirm the actual configuration for capacity planning. This is a routine scenario any operations engineer might encounter.

The responses from 10 models were remarkably similar: "Contact the operations team to verify," "Log into the server to check," "Execute the free -h command." Yes, these answers are all "correct," but they're also useless. It's like answering "What should I do when I'm hungry?" with "Eat food"—correct but unhelpful.

What made Doubao Pro's perfect-scoring answer different?

"Prioritize verifying the server's actual configuration: For physical servers, directly log into the operating system and execute commands like dmidecode -t memory, free -h to query actual memory capacity; for cloud servers, besides checking within the system, also verify against the corresponding cloud platform console's instance configuration parameters..."

Notice the difference? Doubao not only provided specific technical paths but also distinguished between different handling methods for physical and cloud servers. More crucially, it didn't stop at "just find the data and you're done," but explicitly proposed follow-up actions:

"After obtaining the actual values, coordinate with operations and procurement teams to understand the reasons for the discrepancy between the two documents, correct the documentation, and prevent similar information conflicts in the future."

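The verification step in the quoted answer can be sketched in Python. This is a minimal illustration, not part of the test or of any model's response. It parses MemTotal from /proc/meminfo-style text (the figure free -h reports as total) and picks the documented size it is closest to. The function names, the 10% tolerance, and the sample figure are assumptions made for the example, and the documented "GB" values are treated as GiB. The kernel reserves part of physical RAM, so the reported total sits slightly below the installed capacity, which is why the comparison is approximate. dmidecode -t memory (usually requires root) reports the installed modules directly.

```python
import re
from typing import Optional, Sequence

KIB_PER_GIB = 1024 * 1024


def parse_memtotal_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, converted to GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) / KIB_PER_GIB


def closest_documented(
    actual_gib: float,
    documented_gib: Sequence[int],
    tolerance: float = 0.10,  # assumed margin for kernel-reserved memory
) -> Optional[int]:
    """Return the documented size nearest the measurement, or None if none is close."""
    best = min(documented_gib, key=lambda size: abs(size - actual_gib))
    return best if abs(best - actual_gib) / best <= tolerance else None


# Made-up sample; on a real Linux host, read open("/proc/meminfo").read() instead.
SAMPLE_MEMINFO = "MemTotal:       65779212 kB\nMemFree:        41234567 kB\n"

actual = parse_memtotal_gib(SAMPLE_MEMINFO)
print(f"Measured {actual:.1f} GiB; closest documented value: "
      f"{closest_documented(actual, [32, 64])}")
```

If neither documented value is within tolerance, the function returns None, which is itself a finding: both documents are wrong.
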
The Thinking Gap Behind Technical Details

Carefully analyzing these responses, I discovered an interesting pattern:

Perfunctory responses (DeepSeek V3, Wenxin Yiyan, etc.): Average word count under 20, only giving direction without methods
Surface-level efforts (Claude, GPT-4o): Seemingly detailed but essentially just breaking one action into four steps, still fundamentally "just check it"
True engineering thinking (Doubao Pro): Not only solving the immediate problem but also considering prevention mechanisms

This reminds me of a joke: junior programmers restart the server when encountering bugs, senior programmers check logs to locate issues, while architects ask "Why did this bug occur and how can we prevent it from happening again?"

In this question, Doubao Pro demonstrated architect-level thinking—it understood the meaning behind the "capacity planning" requirement. Capacity planning isn't a one-time query action but continuous work requiring reliable data sources. If document contradictions aren't resolved, you'll encounter the same problem again.
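
One way to picture the prevention step is a small reconciliation check that flags every document disagreeing with the verified value. The following is a hypothetical sketch: it assumes the documented figures have already been collected into a dictionary, and the document names and the find_stale_documents helper are invented for illustration.

```python
from typing import Dict, List


def find_stale_documents(documented_gb: Dict[str, int], verified_gb: int) -> List[str]:
    """Return names of documents whose recorded memory size differs from the verified one."""
    return sorted(name for name, size in documented_gb.items() if size != verified_gb)


# Hypothetical records for the server in the test question.
records = {"asset-inventory": 32, "procurement-order": 64}
print(find_stale_documents(records, verified_gb=64))  # prints ['asset-inventory']
```

Running a check like this on a schedule, rather than once, is what turns a one-off lookup into a reliable data source for capacity planning.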

Where Does AI Models' "Laziness" Come From?

Why would top AI models "slack off" on such a simple question? I believe there are three reasons:

1. Training data bias: Much of the Q&A data consists of "quick question, quick answer" formats, so models learned to give "politically correct" answers with minimal words.

2. Lack of real scenario understanding: Models might know the free -h command, but don't understand that in actual work, finding data is just the first step—establishing a reliable information management mechanism is more important.

3. Misleading evaluation metrics: If evaluations only check whether answers are "correct" rather than "useful," models naturally tend to give safe but empty responses.

This Isn't Just a Technical Issue, It's a Product Philosophy Problem

This test reveals a dangerous trend in the current AI industry: excessive focus on models' "IQ" (parameter count, benchmark scores) while neglecting "EQ" (ability to understand users' real needs).

Doubao Pro's excellent performance likely stems from ByteDance's deep product DNA. They're not building AI for the sake of building AI, but genuinely thinking: What kind of assistant do users need in their actual work?

This may also explain why a model with a very large parameter count, such as DeepSeek V3, did worse on this question: stacking parameters without optimizing the user experience risks producing a "high-IQ fool."

Final Thoughts

This test sounds an alarm for the entire AI industry: On the path to AGI, we may have forgotten the most basic thing—AI's value isn't in how smart it is, but whether it can truly help humans work better.

If a model can't even handle a simple document discrepancy, how can we talk about changing the world? As the AI arms race intensifies, perhaps what we need isn't another breakthrough in parameter counts but a return to user needs.

After all, a truly excellent AI should be like a reliable colleague, not a consultant who only speaks correct nonsense.

Data source: YZ Index | Run #33