11 AIs Answer Same Question: Doubao Scores 100, 8 Models Score 0

When I put the same engineering judgment question to 11 AI models, the results stunned me: Doubao Pro scored a perfect 100, while 8 models, including Claude Sonnet and GPT-4o, scored 0. This wasn't a case of botched arithmetic. They simply failed to grasp what the question was really asking.

One Question Reveals a World of Difference

The question was simple: Your mass messaging feature is malfunctioning, but sending to individual accounts works fine. What should be your first step?

This is a classic scenario every backend engineer encounters: a batch operation fails while the same operation on a single item succeeds. The key information is clear: individual sending works, mass sending fails. The fault therefore most likely lies in logic that only the batch path exercises.

Doubao Pro's answer was textbook perfect:

"The first step should be to collect context information from the user's mass sending operation: including the selected account combinations, mass message content, specific failure manifestations (complete/partial failure, any error messages), operation time, while retrieving backend operation logs from the corresponding period. Priority should be given to checking anomalies in mass-sending-specific logic like batch scheduling, multi-account validation, and bulk content processing, since normal individual sending has already ruled out issues with single account sending interfaces and permissions."

Note Doubao's approach: first collect specific operation context, not just vaguely "check logs"; explicitly identify the need to examine mass-sending-specific logic, because individual sending has already validated basic functionality. This is true engineering thinking.
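To make that first step concrete, here is a minimal Python sketch of what "collect context, then pull only the relevant logs" could look like. Everything in it is an assumption for illustration: the `MassSendReport` and `LogEntry` types, the stage names (borrowed from the wording of Doubao's answer), the log levels, and the ten-minute window are hypothetical, not part of any real system or of the benchmark.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical stage names, taken from the wording of the answer above.
# A real system would use whatever its own pipeline calls these steps.
BATCH_ONLY_STAGES = {
    "batch_scheduling",
    "multi_account_validation",
    "bulk_content_processing",
}


@dataclass
class MassSendReport:
    """Context collected from the user about one failed mass send."""
    accounts: list[str]
    message: str
    failure_mode: str  # "complete" or "partial"
    error_text: str
    occurred_at: datetime


@dataclass
class LogEntry:
    """One backend log line (illustrative shape)."""
    timestamp: datetime
    stage: str
    level: str
    text: str


def relevant_logs(report, logs, window=timedelta(minutes=10)):
    """Keep warnings and errors from batch-only stages near the failure time."""
    start = report.occurred_at - window
    end = report.occurred_at + window
    return [
        entry
        for entry in logs
        if start <= entry.timestamp <= end
        and entry.stage in BATCH_ONLY_STAGES
        and entry.level in {"WARN", "ERROR"}
    ]


# Made-up sample data, only to show the filter at work.
t0 = datetime(2025, 1, 1, 12, 0)
report = MassSendReport(["acct_a", "acct_b"], "hello", "partial", "send failed", t0)
logs = [
    LogEntry(t0, "send_api", "ERROR", "single-send path, already ruled out"),
    LogEntry(t0 + timedelta(minutes=1), "multi_account_validation", "ERROR", "rejected"),
    LogEntry(t0 + timedelta(hours=2), "batch_scheduling", "ERROR", "outside the window"),
    LogEntry(t0, "bulk_content_processing", "INFO", "not an error"),
]
hits = relevant_logs(report, logs)
print([entry.stage for entry in hits])  # ['multi_account_validation']
```

The point of the sketch is the filter, not the types: three of the four sample entries are discarded because they belong to the single-send path, fall outside the reported time, or are not errors at all.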

The Collective Failure of 8 Models

Now look at what four of those 8 zero-scoring models suggested:

  • Wenxin Yiyan: Check the code logic for batch sending
  • Claude Sonnet: I would first reproduce the issue the user encountered
  • GPT-4o: Check logs and error reports
  • Qwen Max: Collect specific user feedback

These answers sound reasonable, but they are correct and useless. "Check logs," "reproduce the issue," "examine code": aren't these what any engineer would do? The real questions are: which logs? Which scenarios to reproduce? Which part of the code?

More critically, these models completely ignored the key information in the question—individual sending works. Since individual sending is normal, why would you aimlessly "reproduce the problem"? It's like a doctor insisting on a full-body examination when the patient clearly states only their left leg hurts.

What the 20- and 60-Point Models Saw

Interestingly, DeepSeek V3, DeepSeek R1, and Claude Opus scored 20 points because they at least mentioned the keyword "concurrency." Gemini 2.5 Pro went further, scoring 60 points, because it specifically wanted to check logs for "that user's failed mass sending task" rather than logs in general.

But even 60-point Gemini lacked Doubao's systematic thinking: both collecting operation context and targeting batch-processing-specific logic. This structured problem decomposition ability is what distinguishes excellent engineers from average coders.

Why This Question Matters So Much

This isn't just an interview question. In real work, most bug investigations follow a similar pattern: using known information to narrow down the problem scope and find the most likely failure point, rather than searching for a needle in a haystack.
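The narrowing step can be written down almost literally as a set difference. The Python sketch below is only an illustration under stated assumptions: the component names are hypothetical, and it assumes the batch path reuses every component of the single-send path and adds a few of its own.

```python
# Hypothetical component names, for illustration only.
SINGLE_SEND_PATH = {
    "auth",
    "account_permissions",
    "content_validation",
    "send_api",
}

# Assumption: the batch path reuses the single-send path and adds batch-only steps.
BATCH_SEND_PATH = SINGLE_SEND_PATH | {
    "batch_scheduling",
    "multi_account_validation",
    "bulk_content_processing",
}


def narrow_suspects(failing_path, working_path):
    """Components used only by the failing path are the first suspects."""
    return sorted(failing_path - working_path)


suspects = narrow_suspects(BATCH_SEND_PATH, SINGLE_SEND_PATH)
print(suspects)
# ['batch_scheduling', 'bulk_content_processing', 'multi_account_validation']
```

This is a first-pass ordering, not a proof. A shared component can still misbehave only under batch load, so the components on the working path drop to the back of the queue rather than off the list.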

The capability Doubao Pro demonstrated is exactly what we need AI to possess: not mechanically executing commands, but truly understanding problem context and making reasonable inferences and judgments.

This test exposed a harsh reality: while large models are getting stronger at answering knowledge-based questions, most models still remain at a "looks professional" level when engineering judgment is required. They use the right terminology but fail to provide genuinely useful advice.

The Next Battlefield for Large Models

What does the collective failure of top models like GPT-4o, Claude Sonnet, Wenxin Yiyan, and Qwen Max on this question tell us?

Language capability improvements are approaching their ceiling; real differentiation will manifest in reasoning and judgment. Whoever can make AI think like experienced engineers will win the next phase of competition.

Doubao Pro's performance may signal that Chinese-developed large models are taking a different path: instead of joining an arms race over parameter count, they are building depth in specific professional domains. While other models compete on who can write more elegant essays, Doubao has started working on real problems.

I predict more similar "professional capability tests" will emerge in the coming year, and those AIs that can only recite information will quickly be eliminated by the market. After all, we don't need assistants who speak beautifully—we need partners who can truly solve problems.


Data source: YZ Index | Run #33