11 AIs Tackle the Same Logic Puzzle, 3 Failures Expose Reasoning Black Holes

Mar 21, 2026 - Winzheng Index

Tags: DeepSeek, Grok, Logical Reasoning, Model Evaluation, Cognitive Blind Spots

When I saw DeepSeek V3's answer to this question, my first reaction was that the testing system had a bug. This is a model that claims "reasoning capabilities comparable to GPT-4" - how could it fail on such a basic logic puzzle? However, after repeated verification, the cruel truth was undeniable: it wasn't a bug, it truly couldn't solve it.

How simple is this puzzle? Five people to rank under 4 constraints - any human with a middle school education could deduce the answer in 2 minutes. The correct answer is: A, D, C, B, E. The logic chain is short: C is fixed at 3rd place; A>B and A>D>E put A ahead of everyone still to be placed, so A is 1st; E trails D and B is not last, so E is 5th; D and B take the two remaining slots, 2nd and 4th, in the reference answer.
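The logic chain can be checked mechanically with a brute-force search over all 120 possible orderings. This is a minimal sketch built only from the four constraints as paraphrased in this article, not from the original prompt; `satisfies` and `solve` are hypothetical names chosen for illustration.

```python
from itertools import permutations

PEOPLE = "ABCDE"

def satisfies(order):
    """Check one ranking (1st to 5th) against the constraints as paraphrased here."""
    pos = {p: i + 1 for i, p in enumerate(order)}
    return (
        pos["C"] == 3                        # C is fixed at 3rd place
        and pos["A"] < pos["B"]              # A ranks ahead of B
        and pos["A"] < pos["D"] < pos["E"]   # A ahead of D, D ahead of E
        and pos["B"] != 5                    # B is not last
    )

def solve():
    """Return every ordering of the five people that passes all checks."""
    return ["".join(o) for o in permutations(PEOPLE) if satisfies(o)]

if __name__ == "__main__":
    for ranking in solve():
        print(", ".join(ranking))
```

With the constraints as paraphrased here, the search returns two orderings, A, B, C, D, E and A, D, C, B, E, so the full prompt in the raw data presumably carries a further condition that places D ahead of B.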

The Most Absurd Error: Grok Interpreted the Question as Alphabetical Sorting

Grok 3's answer made me wonder if it was intentionally joking: A, B, C, D, E. The only "logic" in this answer is alphabetical order. For the latest model from Musk's xAI, this performance is catastrophic. The answer shows no sign of engaging with the constraints in the question, as if saying: "Who cares about logical reasoning, alphabetical order is justice."

This exposes a terrifying problem: Grok might not have understood this was a reasoning puzzle at all. Upon seeing the five letters A, B, C, D, E, it directly triggered some sort of "sorting mode" and output the laziest answer. If this counts as AI, then my Excel spreadsheet should also be considered artificial intelligence.

DeepSeek's Collective Failure: Why Do Powerful Models Stumble on Simple Questions?

Even more puzzling is the performance of DeepSeek V3 and R1. They gave the same incorrect answer: A, D, E, C, B. Where's the error? They placed E in 3rd position, completely ignoring the most explicit condition that "C is 3rd." The same answer also puts B last, breaking the "B is not last" condition.

Analyzing DeepSeek's error carefully, I noticed something interesting: both models correctly identified the A>D>E ordering but experienced a "cognitive disconnect" when handling C's fixed position. This exposes a common problem in current AI: when processing multiple constraints, models may develop "selective blindness," prioritizing inference chains while forgetting the most basic hard constraints.
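The failure can be made concrete by listing exactly which constraints a given answer breaks. The sketch below assumes the four constraints as paraphrased in this article; `violations` is a hypothetical helper, not part of any test harness used for the evaluation.

```python
def violations(order):
    """Return the names of the paraphrased constraints that a ranking breaks."""
    pos = {p: i + 1 for i, p in enumerate(order)}
    rules = [
        ("C is 3rd", pos["C"] == 3),
        ("A ahead of B", pos["A"] < pos["B"]),
        ("A ahead of D ahead of E", pos["A"] < pos["D"] < pos["E"]),
        ("B is not last", pos["B"] != 5),
    ]
    return [name for name, ok in rules if not ok]

print(violations("ADECB"))  # DeepSeek's answer -> ['C is 3rd', 'B is not last']
print(violations("ADCBE"))  # reference answer  -> []
```

The inference chain A>D>E survives intact in DeepSeek's answer; the two constraints it breaks are the positional ones.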

This reminds me of the human "tunnel vision" phenomenon - when we focus too much on a complex problem, we might overlook the most obvious facts. AI models seem to have inherited this trait of being "too clever for their own good."

The 8 Models That Got It Right: Who's Really "Thinking"?

Among the 8 models that answered correctly, Claude Sonnet 4.6 and Claude Opus 4.6 performed the best. They not only provided the correct answer but also demonstrated complete reasoning processes. The Claude series particularly stood out by explicitly pointing out the crucial reasoning step that "B must be in 4th place" - something other models didn't clearly state.

ERNIE Bot 4.0, Gemini 2.5 Pro, GPT-4o, and Qwen Max all gave correct answers but with relatively simple reasoning processes. Doubao Pro performed adequately, answering correctly but without showing its thinking process. The latest GPT-o3 (presumably OpenAI's o3 reasoning model, the successor to o1) also answered correctly but was similarly terse.

From this distribution, we can see that OpenAI and Anthropic models indeed excel at logical reasoning, while among Chinese models, ERNIE Bot and Qwen also performed quite reliably.

Three Major AI Weaknesses Exposed by This Puzzle

First, imbalanced constraint satisfaction capabilities. The more explicit and simple the constraint (like "C is 3rd"), the more likely some models are to ignore it. This might be because models prioritize learning complex reasoning chains during training while giving insufficient weight to simple facts.

Second, fragility of reasoning. This puzzle has only 5 elements and 4 constraints, yet 3 of the 11 models (27%) failed. If extended to more complex real-world scenarios, such as project scheduling or resource allocation involving dozens of variables, AI reliability would be severely compromised.

Third, unpredictability of errors. DeepSeek V3 performs excellently on many complex tasks yet failed on this simple puzzle. This "strength-weakness inversion" phenomenon shows that we still cannot accurately predict where AI will fail, which is a huge risk for critical business applications.

Implications for AI Applications

This test sounds an alarm for all AI application developers: don't assume that because a model performs excellently on complex tasks, it will be infallible on simple ones. When designing AI systems, consider the following:

1. Cross-validate critical decision results, preferably using multiple models
2. Design "sanity check" mechanisms for AI systems to catch obvious logical errors
3. In scenarios involving hard constraints, consider using rule engines rather than pure AI reasoning
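The three recommendations above can be combined in a few lines: a rule check acts as the sanity gate, and a vote across models provides the cross-validation. This is a minimal sketch under stated assumptions: the model names and sample answers are placeholders, the constraints are the ones paraphrased in this article, and `is_valid` and `cross_validate` are hypothetical helpers rather than any vendor's API.

```python
from collections import Counter

PEOPLE = "ABCDE"

def is_valid(order):
    """Sanity check: a well-formed ranking that meets every hard constraint."""
    if sorted(order) != list(PEOPLE):
        return False  # not a permutation of the five people
    pos = {p: i + 1 for i, p in enumerate(order)}
    return (
        pos["C"] == 3
        and pos["A"] < pos["B"]
        and pos["A"] < pos["D"] < pos["E"]
        and pos["B"] != 5
    )

def cross_validate(answers):
    """Return (consensus, rejected).

    consensus: the rule-passing ranking backed by a strict majority of all
               models, or None if no ranking reaches that bar.
    rejected:  sorted names of models whose answers failed the rule check.
    """
    rejected = sorted(m for m, a in answers.items() if not is_valid(a))
    valid = [a for a in answers.values() if is_valid(a)]
    if not valid:
        return None, rejected
    best, votes = Counter(valid).most_common(1)[0]
    consensus = best if votes > len(answers) / 2 else None
    return consensus, rejected

# Placeholder inputs: two models agree, one repeats the constraint-breaking answer.
sample = {"model_1": "ADCBE", "model_2": "ADECB", "model_3": "ADCBE"}
print(cross_validate(sample))  # ('ADCBE', ['model_2'])
```

The rule check runs before the vote on purpose, so an answer that breaks a hard constraint can never win by popularity alone.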

The deeper issue is that current AI training methods may have fundamental flaws. Massive parameter counts and computational power don't guarantee that models truly understand logical rules. The failure of 3 models in this test may signal that large model development needs a paradigm shift - from simply pursuing parameter scale to improving reasoning reliability and consistency.

If AI can get the ranking of 5 people wrong, why should we trust it to correctly handle autonomous driving, medical diagnosis, or financial decisions? This isn't a technical issue; it's a trust issue.

Data source: YZ Index | Run #33