11 AI Models Attempt the Same Logic Puzzle: 5 Correct, 6 Wrong

This puzzle looks simple, but it is a direct probe of how well current large models handle multi-condition chain reasoning. The problem gives four constraints: A is better than B; C is third; D is better than E and worse than A; and B is not last. The answer scored as correct is A,D,C,B,E.
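A puzzle this small can be checked exhaustively, since there are only 120 possible orderings. The sketch below encodes the four constraints exactly as they are paraphrased in this article; the benchmark's full prompt is not reproduced here, so the encoding is an assumption, and the names `satisfies` and `solutions` are illustrative only.

```python
from itertools import permutations

def satisfies(order):
    """Return True if `order` (best to worst) meets the four constraints as paraphrased."""
    pos = {name: i for i, name in enumerate(order)}  # index 0 = best, 4 = last
    return (
        pos["A"] < pos["B"]                  # A is better than B
        and pos["C"] == 2                    # C is third
        and pos["A"] < pos["D"] < pos["E"]   # D is worse than A and better than E
        and pos["B"] != 4                    # B is not last
    )

# Enumerate all 5! = 120 orderings and keep the ones that pass every check.
solutions = [",".join(p) for p in permutations("ABCDE") if satisfies(p)]
print(solutions)  # ['A,B,C,D,E', 'A,D,C,B,E']
```

Under this encoding the reference answer A,D,C,B,E passes, but so does A,B,C,D,E. A is forced into first place and E into last, while nothing in the paraphrased constraints separates B from D. The original prompt presumably contains a further condition that this summary does not capture.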

Common Features of Correct Models

The five models that scored 100 (Doubao Pro, Qwen3 Max, Gemini 2.5 Pro, GPT-5.5, GPT-o3) output A,D,C,B,E directly, without additional explanation. In handling the hard constraints "A > D > E" and "C fixed third," they produced no positional conflicts, which suggests they maintain partial order relations fairly stably.

Typical Failure Paths of Incorrect Models

Among the six models that scored 0, Claude Opus 4.7 showed the most representative behavior. It first wrote A,C,D,B,E, which moves C out of third place, then overturned that answer, and after re-reasoning settled on A,B,C,D,E. The process exposed a position-allocation conflict when the conditions "put A before C" and "D must be after A" had to hold at the same time.

DeepSeek V4 Pro, Gemini 3.1 Pro, Grok 4, Wenxin Yiyan 4.5, and Claude Sonnet 4.6 output A,B,C,D,E directly, which the benchmark scored as incorrect. This ordering does place D after A, so it differs from the reference answer A,D,C,B,E only in the relative order of B and D.

The Real Gap in the Engineering Judgment Dimension

This test falls under the engineering judgment dimension (side ranking, AI-assisted evaluation). The correct models completed the multi-condition sort through internal consistency alone, without external tools; the incorrect models lost at least one constraint along the dependency chain. The result has little to do with knowledge recall or code execution ability. It reflects how robustly a model maintains partial order relations.

When a model cannot hold constraints such as "A must be before C" and "D must be after A" at the same time, the ranking breaks down.

Notably, some of the incorrect models (such as Claude Opus 4.7) attempted to correct themselves before answering and still returned a result scored as wrong, which suggests their internal consistency check did not catch the error.

Implications for Practical Applications

In scenarios that require strict multi-condition sorting (such as task prioritization, resource allocation, and scheduling), calling a current model directly still carries real risk: 6 of the 11 models tested here (about 55%) returned an answer scored as incorrect. In production environments it is advisable to add an external validation layer, or at least to require the model to output its complete reasoning chain so that it can be checked quickly by hand.
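As a rough illustration of such a validation layer, the sketch below parses a model's comma-separated answer and reports which rules it breaks. The function name `check_ranking` and the rule labels are hypothetical, and the rules encode only the four constraints as summarized in this article, not the benchmark's full prompt.

```python
def check_ranking(raw):
    """Return a list of violated rules; an empty list means the answer passes."""
    order = [item.strip() for item in raw.split(",")]
    # Reject malformed answers (missing, duplicated, or unknown names) up front.
    if sorted(order) != ["A", "B", "C", "D", "E"]:
        return ["answer must list A, B, C, D, E exactly once each"]
    pos = {name: i + 1 for i, name in enumerate(order)}  # 1 = best, 5 = last
    rules = {
        "A is better than B": pos["A"] < pos["B"],
        "C is third": pos["C"] == 3,
        "D is better than E": pos["D"] < pos["E"],
        "D is worse than A": pos["A"] < pos["D"],
        "B is not last": pos["B"] != 5,
    }
    return [name for name, ok in rules.items() if not ok]

print(check_ranking("A,D,C,B,E"))  # []
print(check_ranking("A,C,D,B,E"))  # ['C is third']
```

A check like this costs almost nothing per call and turns a silent ranking error into an explicit list of broken rules that can trigger a retry or a manual review. It can only enforce the rules it is given, so any constraint left out of the rule table goes unchecked.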

This 11-model test suggests once again that logical reasoning is not a linear function of model scale; it is a direct test of a model's ability to maintain constraints.


Data source: YZ Index | Run #122