This week, Qwen Max showed a significant decline in the knowledge work dimension, with scores dropping from 81.6 to 71.8, a decrease of 9.8 points. This change primarily stems from the model's severely deteriorated performance in logical reasoning tasks, particularly in the classic "who lied" reasoning puzzle, where scores plummeted from 50 to 25 points.
Systematic Bias Emerges in Logical Reasoning Capabilities
In the "who lied" test question, the model needs to perform logical deduction based on the condition that "exactly one person is lying." The model's response demonstrated a seemingly complete reasoning process:
Reasoning process: 1. Assume A is telling the truth, meaning A didn't steal the cookies. Then exactly one of B and C must be telling the truth and one must be lying... 2. Assume A is lying, meaning A actually stole the cookies. In this case, B's statement "A stole them" is true. C's statement "B stole them" is false.
On the surface, the model conducted a case-by-case analysis, but careful examination reveals a critical gap in its reasoning chain. While the model correctly identified contradictions in the first assumption, it drew its conclusion too hastily and did not verify that the second assumption satisfied every constraint. As quoted, the second case makes both A and C liars, which on its face conflicts with the condition that exactly one person is lying, yet the model never checked this. Such "formally complete but logically unsound" reasoning reflects systematic deficiencies in how the model handles multi-constraint logical reasoning.
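The constraint check at issue here can be made mechanical. The following is a minimal brute-force sketch, not the benchmark's reference solution. It assumes the three statements implied by the quoted reasoning (A: "I didn't steal them", B: "A stole them", C: "B stole them") and further assumes that exactly one of the three took the cookies; the actual test question may be worded differently, and the names `STATEMENTS`, `liars_for`, and `consistent_scenarios` are hypothetical.

```python
# Assumed statements, reconstructed from the quoted reasoning; the real
# test question may differ.
STATEMENTS = {
    "A": lambda thief: thief != "A",  # A: "I didn't steal them"
    "B": lambda thief: thief == "A",  # B: "A stole them"
    "C": lambda thief: thief == "B",  # C: "B stole them"
}


def liars_for(thief):
    """Return the sorted speakers whose statement is false if `thief` did it."""
    return sorted(name for name, says in STATEMENTS.items() if not says(thief))


def consistent_scenarios(num_liars=1):
    """Keep only the candidate thieves that yield exactly `num_liars` liars."""
    return {
        thief: liars_for(thief)
        for thief in STATEMENTS
        if len(liars_for(thief)) == num_liars
    }


if __name__ == "__main__":
    for thief in STATEMENTS:
        print(f"if {thief} stole them, liars = {liars_for(thief)}")
    print("consistent with exactly one liar:", consistent_scenarios())
```

Under these assumptions, the scenario in which A is lying yields two liars (A and C), so it cannot satisfy the "exactly one person is lying" condition; the only scenario that does is the one in which B took the cookies.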
Programming Performance Shows Divergence: Basic Problems Show Clear Regression
Although the programming dimension's overall score rose slightly, by 2.4 points, performance on individual problems diverged sharply. In the concurrent race condition analysis problem, the model correctly identified the core issue, "race conditions in multi-threaded environments," and proposed threading.Lock as a solution, yet the score dropped from 40 to 20 points. This may be because the answer was too general and lacked in-depth analysis of the specific mechanism behind the race condition.
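To illustrate the kind of mechanism-level detail such an answer could include, here is a minimal sketch of a read-modify-write race and its threading.Lock fix. It is not the code from the test question, which is not reproduced in this report; the `Counter` class and the `run` helper are hypothetical.

```python
import threading


class Counter:
    """Shared counter whose increment is a read-modify-write sequence."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        # Read, add, write back: another thread can run between the read
        # and the write, in which case one of the two updates is lost.
        current = self.value
        self.value = current + 1

    def safe_increment(self):
        # The lock turns the whole read-modify-write into a critical section.
        with self._lock:
            current = self.value
            self.value = current + 1


def run(increment_name, num_threads=8, per_thread=10_000):
    """Start `num_threads` threads, each calling the named method `per_thread` times."""
    counter = Counter()
    increment = getattr(counter, increment_name)

    def worker():
        for _ in range(per_thread):
            increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print("with lock:   ", run("safe_increment"))
    print("without lock:", run("unsafe_increment"))
```

With the lock, the final value always equals `num_threads * per_thread`. Without it, the result can fall short, although whether updates are actually lost on a given run depends on the interpreter version and thread scheduling.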
More concerning is the performance regression on the classic FizzBuzz programming problem. The model provided a Python one-liner solution:
return ['Fizz' * (i % 3 == 0) + 'Buzz' * (i % 5 == 0) or str(i) for i in range(1, n+1)]
While this solution is concise and functionally correct, the score dropped from 83.3 to 66.7. This decline in scoring on basic problems may reflect changes in evaluation criteria or deficiencies in the model's handling of code readability, edge cases, and other details.
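The functional correctness of the one-liner rests on two Python details: `+` binds more tightly than `or`, and an empty string is falsy, so numbers divisible by neither 3 nor 5 fall back to `str(i)`. The sketch below wraps the quoted expression in a hypothetical function, `fizzbuzz_one_liner`, and compares it with a more explicit version, `fizzbuzz_explicit`; both names are illustrative, and the evaluation's actual test cases are not known.

```python
def fizzbuzz_one_liner(n):
    # The quoted expression, wrapped in a hypothetical function for testing.
    # `+` binds tighter than `or`, so the two string pieces are joined first;
    # an empty result is falsy and falls back to str(i).
    return ['Fizz' * (i % 3 == 0) + 'Buzz' * (i % 5 == 0) or str(i) for i in range(1, n + 1)]


def fizzbuzz_explicit(n):
    """More verbose equivalent, written for readability."""
    result = []
    for i in range(1, n + 1):
        if i % 15 == 0:
            result.append('FizzBuzz')
        elif i % 3 == 0:
            result.append('Fizz')
        elif i % 5 == 0:
            result.append('Buzz')
        else:
            result.append(str(i))
    return result


if __name__ == "__main__":
    print(fizzbuzz_one_liner(15))
    # Edge case: n = 0 gives an empty list in both versions.
    print(fizzbuzz_one_liner(0) == fizzbuzz_explicit(0))
```

The two versions agree for every non-negative `n`, which supports the reading that the score drop relates to readability or grading criteria rather than to incorrect output.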
Limitations in Long-Text Comprehension Become Apparent
In the contract risk review task, the model's score dropped from 57.1 to 42.9. While the model accurately identified two key risk points—liability for breach and intellectual property—the response was truncated and failed to fully elaborate on all risks. This phenomenon exposes potential issues with uneven attention distribution or improper output length control when the model handles long-text tasks requiring comprehensive analysis.
Technical Analysis and Outlook
Overall, Qwen Max's performance regression this week is concentrated in three areas: insufficient rigor in logical reasoning, declining detail-handling capability in basic programming problems, and lack of completeness in long-text tasks. These issues may stem from parameter adjustments during model training or inference, or may reflect neglect of fundamental capabilities while pursuing optimization of certain metrics.
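For readers who want to compare the declines on a common footing, the before/after scores quoted in this report can be converted into absolute and relative drops. This is a small sketch using only the figures stated above; the `SCORES` mapping and `score_changes` helper are hypothetical names, and the task labels are shorthand.

```python
# Before/after scores quoted in this report: label -> (previous, current).
SCORES = {
    "knowledge work (dimension)": (81.6, 71.8),
    '"who lied" puzzle': (50.0, 25.0),
    "race condition analysis": (40.0, 20.0),
    "FizzBuzz": (83.3, 66.7),
    "contract risk review": (57.1, 42.9),
}


def score_changes(scores):
    """Return {label: (absolute_drop, relative_drop_percent)}, rounded to one decimal."""
    changes = {}
    for label, (before, after) in scores.items():
        drop = before - after
        changes[label] = (round(drop, 1), round(100 * drop / before, 1))
    return changes


if __name__ == "__main__":
    for label, (drop, pct) in score_changes(SCORES).items():
        print(f"{label}: -{drop} points ({pct}% of the previous score)")
```

On these figures, the "who lied" puzzle and the race condition analysis each lost half of their previous score, while the knowledge work dimension as a whole fell by about 12 percent.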
Notably, the model's stability score also decreased by 7.5 points, which corroborates the volatility observed across various task performances. For Qwen Max, positioned as a general-purpose large model, maintaining stability of fundamental capabilities while pursuing innovation will be a key challenge for continuous improvement.
Data source: YZ Index | Raw Data | YZ Index Homepage
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original article when reprinting.