Qwen Max Stability Plummets by 22.8 Points: Possible Model Update Brings Output Quality Volatility

Qwen Max shows a sharp split in this week's evaluation: it improves significantly on complex tasks such as programming and long-context processing, yet its score on the stability dimension collapses. This "fire and ice" performance warrants in-depth analysis.

Specific Manifestations of Stability Collapse

The stability score plummeted from 53.0 to 30.2 points, a drop of 22.8 points or about 43.0% (22.8 / 53.0). The problematic questions are concentrated in tasks that should be fundamental capabilities of the model. The data does not fully present the details of the failed questions, but given the definition of the stability dimension, the drop indicates severe inconsistency in the model's output quality on identical or similar tasks.
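The data does not state how the stability score is calculated, so the following is only an illustrative sketch of one way to quantify "inconsistency on identical tasks": ask the same question several times and measure how often the runs agree. The function name `consistency_rate` and the sample answers are hypothetical and are not taken from the evaluation.

```python
from collections import Counter


def consistency_rate(outputs):
    """Share of runs that agree with the most common answer (1.0 = fully consistent)."""
    if not outputs:
        raise ValueError("need at least one output")
    # Normalize whitespace and case so trivial formatting differences do not count.
    normalized = [o.strip().lower() for o in outputs]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


# Hypothetical answers from five repeated runs of the same prompt.
runs = ["42", "42", "41", "42 ", "forty-two"]
print(consistency_rate(runs))
```

Exact-match agreement only suits short, closed-form answers. Free-form output would need a semantic comparison instead.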

The Contradiction Between Performance Improvement and Stability Decline

The data shows that Qwen Max improved significantly across several dimensions:

  • Programming Capability: Jumped from 20.2 to 58.8 points, a 191% increase
  • Long Context Processing: Improved from 60.2 to 80.6 points, a 33.9% growth
  • Cost-effectiveness: Rose from 27.9 to 42.2 points, a 51.3% increase
  • Knowledge Work: Rose modestly by 6.4 points to 40.8 (from 34.4), an 18.6% increase
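The percentages in the list above follow directly from the before and after scores. A minimal sketch of the arithmetic is below. The Knowledge Work starting score of 34.4 is inferred from "6.4 points to 40.8", and the helper name `relative_change` is our own.

```python
def relative_change(before, after):
    """Percentage change from the earlier score to the later one."""
    return (after - before) / before * 100


scores = {
    "Programming Capability": (20.2, 58.8),
    "Long Context Processing": (60.2, 80.6),
    "Cost-effectiveness": (27.9, 42.2),
    "Knowledge Work": (34.4, 40.8),  # 34.4 inferred: 40.8 - 6.4
}

for name, (before, after) in scores.items():
    points = after - before
    percent = relative_change(before, after)
    print(f"{name}: {points:+.1f} points ({percent:+.1f}%)")
```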

This phenomenon of "capability improvement but stability decline" is not uncommon in AI model updates. It typically points to a core issue: the model may have sacrificed output consistency and predictability while pursuing specific capability improvements.

Possible Technical Cause Analysis

Based on the anomalous patterns in the evaluation data, we speculate the following possibilities:

1. Model Version Switch
Qwen Max may have undergone a version update during the evaluation period. The new version shows stronger capabilities in specific tasks, but its overall output stability may not have been sufficiently validated before release.

2. Training Strategy Adjustment
The substantial improvement in programming capability (191%) suggests possible adoption of new training data or fine-tuning strategies. This targeted optimization may have caused the model's performance to become unstable in other tasks.

3. Inference Parameter Changes
Adjustments to the model's inference configurations, such as temperature parameters and sampling strategies, may have increased randomness in output results, thereby affecting stability scores.
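To illustrate why a temperature change alone could increase output randomness, the sketch below applies temperature scaling to a softmax over made-up logits and reports the entropy of the resulting distribution. The logits and temperature values are hypothetical and say nothing about Qwen Max's actual configuration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in bits; higher means more random sampling."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical logits for four candidate tokens.
logits = [4.0, 2.0, 1.0, 0.5]
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-token p={probs[0]:.2f}, entropy={entropy(probs):.2f} bits")
```

As the temperature rises, probability mass shifts away from the top token and entropy grows, so repeated runs of the same prompt diverge more often.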

Actual Impact on Users

The stability decline affects different user groups differently:

  • Developers: The improvement in programming capability is beneficial, but increased uncertainty in model output may complicate debugging
  • Content Creators: Knowledge work capability shows only modest improvement, but stability decline may lead to content quality fluctuations
  • Enterprise Users: Stability is a critical metric for production environments; a 22.8-point drop may impact business continuity

Outlook and Recommendations

Despite the overall score improving from 42.2 to 56.3, the significant decline in stability cannot be ignored. For the Qwen team, we recommend focusing on the following in subsequent updates:

  • Establish more comprehensive regression testing mechanisms to ensure new versions don't regress on fundamental tasks
  • Maintain balance in overall model performance while pursuing specific capability improvements
  • Provide version selection, allowing users to choose between a stable version and a performance version based on their needs
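As a rough illustration of the first recommendation, a regression gate could compare per-dimension scores between two versions and block a release when any dimension drops too far. The sketch below uses the scores reported in this article. The 5-point threshold and the name `find_regressions` are assumptions chosen for the example.

```python
def find_regressions(previous, current, max_drop=5.0):
    """Return dimensions whose score fell by more than max_drop points."""
    regressions = {}
    for dimension, old_score in previous.items():
        new_score = current.get(dimension)
        if new_score is None:
            continue  # dimension not measured in the new run
        drop = old_score - new_score
        if drop > max_drop:
            regressions[dimension] = round(drop, 1)
    return regressions


previous = {"stability": 53.0, "programming": 20.2,
            "long_context": 60.2, "cost_effectiveness": 27.9}
current = {"stability": 30.2, "programming": 58.8,
           "long_context": 80.6, "cost_effectiveness": 42.2}

flagged = find_regressions(previous, current)
print(flagged)  # only dimensions that regressed beyond the threshold
```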

For users, until Qwen Max's stability issues are resolved, we recommend exercising caution in critical business scenarios or adopting a multi-model validation strategy to safeguard output quality.
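One simple form of multi-model validation is majority voting: send the same prompt to several models and accept an answer only when enough of them agree. The sketch below uses stand-in functions in place of real model clients. All names and the `min_agreement` default are hypothetical.

```python
from collections import Counter


def validate_across_models(prompt, models, min_agreement=2):
    """Query every model and accept an answer only if enough of them agree."""
    if not models:
        raise ValueError("need at least one model")
    answers = [model(prompt).strip() for model in models]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes >= min_agreement:
        return answer
    return None  # no consensus: escalate to human review or retry


# Stand-in callables; real code would wrap each provider's client here.
def model_a(prompt):
    return "Paris"


def model_b(prompt):
    return "Paris "


def model_c(prompt):
    return "Lyon"


print(validate_across_models("Capital of France?", [model_a, model_b, model_c]))
```

Exact-match voting works for short factual answers. Longer outputs would need a similarity measure or a separate reviewer step to decide whether two responses agree.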


Data source: YZ Index | Raw Data