Claude Opus 4.6 Stability Plummets 22.5 Points: Output Format Chaos Raises Concerns

This week's evaluation data shows that Claude Opus 4.6's stability score fell sharply, from 53.5 points last week to 31.0 points, a drop of 22.5 points (42.1%). The decline has raised concerns about the stability of this model version.

Specific Manifestations of Stability Issues

A detailed analysis of the failed test questions shows that the stability issues cluster in two areas: inconsistent output formats and inconsistent responses in multi-turn dialogue.

Test Question Example: Generate product information as structured JSON
Expected Output: Standard JSON
Actual Output: Some test runs returned mixed formats that included Markdown and plain-text content
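
A format failure of this kind can be detected mechanically. The following is a minimal sketch, not the YZ Index's actual grading code: the function name `classify_output` and the sample strings are hypothetical, and the rule assumed here (the entire response must parse as a JSON object or array) is only one reasonable definition of "standard JSON".

```python
import json


def classify_output(raw: str) -> str:
    """Label a response 'json' if the entire text parses as a JSON object or array, else 'mixed'."""
    try:
        parsed = json.loads(raw.strip())
    except json.JSONDecodeError:
        return "mixed"
    # Bare scalars such as "42" parse as JSON but are not structured product data.
    return "json" if isinstance(parsed, (dict, list)) else "mixed"


fence = "`" * 3  # built at runtime so this example contains no literal Markdown fence

# Hypothetical responses, made up for illustration only.
samples = {
    "clean": '{"name": "Widget", "price": 9.99}',
    "markdown_wrapped": f'{fence}json\n{{"name": "Widget", "price": 9.99}}\n{fence}',
    "prose_prefixed": 'Here is the product: {"name": "Widget", "price": 9.99}',
}

labels = {name: classify_output(text) for name, text in samples.items()}
print(labels)
```

Under this strict rule, a response that wraps otherwise valid JSON in a Markdown fence or prefixes it with a sentence counts as a mixed-format failure, which matches the kind of output described in the example above.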

In multi-turn dialogue tests, the model was also noticeably inconsistent in how it interpreted context. Responses to the same question asked at different times varied significantly, which directly lowered the stability score.
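
The article does not say how the stability score is calculated. As a purely illustrative proxy, response consistency can be summarized as the share of repeated answers that agree with the most common one. The function `agreement_rate` and the sample answers below are assumptions for this sketch, not part of the evaluation.

```python
from collections import Counter


def agreement_rate(responses: list[str]) -> float:
    """Share of responses that match the most common answer after light normalization."""
    if not responses:
        raise ValueError("need at least one response")
    # Collapse whitespace and ignore case so trivial differences do not count as disagreement.
    normalized = [" ".join(r.split()).lower() for r in responses]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


# Hypothetical answers to the same question asked four times.
answers = ["Paris", "paris ", "Lyon", "Paris"]
rate = agreement_rate(answers)
print(f"agreement rate: {rate:.2f}")
```

Exact-match agreement is a deliberately crude choice. It suits short factual answers, while longer free-form responses would need a similarity measure instead.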

The Contradiction Between Performance Improvement and Stability Decline

Notably, while stability plummeted, Claude Opus 4.6 improved in other dimensions:

  • Programming Capability: Jumped from 20.2 to 62.2 points, a 208% increase
  • Long Context Processing: Improved from 66.7 to 74.6 points, an 11.8% increase
  • Knowledge Work Capability: Rose from 37.8 to 43.3 points, a 14.6% increase

This trade-off pattern suggests the model may have been optimized for specific capabilities, possibly at the cost of output consistency.
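
The percentages quoted so far are relative changes against the previous week's score. A short sketch that reproduces them from the figures in this article (the helper name `percent_change` is ours):

```python
def percent_change(previous: float, current: float) -> float:
    """Relative change from previous to current, in percent."""
    if previous == 0:
        raise ValueError("previous score must be non-zero")
    return (current - previous) / previous * 100.0


# (last week, this week) scores as quoted in the article.
scores = {
    "stability": (53.5, 31.0),
    "programming": (20.2, 62.2),
    "long_context": (66.7, 74.6),
    "knowledge_work": (37.8, 43.3),
}

changes = {
    name: round(percent_change(prev, cur), 1)
    for name, (prev, cur) in scores.items()
}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

Note that the stability figure is a 22.5-point drop in absolute terms and a 42.1% drop in relative terms. The two should not be confused when comparing dimensions that start from very different baselines.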

Analysis of Possible Technical Reasons

Based on evaluation data and industry experience, the stability decline may stem from the following technical factors:

1. Side Effects of Model Weight Adjustments
Fine-tuning aimed at improving programming capability may have affected output stability in other tasks. Programming tasks typically demand stronger logical reasoning, and strengthening that ability may have shifted the model's overall behavior.

2. Changes in Sampling Parameter Configuration
The output inconsistency suggests that the temperature or other sampling settings may have been adjusted. A higher temperature flattens the probability distribution over candidate tokens, which can increase creativity but also makes output less predictable.
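
The effect of temperature can be shown with a standard temperature-scaled softmax. This is a generic sketch with made-up logits. It does not reflect the model's actual sampling configuration, which the evaluation data does not reveal.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

low = softmax_with_temperature(logits, 0.5)
high = softmax_with_temperature(logits, 2.0)

print("T=0.5:", [round(p, 3) for p in low])
print("T=2.0:", [round(p, 3) for p in high])
```

At the lower temperature most of the probability mass sits on the top token, so repeated runs tend to pick the same continuation. At the higher temperature the mass spreads out, and repeated runs diverge more often.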

3. Side Effects from Inference Optimization
The cost-effectiveness score rose from 2.8 to 4.0 points (a 42.9% increase), which hints at possible inference efficiency optimization. Such optimization is sometimes achieved through techniques like quantization or pruning, which reduce numerical precision or remove parameters and can therefore affect model stability.
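
To illustrate why quantization can perturb outputs, the sketch below applies simple symmetric 8-bit quantization to a handful of made-up weights and measures the round-trip error. It is a toy example under stated assumptions. Nothing in the evaluation data confirms that quantization was applied to this model, and production schemes are considerably more elaborate.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Map the integers back to approximate floating-point weights."""
    return [q * scale for q in quantized]


weights = [0.8213, -0.0041, 0.3307, -1.2750]  # hypothetical values

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", quantized)
print("max round-trip error:", max_error)
```

Each weight moves by at most half a quantization step. Across a very large number of weights, small shifts like these can change which token ends up with the highest probability, which is one way efficiency work could show up as reduced consistency.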

Actual Impact on Users

The impact of stability decline varies across different application scenarios:

  • Production Environment Applications: Enterprise applications that require highly consistent output may face challenges
  • Creative Tasks: Scenarios that require output diversity might actually benefit
  • Development and Debugging Scenarios: The significant improvement in programming capability makes the model more competitive in code-related tasks

Outlook and Recommendations

Claude Opus 4.6's overall score improved from 40.3 to 51.3, indicating that overall performance is still advancing. However, the significant stability decline is a reminder that AI model optimization is a complex process that requires balance across multiple dimensions.

We recommend that users choose a model version based on their specific application scenario. Applications that require highly consistent output may need to wait for stability improvements in subsequent versions. For programming and long-context processing tasks, the capabilities of the new version are worth trying.

These evaluation results illustrate once again that AI model evolution is not linear progress, but a continuous search for the best balance among different capability dimensions.


Data Source: YZ Index | Raw Data