Claude Opus 4.6 Stability Plummets 22.5 Points: Output Format Chaos Raises Concerns

Mar 22, 2026 | Source: winzheng.com

Tags: Claude, Stability Testing, AI Evaluation, Performance Fluctuation, Output Format

This week's evaluation data shows a sharp drop in Claude Opus 4.6's stability score, which fell from 53.5 points last week to 31.0 points, a decline of 22.5 points (42.1%). The result has raised concern in the industry about the stability of this model version.

Specific Manifestations of Stability Issues

A detailed review of the failed test questions shows that the stability problems are concentrated in two areas: output format and multi-turn consistency.

Test Question Example: Generate structured JSON format product information
Expected Output: Standard JSON format
Actual Output: Some tests returned mixed formats, including Markdown and plain text content
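As a rough illustration of the kind of check such a test implies, the sketch below flags an output as a failure unless the entire response parses as JSON. The function name `is_strict_json` and both sample strings are hypothetical and invented for this example; the evaluation's actual grading logic is not described in the source.

```python
import json


def is_strict_json(output: str) -> bool:
    """Return True only if the whole output parses as a JSON object or array."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        # Any surrounding prose or Markdown makes the whole string invalid JSON.
        return False
    # A bare string or number is valid JSON but not structured product data.
    return isinstance(parsed, (dict, list))


# Hypothetical model outputs, invented for illustration only.
clean = '{"name": "Desk Lamp", "price": 29.99}'
mixed = '**Product**\n\n{"name": "Desk Lamp"}\n\nPrice: 29.99'

print(is_strict_json(clean))  # True
print(is_strict_json(mixed))  # False
```

A strict parse like this treats mixed Markdown-and-JSON output as a failure even when the JSON fragment inside it is correct, which matches the failure mode described above.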

In multi-turn dialogue tests, the model handled context inconsistently. The same question asked at different times drew significantly different responses, which directly lowered the stability score.
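One simple way to quantify this kind of inconsistency is to ask the same question several times and measure how often the answers agree. The sketch below is an assumption for illustration, not the scoring method used in the evaluation: `agreement_rate` is a hypothetical helper, and the sample responses are made up.

```python
from collections import Counter


def agreement_rate(responses: list[str]) -> float:
    """Share of responses that match the most common answer after normalization."""
    if not responses:
        raise ValueError("need at least one response")
    # Collapse whitespace and ignore case so trivial differences do not count.
    normalized = [" ".join(r.split()).lower() for r in responses]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


# Hypothetical answers to the same question from four separate runs.
runs = ["Paris", "paris ", "Paris", "Lyon"]
print(agreement_rate(runs))  # 0.75
```

Exact-match agreement is deliberately crude. Free-form answers would need a semantic comparison, but the idea of scoring repeated runs against each other stays the same.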

The Contradiction Between Performance Improvement and Stability Decline

Notably, while stability plummeted, Claude Opus 4.6 excelled in other dimensions:

Programming Capability Leap: Increased from 20.2 to 62.2 points, a 208% growth
Long Context Processing: Improved from 66.7 to 74.6 points, an 11.8% increase
Knowledge Work Capability: Rose from 37.8 to 43.3 points, a 14.6% growth

This trade-off suggests the model may have been optimized for specific capabilities, possibly at the cost of output consistency.
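The percentage changes quoted in this article follow from the usual relative-change formula, (new - old) / old. The short sketch below recomputes them from the scores given in the text; the dictionary layout and the name `percent_change` are choices made for this example only.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed as a percentage."""
    return (new - old) / old * 100


# (last week, this week) scores as reported in the article.
scores = {
    "Stability": (53.5, 31.0),
    "Programming": (20.2, 62.2),
    "Long context": (66.7, 74.6),
    "Knowledge work": (37.8, 43.3),
    "Cost-effectiveness": (2.8, 4.0),
}

for name, (old, new) in scores.items():
    print(f"{name}: {percent_change(old, new):+.1f}%")
```

Run as written, the loop reproduces the article's percentages to one decimal place. The programming gain comes out at 207.9%, which the article rounds to 208%.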

Analysis of Possible Technical Reasons

Based on evaluation data and industry experience, the stability decline may stem from the following technical factors:

1. Side Effects of Model Weight Adjustments
Fine-tuning aimed at programming capability may have affected output stability on other tasks. Programming tasks typically demand stronger logical reasoning, and strengthening that ability may have shifted the model's overall behavior.

2. Changes in Sampling Parameter Configuration
The inconsistent output could also reflect a change to the temperature parameter or other sampling settings. A higher temperature makes output more varied, which can help creative tasks but reduces predictability.
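To make the temperature effect concrete, the sketch below applies the standard temperature-scaled softmax to a set of made-up token scores. It illustrates the general mechanism only; nothing in the source confirms which sampling settings, if any, were changed for this model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities, dividing by temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top-scoring token toward the alternatives, so repeated sampling is more likely to produce different outputs for the same prompt.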

3. Side Effects from Inference Optimization
The cost-effectiveness score rose from 2.8 to 4.0 points (a 42.9% increase), which hints at inference efficiency work. Such optimization is sometimes achieved through techniques like quantization or pruning, both of which can alter model behavior and may affect stability.
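For readers unfamiliar with quantization, the sketch below shows why it can perturb a model: storing weights as 8-bit integers introduces small rounding errors. This is a generic symmetric int8 scheme with made-up weights, shown under the assumption that some form of quantization was used; the source does not confirm that it was, nor which scheme.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integer representation."""
    return [v * scale for v in quantized]


# Made-up weight values for illustration.
weights = [0.8123, -0.0457, 0.3301, -1.2700]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
errors = [abs(a - b) for a, b in zip(weights, restored)]

print(quantized)
print(max(errors))
```

Each individual error is at most half the scale factor, but across billions of weights such perturbations can shift which output the model prefers in borderline cases.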

Actual Impact on Users

The impact of stability decline varies across different application scenarios:

Production Environment Applications: Enterprise applications requiring highly consistent output may face challenges
Creative Tasks: Scenarios requiring output diversity might actually benefit
Development and Debugging Scenarios: The significant improvement in programming capabilities makes it more competitive in code-related tasks

Outlook and Recommendations

Claude Opus 4.6's overall score improved from 40.3 to 51.3, indicating that overall performance is still advancing. However, the significant stability decline reminds us that AI model optimization is a complex process requiring balance across multiple dimensions.

For users, we recommend selecting appropriate model versions based on specific application scenarios. If applications require high output consistency, it may be necessary to wait for stability improvements in subsequent versions; for programming and long-text processing tasks, the capabilities demonstrated by the new version are worth trying.

These evaluation results suggest once more that AI model evolution is not linear progress, but an ongoing search for the best balance among different capability dimensions.

Data Source: YZ Index | Raw Data

