DeepSeek V3 Stability Plunges 21.4 Points: In-Depth Analysis of Model Output Consistency Crisis

DeepSeek V3 posted contradictory results in this week's evaluation: several capability metrics improved significantly and the overall score rose from 52.9 to 66.6, yet the stability dimension fell off a cliff. This pattern of "enhanced capabilities but unstable output" deserves in-depth analysis.

Stability Metrics Interpretation: From Middling to Dangerous

The stability score dropped from 53.4 to 32.0, a fall of 21.4 points, which means answer quality now fluctuates far more under the same or similar inputs. The YZ Index measures the stability dimension by calculating the standard deviation of scores from multiple tests, so a lower stability score corresponds to a wider spread between runs. At 32.0, DeepSeek V3's output consistency has fallen to a dangerous level.
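
To make the idea of a standard deviation over multiple tests concrete, here is a minimal sketch. The per-run scores are invented for illustration, and the YZ Index's actual formula for converting spread into a 0-100 stability score is not published here, so the sketch stops at the raw spread.

```python
import statistics

def score_spread(scores):
    """Return (mean, population standard deviation) of repeated-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to measure spread")
    return statistics.fmean(scores), statistics.pstdev(scores)

# Hypothetical quality scores (0-100) for one prompt asked five times.
steady = [70, 72, 69, 71, 70]
erratic = [95, 40, 88, 35, 94]

steady_mean, steady_sd = score_spread(steady)
erratic_mean, erratic_sd = score_spread(erratic)

print(f"steady:  mean={steady_mean:.1f}, sd={steady_sd:.2f}")
print(f"erratic: mean={erratic_mean:.1f}, sd={erratic_sd:.2f}")
```

Both hypothetical series have the same mean, but the second has a far larger standard deviation. That is the situation a low stability score describes: average quality can look acceptable while individual answers vary widely.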

Specifically, users may encounter situations like this: for the same programming question, the first query yields a perfect solution, but the second receives code riddled with errors; for the same knowledge-based question, the answer quality may shift from professional and in-depth to superficial and erroneous.

The Paradox of Performance Improvement and Stability Deterioration

The data presents an interesting paradox:

  • Code execution capability surged by 42.6 points (20.2→62.8), an increase of 211%
  • Material constraint score grew by 15.9 points (62.3→78.2), an increase of 25.5%
  • Knowledge synthesis capability improved by 7.9 points (36.4→44.3), an increase of 21.7%

These improvements indicate that the model's peak performance on specific tasks has indeed been enhanced, but the collapse in stability means that this high performance is not reproducible every time.
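
The point gains and percentage increases quoted above can be rechecked with a few lines of arithmetic. The sketch below uses only the before and after scores reported in this article; the helper name `change` is illustrative.

```python
def change(before, after):
    """Return (point gain, relative change in percent), each rounded to one decimal."""
    gain = after - before
    return round(gain, 1), round(gain / before * 100, 1)

# Before/after scores as reported in this article.
metrics = {
    "code execution": (20.2, 62.8),
    "material constraint": (62.3, 78.2),
    "knowledge synthesis": (36.4, 44.3),
    "stability": (53.4, 32.0),
}

for name, (before, after) in metrics.items():
    gain, pct = change(before, after)
    print(f"{name}: {gain:+.1f} points ({pct:+.1f}%)")
```

Up to rounding, the results match the listed figures (210.9% is reported as 211%). The same calculation puts the stability drop at roughly 40% in relative terms.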

Possible Technical Cause Analysis

1. Adjustment in Model Weight Update Strategy
DeepSeek may have adopted a more aggressive parameter optimization strategy in pursuit of higher task completion rates. While this strategy raises the quality ceiling of optimal outputs, it also increases the variance in the output distribution.

2. Changes in Temperature Parameter or Sampling Strategy
To enhance creativity and problem-solving, the serving configuration may have raised the sampling temperature or changed the top-p/top-k sampling strategy. Either change would directly increase the randomness of outputs, which shows up as a decline in stability.
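
The link between temperature and output randomness can be shown with a small, generic sketch of temperature-scaled softmax. This is not DeepSeek's decoding code, and the logits are hypothetical values chosen for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in bits; larger values mean more random sampling."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical next-token logits

for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-token p={probs[0]:.3f}, entropy={entropy(probs):.3f} bits")
```

As the temperature rises, the top token's probability falls and the entropy grows, so repeated runs of the same prompt diverge more often. Lowering the temperature has the opposite effect, at the cost of less varied output.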

3. Imbalance in Multi-Task Learning Trade-offs
The large gain in code execution capability (+42.6 points) may have come at the expense of stability on other tasks. Strengthening certain capabilities can disturb the balance the model previously held across tasks.

Actual Impact on Users

A stability score of 32.0 means that using DeepSeek V3 in production environments carries higher risk. In critical business scenarios, teams should implement multiple verification mechanisms or consider rolling back to a more stable version.
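
One possible shape for such a verification mechanism is to sample several responses, discard those that fail a task-specific check, and accept the majority answer only when enough samples agree. The sketch below assumes two hypothetical callables, `generate` (a wrapper around a model call) and `is_valid` (a checker); neither belongs to any real DeepSeek API, and the thresholds are illustrative.

```python
import itertools
from collections import Counter
from typing import Callable, Optional

def verified_answer(
    generate: Callable[[str], str],   # hypothetical wrapper around a model call
    is_valid: Callable[[str], bool],  # hypothetical task-specific check
    prompt: str,
    samples: int = 5,
    min_agreement: float = 0.5,
) -> Optional[str]:
    """Sample several responses, drop invalid ones, and return the majority
    answer only if a large enough share of all samples agrees on it."""
    responses = [generate(prompt) for _ in range(samples)]
    valid = [r for r in responses if is_valid(r)]
    if not valid:
        return None
    answer, votes = Counter(valid).most_common(1)[0]
    return answer if votes / samples >= min_agreement else None

# Stand-in for an unstable model: cycles through canned responses.
canned = itertools.cycle(["42", "41", "error", "42", "42"])
result = verified_answer(lambda _prompt: next(canned), str.isdigit, "What is 6 * 7?")
print(result)
```

Returning `None` when agreement is too low lets the caller fall back to a human review or a different model instead of acting on an unreliable answer. The price is several model calls per request, which has to be weighed against the cost advantage.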

For developers, this instability makes debugging harder: the same prompt may produce vastly different results, which complicates isolating the cause of a problem.

Outlook and Suggestions

DeepSeek V3's update highlights a classic dilemma in AI model optimization: balancing the pursuit of a higher capability ceiling against the need for stable output. The near-perfect cost-performance score (99.1) indicates excellent cost control, but the loss of stability may offset this advantage.

The DeepSeek team should prioritize the stability issue. Options include introducing training objectives with output consistency constraints, implementing stricter quality control mechanisms, or offering a stability-prioritized inference mode that users can select. On the path to practical AI, stability and reliability matter more than occasional brilliance.


Data source: YZ Index | Raw data