Doubao Pro's Stability Plummets by 19.8 Points, Inconsistent Responses Become Its Achilles' Heel

The latest YZ Index evaluation results for Doubao Pro are striking: its stability score fell from 54.5 to 34.7, a drop of 19.8 points. The problem behind this number is more serious than it first appears. When an AI model can't even "make up its mind," how can users trust it?

Stability Collapse: From "Fairly Reliable" to "Wavering"

It's important to clarify that the "stability" dimension in the YZ Index doesn't measure accuracy, but rather the consistency of the model's responses. The score is computed as max(0, 100 - stddev × 2), where stddev is the standard deviation of the scores given to multiple responses to similar questions. Working backward, a score of 34.7 implies a standard deviation of about 32.65 points (up from 22.75 at a score of 54.5), which means Doubao Pro's answer quality fluctuates wildly when facing the same or similar questions.
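To make the formula concrete, here is a minimal Python sketch. It assumes the population standard deviation (the article does not say whether the YZ Index uses the population or sample form), and the per-response scores are invented for illustration rather than taken from the benchmark. The function names are hypothetical as well.

```python
import statistics

def stability_score(scores):
    """Stability as quoted in the article: max(0, 100 - stddev * 2)."""
    # Assumption: population standard deviation; the source does not specify.
    stddev = statistics.pstdev(scores)
    return max(0.0, 100.0 - 2.0 * stddev)

def implied_stddev(stability):
    """Invert the formula for a stability score above the zero floor."""
    return (100.0 - stability) / 2.0

# Hypothetical scores for repeated answers to one group of similar questions.
steady = [70, 72, 68, 71, 69]
erratic = [95, 30, 80, 20, 75]

print("steady :", round(stability_score(steady), 1))
print("erratic:", round(stability_score(erratic), 1))

# Standard deviations implied by the two published stability scores.
print("implied stddev:", implied_stddev(54.5), "->", implied_stddev(34.7))
```

Both sample lists have respectable averages, yet the erratic one scores far lower on stability. That is the point of the metric: it penalizes spread, not a low mean.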

To put it in perspective: It's like a doctor who prescribes cold medicine for your symptoms today, but tomorrow suggests you might have pneumonia for the same symptoms. This inconsistency is fatal in AI applications, especially in production environments requiring stable output. What enterprise users fear most is the "works today, suddenly fails tomorrow" scenario.

Comprehensive Scores Reveal Deeper Issues

Let's look at Doubao Pro's complete performance under the v6 evaluation system:

  • Code Execution: 65.00 points - Mediocre; it can generally complete simple programming tasks
  • Material Constraints: 77.40 points - Doubao Pro's bright spot, indicating good performance in following given materials and constraints
  • Engineering Judgment (side list, AI-assisted evaluation): 49.90 points - A failing grade; its judgment in real engineering scenarios is concerning
  • Task Expression (side list, AI-assisted evaluation): 27.10 points - A very poor score, showing severely inadequate ability to understand and express task requirements

The main list composite score of 70.58 points looks passable. But given the sharp drop in stability, it is fair to question how much this score is worth. Between an unstable 70 and a stable 60, which would you choose?

Cost-Effectiveness Up 17 Points: Price Cut or Optimization?

Interestingly, Doubao Pro's cost-effectiveness improved from 71 to 88 points, an increase of 17 points. This typically points to one of two things: a price reduction, or better performance at the same price. Given the significant decline in stability, I'm more inclined to believe this reflects a pricing strategy adjustment.

After all, is an unstable bargain really more valuable than a stable but slightly more expensive product? This is a question every procurement decision-maker needs to seriously consider.

The "False Prosperity" of Legacy Dimensions

If you only look at the legacy dimension data, you might think Doubao Pro has made tremendous progress:

  • Programming Ability: Soared from 23.2 to 65.6 points (+42.4)
  • Knowledge Work: Improved from 38.8 to 49.6 points (+10.8)
  • Long Context: Increased from 62.3 to 77.4 points (+15.1)

But these improvements pale in comparison to the stability collapse. A model that can produce excellent code today but might output garbage tomorrow is a developer's nightmare. It's like a sharp sword that could break at any moment: it looks strong, but you wouldn't dare use it.

Speculation on Underlying Technical Causes

A significant drop in stability typically points to several possible technical causes:

1. Overly Aggressive Model Updates - Immature optimization strategies may have been adopted to quickly improve performance in certain dimensions

2. Improper Inference Parameter Adjustments - Changes to sampling parameters such as temperature and top-p may have increased output randomness

3. Load Balancing Issues - Different inference nodes might be running different versions or configurations of the model

4. Training Data Contamination - Newly added training data may have introduced conflicts or noise
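The second cause in the list is the easiest to illustrate. The sketch below shows how temperature reshapes a token distribution: dividing the logits by a larger temperature flattens the probabilities, so sampling becomes less predictable. The logits are made up for illustration and say nothing about Doubao Pro's actual configuration, which the source does not disclose.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; higher means more random sampling."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative logits for four candidate tokens (not from any real model).
logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.2, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-token p={probs[0]:.3f}, "
          f"entropy={entropy_bits(probs):.3f} bits")
```

As the temperature rises, the top token's probability falls and the entropy grows. If a serving configuration changed in this direction, repeated answers to the same question would diverge more often, which is consistent with (though not proof of) the drop described above.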

Actual Impact on Users

The impact of this stability decline varies for different types of users:

Individual users might not feel it deeply; occasional "glitches" can be resolved by re-asking. But for enterprise users, especially companies that have integrated Doubao Pro into their production processes, this is a serious risk signal. Imagine if your customer service bot is polite today but suddenly becomes nonsensical tomorrow—what would customers think?

Developers are affected the most. Code generation, debugging suggestions, and architecture design all require high consistency. An unstable programming assistant is worse than no assistant because it introduces unpredictable errors.

Position Changes in the Competitive Landscape

In the current AI model competitive landscape, stability is an underestimated but extremely important metric. In my view, GPT-4's ability to maintain market leadership owes a great deal to its consistent output. Users are willing to pay a premium for reliability.

Doubao Pro's significant stability decline might deter users who were considering migrating from other models. In this critical period of AI implementation, "cheap but unstable" is not an attractive label.

Suggestions for the Doubao Team

As a long-time observer of AI development, I'd like to offer the Doubao team some suggestions:

1. Immediately investigate the root cause of stability issues - This should be the highest priority task

2. Establish stricter version control and testing processes - Any updates should undergo stability testing

3. Consider offering both "stable" and "experimental" versions - Let users choose independently

4. Strengthen communication with users - Proactively explain issues and improvement plans
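The second suggestion, stability testing before every update, could take the form of a simple release gate. The following is a rough sketch, not a description of any team's actual process: it reuses the stability formula quoted earlier in the article, assumes the population standard deviation, and uses a hypothetical threshold (max_drop) and invented scores.

```python
import statistics

def stability_score(scores):
    # Formula quoted in the article: max(0, 100 - stddev * 2).
    # Assumption: population standard deviation.
    return max(0.0, 100.0 - 2.0 * statistics.pstdev(scores))

def passes_stability_gate(baseline_scores, candidate_scores, max_drop=5.0):
    """Return (ok, drop); ok is False when stability falls by more than max_drop."""
    baseline = stability_score(baseline_scores)
    candidate = stability_score(candidate_scores)
    drop = baseline - candidate
    return drop <= max_drop, drop

# Hypothetical scores from repeated runs over the same prompt set.
current_release = [70, 74, 68, 72, 71]
new_build = [90, 45, 85, 50, 80]

ok, drop = passes_stability_gate(current_release, new_build)
if ok:
    print("release allowed")
else:
    print(f"blocked: stability fell {drop:.1f} points")
```

In this example the new build has a similar average to the current release but much higher variance, so the gate blocks it. The threshold is a policy choice; a team would tune it against how much run-to-run variation its users can tolerate.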

"In the AI era, stability trumps everything. A stable model scoring 90 points is far superior to one that swings between 60 and 100 points. Because once trust is lost, it's very difficult to rebuild."

Doubao Pro's performance this time sounds an alarm for the entire industry: while pursuing performance improvements, never neglect the fundamental skill of stability. After all, users don't need occasional brilliance—they need consistent reliability.


Data source: YZ Index | Run #37