Doubao Pro's Stability Plummets by 19.8 Points, Inconsistent Responses Become Its Achilles' Heel

Mar 24, 2026 - Winzheng Index

Doubao Pro, Stability, Model Consistency, Evaluation Analysis, AI Reliability

The latest YZ Index evaluation results for Doubao Pro are jaw-dropping: its stability dimension plummeted from 54.5 points to 34.7 points, a staggering drop of 19.8 points. The problem hidden behind this number is far more serious than it appears on the surface—when an AI model can't even "make up its mind," how can users trust it?

Stability Collapse: From "Fairly Reliable" to "Wavering"

It's important to clarify that the "stability" dimension in the YZ Index doesn't measure accuracy, but rather the consistency of the model's responses. The formula is max(0, 100 - stddev×2), where stddev is the standard deviation of the scores the model earns across multiple responses to similar questions. Working backward, a score of 34.7 implies a standard deviation of about 32.65 points, up from about 22.75 at the previous score of 54.5. In other words, Doubao Pro's answer quality fluctuates wildly when it faces the same or similar questions.
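To make the formula concrete, here is a minimal Python sketch of it. The function names are my own, and I'm assuming a population standard deviation; the YZ Index doesn't say whether it uses the population or sample form, so treat the exact numbers as illustrative.

```python
import statistics

def stability_score(scores):
    """Stability as described in the article: max(0, 100 - 2 * stddev).

    Assumption: population standard deviation (statistics.pstdev).
    The index does not state which variant it uses.
    """
    stddev = statistics.pstdev(scores)
    return max(0.0, 100.0 - 2.0 * stddev)

def implied_stddev(stability):
    """Invert the formula for a score above 0 (where the max() clamp is inactive)."""
    return (100.0 - stability) / 2.0

if __name__ == "__main__":
    # Made-up per-response scores, only to show how spread drives the result.
    print(f"tight spread: {stability_score([70, 72, 68, 71]):.1f}")
    print(f"wide spread:  {stability_score([40, 95, 60, 85]):.1f}")
    # Standard deviations implied by the two published stability scores.
    print(f"stddev at 54.5: {implied_stddev(54.5):.2f}")
    print(f"stddev at 34.7: {implied_stddev(34.7):.2f}")
```

Note that the inversion only works while the score is above zero: once the standard deviation reaches 50, the score is clamped to 0 and larger swings no longer register.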

To put it in perspective: It's like a doctor who prescribes cold medicine for your symptoms today, but tomorrow suggests you might have pneumonia for the same symptoms. This inconsistency is fatal in AI applications, especially in production environments requiring stable output. What enterprise users fear most is the "works today, suddenly fails tomorrow" scenario.

Comprehensive Scores Reveal Deeper Issues

Let's look at Doubao Pro's complete performance under the v6 evaluation system:

Code Execution: 65.00 points - Mediocre; it can basically complete simple programming tasks
Material Constraints: 77.40 points - Doubao Pro's bright spot, indicating good performance in following given materials and constraints
Engineering Judgment (side list, AI-assisted evaluation): 49.90 points - A failing grade; its judgment in real engineering scenarios is concerning
Task Expression (side list, AI-assisted evaluation): 27.10 points - Simply catastrophic, showing a severely inadequate ability to understand and express task requirements

The main list composite score of 70.58 points looks passable. Given the plunge in stability, though, how much that score is really worth is an open question. Between an unstable 70 and a stable 60, which would you choose?

Cost-Effectiveness Up 17 Points: Price Cut or Optimization?

Interestingly, Doubao Pro's cost-effectiveness improved from 71 to 88 points, an increase of 17 points. A gain like this typically has one of two explanations: a price cut, or better performance at the same price. Given the significant decline in stability, I'm more inclined to believe it reflects a pricing strategy adjustment.

After all, is an unstable bargain really more valuable than a stable but slightly more expensive product? This is a question every procurement decision-maker needs to seriously consider.

The "False Prosperity" of Legacy Dimensions

If you only look at the legacy dimension data, you might think Doubao Pro has made tremendous progress:

Programming Ability: Soared from 23.2 to 65.6 points (+42.4)
Knowledge Work: Improved from 38.8 to 49.6 points (+10.8)
Long Context: Increased from 62.3 to 77.4 points (+15.1)

But these improvements pale in comparison to the stability collapse. A model that can produce excellent code today but might output garbage tomorrow is a developer's nightmare. It's like a sharp sword that could break at any moment—looks strong but you wouldn't dare use it.

Speculation on Underlying Technical Causes

A significant drop in stability typically points to several possible technical causes:

1. Overly Aggressive Model Updates - Immature optimization strategies may have been adopted to quickly improve performance in certain dimensions

2. Improper Inference Parameter Adjustments - Changes to sampling parameters such as temperature and top-p may have increased output randomness

3. Load Balancing Issues - Different inference nodes might be running different versions or configurations of the model

4. Training Data Contamination - Newly added training data may have introduced conflicts or noise
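Of the four causes above, the second is the easiest to illustrate. Below is a rough sketch of how temperature and top-p shape a token distribution, using a generic softmax and nucleus filter. This is textbook sampling logic, not Doubao's actual inference code, and the logits are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

def entropy(probs):
    """Shannon entropy in nats; a rough proxy for how random sampling will be."""
    return -sum(p * math.log(p) for p in probs if p > 0)

if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical next-token logits
    for temperature in (0.2, 1.0, 1.5):
        probs = softmax_with_temperature(logits, temperature)
        print(f"T={temperature}: entropy={entropy(probs):.3f}")
    print(top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.9))
```

At a low temperature nearly all the probability sits on one token, so repeated calls tend to agree. Raise the temperature or loosen top-p and more tokens stay in play, which is exactly the kind of change that would push up response-to-response variance.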

Actual Impact on Users

The impact of this stability decline varies for different types of users:

Individual users might not feel it deeply; occasional "glitches" can be resolved by re-asking. But for enterprise users, especially companies that have integrated Doubao Pro into their production processes, this is a serious risk signal. Imagine if your customer service bot is polite today but suddenly becomes nonsensical tomorrow—what would customers think?

Developers are affected the most. Code generation, debugging suggestions, and architecture design all require high consistency. An unstable programming assistant is worse than no assistant because it introduces unpredictable errors.

Position Changes in the Competitive Landscape

In the current AI model competitive landscape, stability is an underestimated but extremely important metric. GPT-4's ability to maintain market leadership is largely due to its excellent stability. Users are willing to pay a premium for reliability.

Doubao Pro's significant stability decline might deter users who were considering migrating from other models. In this critical period of AI implementation, "cheap but unstable" is not an attractive label.

Suggestions for the Doubao Team

As a long-time observer of AI development, I'd like to offer the Doubao team some suggestions:

1. Immediately investigate the root cause of stability issues - This should be the highest priority task

2. Establish stricter version control and testing processes - Any updates should undergo stability testing

3. Consider offering both "stable" and "experimental" versions - Let users choose for themselves

4. Strengthen communication with users - Proactively explain issues and improvement plans
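As a rough idea of what the stability testing in the second suggestion could look like, here is a minimal release-gate sketch. The function names and the 5-point threshold are hypothetical choices of mine, and the population standard deviation is again an assumption; nothing here reflects the Doubao team's actual process.

```python
import statistics

def stability_score(scores):
    """max(0, 100 - 2 * stddev), assuming a population standard deviation."""
    return max(0.0, 100.0 - 2.0 * statistics.pstdev(scores))

def passes_stability_gate(baseline_stability, candidate_stability, max_drop=5.0):
    """Return True if the candidate's stability has not fallen by more than max_drop points.

    max_drop is a hypothetical threshold; a real team would tune it.
    """
    return (baseline_stability - candidate_stability) <= max_drop

if __name__ == "__main__":
    # Made-up scores from repeated runs of the same prompts on two builds.
    baseline = stability_score([70, 72, 68, 71])
    candidate = stability_score([40, 95, 60, 85])
    print("release allowed:", passes_stability_gate(baseline, candidate))
    # The drop reported in this evaluation, 54.5 to 34.7, checked against the same gate.
    print("reported drop allowed:", passes_stability_gate(54.5, 34.7))
```

The point of a gate like this is that a build with better headline scores still gets held back if its run-to-run spread has widened too much.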

"In the AI era, stability trumps everything. A stable model scoring 90 points is far superior to one that swings between 60 and 100 points. Because once trust is lost, it's very difficult to rebuild."

Doubao Pro's performance this time sounds an alarm for the entire industry: while pursuing performance improvements, never neglect the fundamental skill of stability. After all, users don't need occasional brilliance—they need consistent reliability.

Data source: YZ Index | Run #37
