Doubao Pro Stability Plunges 19.8 Points: Inconsistent Answers to Same Questions Become Biggest Weakness

Doubao Pro exhibited a thought-provoking phenomenon in this week's Winzheng AI evaluation: while the overall score increased by 16.1 points, the stability dimension bucked the trend and dropped by 19.8 points, from 54.5 to 34.7. This data reveals the severe challenges the model faces in maintaining answer consistency.

The True Meaning of Stability Scoring

It needs to be clarified that the "stability" dimension in the YZ Index does not measure the accuracy rate of answers, but evaluates the model's degree of consistency in responses to the same or similar questions. This volatility is quantified by calculating the standard deviation of multiple answers. The low score of 34.7 indicates that Doubao Pro shows significant answer dispersion in repeated tests.

Data Comparison Reveals the Contrast

The data from this evaluation presents distinct contrast features:

  • Code execution capability surged by 42.4 points, reaching 65.6
  • Cost-effectiveness improved by 17 points to 88, ranking high
  • Material constraint capability grew by 15.1 points
  • Knowledge synthesis capability moderately increased by 10.8 points

This phenomenon of "capability improvement but stability decline" suggests that Doubao Pro may have undergone major model adjustments or strategy optimizations recently.

Possible Technical Reasons

The sharp decline in stability may stem from the following technical factors:

1. Overly Aggressive Temperature Parameter Adjustment
The temperature parameter of AI models controls the randomness of outputs. If Doubao Pro increased the temperature value to enhance creativity and diversity, it would lead to significantly different outputs for the same input. From the substantial improvement in code execution capability, the model may be pursuing more flexible solutions.

2. Changes in Multi-Model Routing Strategy
Modern AI services typically employ multiple sub-models working collaboratively. If Doubao Pro adjusted its internal model routing strategy, allowing different sub-models to handle similar requests, it would produce differences in style and content. While this strategy can improve performance in certain dimensions, it sacrifices consistency.

3. Updates to Training Data or Fine-Tuning Strategies
Considering the significant changes across multiple dimensions, Doubao Pro likely underwent a model version update. New training data or fine-tuning methods may have enhanced specialized capabilities but have not yet achieved balance in output consistency.

Actual Impact on User Experience

The decline in stability has varying impacts on different usage scenarios:

  • Development Scenarios: The 42.4-point increase in code generation capability may offset stability issues, as developers value the quality of solutions over consistency
  • Content Creation: Long-form writing that requires consistent style may be affected
  • Customer Service Applications: Scenarios requiring standardized responses will face challenges, needing additional prompt engineering to constrain outputs

Industry Trends and Technical Trade-offs

Doubao Pro's changes reflect a common dilemma in the AI industry: how to find a balance between model capabilities and output stability. As model scales expand and capabilities enhance, maintaining output predictability becomes increasingly challenging.

From the evaluation data, the Doubao team may have chosen a "capability-first" strategy, sacrificing some stability to achieve breakthroughs in key areas like code execution and material understanding. This choice has its rationality in the current intense AI competition, but in the long term, a better balance point is still needed.

Future Outlook and Suggestions

Based on the current data, the Doubao Pro team needs to focus on the following directions:

Optimize sampling strategies during inference to improve output consistency while maintaining innovation; establish a more comprehensive A/B testing mechanism to fully assess stability impacts before official release; consider providing configurable stability parameters for different usage scenarios.

It is worth noting that despite the sharp decline in stability, Doubao Pro's overall score still increased by 16.1 points, indicating that users may value actual capability improvements more. However, as an important indicator for commercial applications, the low stability score of 34.7 still requires attention. In today's rapidly iterating AI technology landscape, finding the optimal balance between innovation and stability will determine whether Doubao Pro can maintain its advantage in fierce market competition.


Data source: YZ Index | Raw Data