Technical Risks Behind Doubao Pro's Sharp Decline in Stability

Doubao Pro's evaluation data this week shows an anomaly: while the model improved substantially across several dimensions, including programming and knowledge work, its stability score fell from 54.5 to 34.7, a drop of 19.8 points (36.3%). This pattern of "simultaneous progress and regression" merits closer analysis.
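The size of the stability drop can be checked with a short calculation. The sketch below uses only the two scores quoted above; the function name is illustrative.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative means a drop)."""
    if old == 0:
        raise ValueError("old score must be non-zero")
    return (new - old) / old * 100.0


# Stability scores quoted in this article.
drop = percent_change(54.5, 34.7)
print(f"{drop:.1f}%")  # -36.3%
```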

Specific Manifestations of Stability Issues

Analysis of failed test items reveals that Doubao Pro's stability issues are primarily concentrated in three areas:

1. Degradation in Complex Reasoning Abilities

In the classic "frog jumping out of a well" problem, the model gave an incorrect answer: "The frog can jump out of the well on day 4." The correct answer is day 3, because the 3-meter jump made during the daytime of day 3 already brings the frog to the well's rim.

This error on a basic logic problem indicates the model has developed judgment biases when handling problems requiring step-by-step reasoning. More concerning is that such problems typically serve as fundamental capability tests for large language models.
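The step-by-step reasoning this kind of problem requires can be made explicit with a small simulation. The exact parameters of the test item are not given here, so the values below (a 5-meter well, a 3-meter climb each day, a 2-meter slide each night) are assumptions chosen only because they reproduce the "day 3" answer; the function name is illustrative.

```python
def days_to_escape(depth_m: int, climb_m: int, slip_m: int) -> int:
    """Return the day on which the frog first reaches the rim.

    The rim check happens right after the daytime climb, before any
    night-time slide. Skipping that check is the usual source of
    off-by-one answers.
    """
    if climb_m <= 0:
        raise ValueError("climb_m must be positive")
    if climb_m < depth_m and climb_m <= slip_m:
        raise ValueError("the frog never makes net progress")
    height = 0
    day = 0
    while True:
        day += 1
        height += climb_m        # daytime climb
        if height >= depth_m:    # reached the rim: no slide tonight
            return day
        height -= slip_m         # night-time slide


# Assumed parameters, not taken from the benchmark itself.
print(days_to_escape(depth_m=5, climb_m=3, slip_m=2))  # 3
```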

2. Decreased Mathematical Calculation Accuracy

In simple probability calculations, the model frequently made computational errors. For example, in a dice-rolling problem asking for the probability of "at least one 6" in two rolls, the model's result was marked incorrect; the correct answer is 1-(5/6)²=11/36.
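The reference value for the two-roll case can be confirmed in two independent ways: with the complement rule and by enumerating all 36 outcomes. This is a minimal sketch using exact fractions; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def p_at_least_one_six(rolls: int) -> Fraction:
    """Complement rule: 1 minus the probability of rolling no 6 at all."""
    return 1 - Fraction(5, 6) ** rolls


def p_by_enumeration(rolls: int) -> Fraction:
    """Brute-force check: count outcomes that contain at least one 6."""
    outcomes = list(product(range(1, 7), repeat=rolls))
    hits = sum(1 for outcome in outcomes if 6 in outcome)
    return Fraction(hits, len(outcomes))


print(p_at_least_one_six(2))  # 11/36
print(p_by_enumeration(2))    # 11/36
```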

3. Code Generation Consistency Issues

Although the programming dimension score improved by 42.4 points overall, the model was markedly unstable on certain code generation tasks. For identical requirements, code quality varied considerably across test rounds: some rounds produced high-quality code, while others contained syntax errors or logical flaws.
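One way to quantify this kind of round-to-round variation is to run the same prompt several times and record how many outputs parse and how many pass a fixed check. The sketch below is an assumption about how such a measurement could look, not the benchmark's actual method; the three samples and the `add` task are hypothetical stand-ins for generated code.

```python
import ast


def is_valid_python(source: str) -> bool:
    """True if the source parses as Python (syntax check only)."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def passes_check(source: str) -> bool:
    """True if the sample defines add() and add(2, 3) == 5.

    exec() runs arbitrary code, so real generated samples should only
    be executed inside a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


def pass_rate(samples: list[str], check) -> float:
    """Fraction of samples for which check(sample) is true."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(check(s) for s in samples) / len(samples)


# Hypothetical outputs for the same requirement across three rounds.
rounds = [
    "def add(a, b):\n    return a + b\n",  # correct
    "def add(a, b)\n    return a + b\n",   # syntax error: missing colon
    "def add(a, b):\n    return a - b\n",  # parses, but logic is wrong
]

print(pass_rate(rounds, is_valid_python))  # 2 of 3 rounds parse
print(pass_rate(rounds, passes_check))     # 1 of 3 rounds is correct
```

A large gap between the two rates would point to logical flaws rather than syntax errors as the main source of instability.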

Analysis of Possible Technical Causes

Taken together with the score changes across dimensions, this trade-off may stem from several technical factors:

  • Side effects of model version updates: Doubao Pro may have undergone adjustments to model architecture or parameters, inadvertently affecting the stability of basic reasoning capabilities while optimizing certain abilities (such as programming and long-text processing).
  • Changes in training data distribution: The new version may have adjusted the training data composition, overemphasizing performance improvements in specific domains, leading to decreased generalization performance in basic capabilities.
  • Adjustments to inference optimization strategies: To improve response speed and reduce computational costs (cost-effectiveness score increased by 17 points), more aggressive inference optimization strategies may have been adopted, sacrificing some accuracy.

Impact Assessment and Outlook

The decline in stability has significant implications for Doubao Pro's practical applications. In scenarios requiring high reliability, such as financial calculations, medical diagnosis assistance, and critical code generation, this instability could pose serious risks. Users need to add manual review steps, which partially offsets the efficiency gains from improvements in other dimensions.

Notably, Doubao Pro's overall score still improved by 16.1 points, indicating that its performance in most application scenarios continues to improve. However, the sharp decline in stability, one of the core metrics for AI models, exposes a common dilemma in current model optimization: how to keep overall performance balanced and stable while pursuing gains in specific capabilities.
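The dilemma is easy to miss when only the aggregate score is tracked. As a minimal sketch, a per-dimension gate can flag a release whose overall score rises while one dimension falls sharply. The deltas below are the figures quoted in this article (the stability delta is 34.7 minus 54.5); the 5-point threshold and the function name are assumptions for illustration.

```python
def regressed_dimensions(deltas: dict[str, float], max_drop: float) -> list[str]:
    """Return the dimensions whose score fell by more than max_drop points."""
    return sorted(name for name, delta in deltas.items() if delta < -max_drop)


# Week-over-week score changes quoted in this article.
deltas = {
    "programming": 42.4,
    "cost_effectiveness": 17.0,
    "stability": 34.7 - 54.5,
}
overall_delta = 16.1

# Hypothetical threshold: any drop above 5 points blocks the release.
failed = regressed_dimensions(deltas, max_drop=5.0)
print(overall_delta > 0, failed)  # True ['stability']
```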

We recommend that the Doubao team focus on regression testing for basic reasoning capabilities and establish more comprehensive model update evaluation mechanisms to avoid "robbing Peter to pay Paul" optimization strategies. For users, we suggest employing cross-validation with multiple models for critical tasks to ensure result reliability.
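For the user-side recommendation, cross-validation across models can be as simple as a majority vote with a fallback to manual review. The sketch below assumes answers have already been collected as strings; the model names, the normalization step, and the agreement threshold are all hypothetical.

```python
from collections import Counter
from typing import Optional


def majority_answer(answers: dict[str, str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the answer shared by more than min_agreement of the models.

    Returns None when no answer reaches that share, which signals that
    the task should go to manual review.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(a.strip().lower() for a in answers.values())
    top, votes = counts.most_common(1)[0]
    if votes / len(answers) > min_agreement:
        return top
    return None


# Hypothetical responses to the two-dice question from three models.
responses = {"model_a": "11/36", "model_b": "11/36", "model_c": "1/3"}
print(majority_answer(responses))  # 11/36
```

Agreement between models does not guarantee correctness, so for critical tasks a vote like this complements manual review rather than replacing it.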


Data source: YZ Index | Raw Data