Doubao Pro Material Constraint Plunges 24 Points, Code Execution Soars from 38.4 to 100

In today's Smoke evaluation, Doubao Pro's Material Constraint score dropped from 84.80 to 60.80, a decrease of 24 points; Code Execution rose from 38.40 to 100.00, an increase of 61.6 points, and the main ranking score increased from 59.28 to 82.36.

Extreme Swings in Opposite Directions Point to Question Sampling

The Smoke evaluation includes only 10 questions per day, 2 per dimension. Material Constraint fell 24 points while Code Execution rose 61.6 points, and swings of this size in opposite directions look statistically more like a small-sample lottery than a structural change in model capability within 24 hours. If the Material Constraint draw lands on a case that requires strict adherence to user instructions or refusal of out-of-bounds requests, the score can drop sharply; if the Code Execution draw lands on a simple Python or SQL task, a perfect score is easy to reach.
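To see why two questions per dimension make a daily score so unstable, the standard error of a mean score can be sketched as below. The per-question standard deviation of 30 points is purely an assumption for illustration, since the report does not publish per-question scores, and the questions are assumed to be drawn independently.

```python
import math


def standard_error(per_question_sd: float, n_questions: int) -> float:
    """Standard error of a mean score over n independently drawn questions."""
    return per_question_sd / math.sqrt(n_questions)


# Hypothetical per-question spread; the source does not report this value.
ASSUMED_SD = 30.0

for n in (2, 10, 50):
    print(f"n={n:>2}: standard error ~ {standard_error(ASSUMED_SD, n):.1f} points")
```

Under that assumption, a 2-question dimension score carries a standard error of roughly 21 points, so a day-over-day move of 20 or more points is not surprising from sampling alone.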

Engineering Judgment dropped from 84.50 to 56.50, a decline of 28 points, which is consistent with today's question set differing from yesterday's distribution. Task Expression rose by only 0.5 points and remained relatively stable, suggesting that the model's underlying generation ability has not degraded systematically.
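The day-over-day changes quoted above can be re-derived from the reported scores with a few lines of arithmetic. This is only a consistency check on the figures in this report; the dictionary layout is an illustrative choice, and Task Expression is omitted because only its change (+0.5) is reported, not its two scores.

```python
# Scores as reported in this article: (yesterday, today).
scores = {
    "Material Constraint": (84.80, 60.80),
    "Code Execution": (38.40, 100.00),
    "Engineering Judgment": (84.50, 56.50),
    "Main ranking score": (59.28, 82.36),
}

# Change per item, rounded to two decimals to match the report's precision.
deltas = {
    name: round(today - yesterday, 2)
    for name, (yesterday, today) in scores.items()
}

for name, delta in deltas.items():
    print(f"{name:<22} {delta:+.2f}")
```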

Evidence That Actual Degradation Is Unlikely

Genuine capability degradation typically shows up as simultaneous, sustained declines across multiple dimensions, not as a plunge in one dimension while another soars. Doubao Pro's main ranking score actually rose by 23.08 points today (from 59.28 to 82.36), indicating that the perfect Code Execution score lifts the overall ranking far more than the Material Constraint loss drags it down. The integrity rating remains "pass", with no violation signals triggered.

Under the YZ Index daily Smoke evaluation framework, when a model's scores show a standard deviation above 20 points within a single day, they reflect question randomness more than stable capability. Doubao Pro's combination of a 60.80 Material Constraint score and a 100.00 Code Execution score is a typical high-variance sample.
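A rough way to apply the 20-point threshold is sketched below. The report does not say which values the standard deviation is taken over, so this sketch assumes one possible reading: the spread of day-over-day changes across the dimensions whose changes are reported (the fifth dimension is not reported and is left out). The function name and the use of the population standard deviation are also assumptions.

```python
import statistics
from typing import Sequence

# Day-over-day changes reported in this article (four of five dimensions).
reported_deltas = {
    "Material Constraint": -24.0,
    "Code Execution": 61.6,
    "Engineering Judgment": -28.0,
    "Task Expression": 0.5,
}

HIGH_VARIANCE_THRESHOLD = 20.0  # points, taken from the article's rule of thumb


def is_high_variance(values: Sequence[float],
                     threshold: float = HIGH_VARIANCE_THRESHOLD) -> bool:
    """Return True when the population standard deviation exceeds the threshold."""
    return statistics.pstdev(values) > threshold


spread = statistics.pstdev(reported_deltas.values())
print(f"spread of reported changes: {spread:.1f} points")
print("high variance:", is_high_variance(list(reported_deltas.values())))
```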

Need for Continued Monitoring

A single dramatic fluctuation in a Smoke test is not sufficient evidence of model capability degradation. It is advisable to track the median Material Constraint score over 3–5 consecutive evaluation days; if the dimension stays below 70 points and its standard deviation remains high, an in-depth evaluation should be triggered. The current data shows only statistical noise from the day's question sampling.
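The monitoring rule above could be expressed as a small check like the following. This is a sketch under stated assumptions: "stays below 70" is read as the window median falling below 70, "high" standard deviation reuses the 20-point figure mentioned earlier in this report, and the function name, parameters, and sample scores are hypothetical.

```python
import statistics
from typing import Sequence


def needs_deep_evaluation(daily_scores: Sequence[float],
                          floor: float = 70.0,
                          sd_threshold: float = 20.0,
                          min_days: int = 3) -> bool:
    """Flag a dimension for in-depth evaluation.

    Triggers only when at least `min_days` of scores are available, the
    median is below `floor`, and the population standard deviation is
    still above `sd_threshold`.
    """
    if len(daily_scores) < min_days:
        return False  # a single day (or two) is not enough evidence
    median = statistics.median(daily_scores)
    spread = statistics.pstdev(daily_scores)
    return median < floor and spread > sd_threshold


# Only today's score is known, so no trigger yet.
print(needs_deep_evaluation([60.80]))

# Hypothetical multi-day window, for illustration only.
print(needs_deep_evaluation([60.80, 15.0, 68.0, 62.0]))
```

Keeping the thresholds as parameters makes it easy to tighten or relax the rule once a few more days of real data are available.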

For application scenarios relying on Material Constraint, developers can temporarily add prompt validation or post-processing filters to hedge against the risk of single-day volatility.

The simultaneous occurrence of a 24-point plunge and a 61.6-point surge shows that the real variable in today's Smoke test is the questions, not the model.

Data Source: YZ Index | Run #176