Doubao Pro Smoke Evaluation Main Ranking Plunges 18.6 Points, Code Execution Drops 38.8 in a Single Day

Jul 1, 2026 | Source: Winzheng Index

Tags: Doubao Pro, Code Execution, Smoke quick test, single-day fluctuation, model consistency

In the YZ Index June 2026 live test of 11 models, Doubao Pro’s Smoke Evaluation main ranking dropped from 85.91 yesterday to 67.32 today, a decline of 18.6 points. The core reason is that the code execution dimension fell from 83.30 to 44.50.

Data Breakdown: Single Dimension Drives the Decline

Code execution dropped 38.8 points in a single day. Over the same period, material constraint rose from 89.10 to 95.20, engineering judgment held at 100.00, and task expression rose from 95.60 to 100.00. The main ranking score is weighted from code execution and material constraint only, so the sharp decline in code execution directly dragged down the overall score.
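The article does not state the actual weights. As a rough consistency check, the sketch below assumes the main score is a two-term weighted average of code execution and material constraint (an assumption, not something the source confirms), backs out the code execution weight from yesterday's figures, and tests it against today's. The function name is illustrative.

```python
# Sketch only: assumes main = w * code_exec + (1 - w) * material.
# The source says only that these two dimensions carry the weights;
# the exact formula and the function name here are assumptions.

def implied_code_weight(main: float, code_exec: float, material: float) -> float:
    """Solve main = w * code_exec + (1 - w) * material for w."""
    return (main - material) / (code_exec - material)

# Yesterday's reported scores
w = implied_code_weight(main=85.91, code_exec=83.30, material=89.10)

# Predict today's main score from today's reported dimension scores
predicted_today = w * 44.50 + (1 - w) * 95.20

print(f"implied code execution weight: {w:.2f}")
print(f"predicted main score today: {predicted_today:.3f} (reported: 67.32)")
```

Under that assumption the implied weight comes out near 0.55, and the prediction lands within rounding of the reported 67.32. A roughly 55/45 split is therefore consistent with both days, although two data points cannot prove it.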

Smoke Evaluation only includes 2 code execution questions and 2 material constraint questions per day, with a very small sample size. A single question score change can cause fluctuations of more than 30 points. The question draws differed between yesterday and today, and Doubao Pro’s performance on today’s two code execution questions differed significantly from yesterday’s.
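To make the sample-size effect concrete, here is a minimal sketch. It assumes a dimension score is the plain mean of per-question scores on a 0 to 100 scale; the source does not describe the scoring formula, so both function names and the formula are illustrative.

```python
# Sketch only: assumes dimension score = unweighted mean of per-question
# scores on a 0-100 scale. This formula is an assumption, not the
# documented scoring method.

def dimension_score(question_scores: list[float]) -> float:
    """Mean of the per-question scores."""
    return sum(question_scores) / len(question_scores)

def shift_from_one_question(n_questions: int, question_delta: float) -> float:
    """Change in the dimension mean when one question's score moves by question_delta."""
    return question_delta / n_questions

# With two questions, one question swinging 60 points moves the dimension by 30.
print(shift_from_one_question(2, 60.0))

# Hypothetical pair: one full pass and one complete failure.
print(dimension_score([100.0, 0.0]))

# Combined per-question drop implied by the reported 83.30 -> 44.50 move.
combined_drop = (83.30 - 44.50) * 2
print(round(combined_drop, 1))
```

Under this assumption, the 38.8-point move corresponds to about 77.6 points lost across the two questions combined, which a single badly failed question plus a modest dip on the other could produce.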

True Degradation or Sampling Fluctuation?

The two side dimensions, engineering judgment and task expression, did not decline, and material constraint actually improved, which suggests the model’s overall capability has not systematically degraded. The 38.8-point single-day drop in code execution far exceeds the normal sampling fluctuation range, but because the sample is only two questions, the chance effect of an unusually hard draw cannot be ruled out.

If the model experienced true capability degradation, it would typically manifest across multiple dimensions simultaneously. Currently, only the code execution dimension is anomalous, while the other dimensions are stable or rising, which is more consistent with a single fluctuation caused by question sampling.
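The cross-dimension argument can be checked directly against the four reported score pairs. The snippet below is a small illustration using only the figures quoted in this article; the dictionary keys are illustrative labels.

```python
# Scores reported in this article as (yesterday, today).
# Keys are illustrative labels for the four dimensions.
scores = {
    "code_execution": (83.30, 44.50),
    "material_constraint": (89.10, 95.20),
    "engineering_judgment": (100.00, 100.00),
    "task_expression": (95.60, 100.00),
}

# Day-over-day change per dimension, rounded to hide float noise.
deltas = {
    name: round(today - yesterday, 2)
    for name, (yesterday, today) in scores.items()
}

# Dimensions that actually fell.
declined = [name for name, delta in deltas.items() if delta < 0]

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
print("declined:", declined)
```

Only one of the four dimensions shows a negative change, which is the pattern the paragraph above describes as more typical of a question-sampling effect than of broad degradation.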

Should We Continue to Monitor?

A single-day fluctuation in Smoke Evaluation does not equate to a permanent decline in model capability. We recommend tracking the same dimension’s score for 3–5 consecutive days. If code execution stays below 60 throughout and its standard deviation widens, the next step is to assess whether the model has a consistency issue. At present, this is a single anomalous record and does not warrant focused attention.
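The monitoring rule above can be sketched as a small check. The article does not say what the standard deviation should widen relative to, so this sketch assumes a comparison against an earlier baseline window. The function name, the baseline idea, and all sample scores below are hypothetical.

```python
from statistics import stdev

def needs_consistency_review(recent, baseline, floor=60.0):
    """Flag when 3-5 consecutive daily scores all sit below `floor`
    and their spread is wider than the baseline window's spread.

    Comparing against a baseline window is an assumption; the source
    only says the standard deviation "widens".
    """
    if not 3 <= len(recent) <= 5:
        raise ValueError("expected 3 to 5 consecutive daily scores")
    consistently_low = all(score < floor for score in recent)
    spread_widened = stdev(recent) > stdev(baseline)
    return consistently_low and spread_widened

# Hypothetical score series, for illustration only.
baseline = [83.3, 80.0, 86.0]          # earlier, stable days
recovering = [44.5, 82.0, 85.0]        # one bad day, then back to normal
persistently_low = [44.5, 58.0, 30.0]  # low and erratic

print(needs_consistency_review(recovering, baseline))
print(needs_consistency_review(persistently_low, baseline))
```

Requiring both conditions keeps a single bad draw, like the one reported here, from triggering a review on its own.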

Doubao Pro’s integrity rating remains at pass, and no access threshold warnings have been triggered.

A 38.8-point drop in code execution is more likely the cost of a two-question draw than a collapse of the model itself.

Data source: YZ Index | Run #206
