In the YZ Index June 2026 live test of 11 models, Doubao Pro’s Smoke Evaluation main ranking score fell from 85.91 yesterday to 67.32 today, a decline of 18.6 points. The main driver is the code execution dimension, which fell from 83.30 to 44.50.
Data Breakdown: Single Dimension Drives the Decline
The code execution dimension dropped 38.8 points in a single day. Over the same period, the material constraint dimension rose from 89.10 to 95.20, engineering judgment held at 100.00, and task expression rose from 95.60 to 100.00. The main ranking score is a weighted combination of code execution and material constraint only, so the 6.1-point gain in material constraint could not offset the sharp decline in code execution, which pulled the overall score down.
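The published figures are enough to back out what the weights would have to be. The sketch below assumes the main score is a plain weighted average of the two dimensions; the article states neither the formula nor the weights, so both are inferences, and the function names are illustrative.

```python
def implied_weight(main, code_exec, material):
    """Weight on code execution implied by one day's scores, assuming
    main = w * code_exec + (1 - w) * material (an assumed formula)."""
    return (material - main) / (material - code_exec)


def main_score(w, code_exec, material):
    """Weighted average of the two main-ranking dimensions."""
    return w * code_exec + (1 - w) * material


# Yesterday's published scores: main 85.91, code execution 83.30,
# material constraint 89.10.
w = implied_weight(85.91, 83.30, 89.10)

# Apply the same weight to today's dimension scores (44.50 and 95.20).
today = main_score(w, 44.50, 95.20)

print(f"implied code execution weight: {w:.2f}")
print(f"reconstructed main score today: {today:.2f}")
```

Under that assumption the implied weight on code execution is about 0.55, and applying it to today's dimension scores gives roughly 67.3, in line with the published 67.32. This fits a fixed weighting of about 55/45, though it does not prove that is how the index is computed.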
Smoke Evaluation only includes 2 code execution questions and 2 material constraint questions per day, with a very small sample size. A single question score change can cause fluctuations of more than 30 points. The question draws differed between yesterday and today, and Doubao Pro’s performance on today’s two code execution questions differed significantly from yesterday’s.
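To see why two questions make the dimension so jumpy, here is a minimal sketch. It assumes a dimension score is the unweighted mean of per-question scores on a 0 to 100 scale, which the article does not confirm; the 60-point question swing is a made-up illustration, not measured data.

```python
def dimension_shift(n_questions, question_delta):
    """Change in a dimension's mean score when a single question's
    score changes by `question_delta` (assumes an unweighted mean)."""
    return question_delta / n_questions


# Hypothetical: one question scores 60 points lower than its
# counterpart in the previous day's draw.
for n in (2, 10, 50):
    print(n, "questions ->", dimension_shift(n, 60.0), "point shift")
```

With two questions, each one carries half the dimension, so one bad question moves the score by half of its own change. The same swing is diluted as the draw grows.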
True Degradation or Sampling Fluctuation?
The two side dimensions, engineering judgment and task expression, did not decline, and material constraint actually improved, which suggests the model’s overall capability has not systematically degraded. The 38.8-point single-day drop in code execution is larger than ordinary day-to-day variation, but because the sample is only two questions, the effect of one unusually hard question cannot be ruled out.
If the model experienced true capability degradation, it would typically manifest across multiple dimensions simultaneously. Currently, only the code execution dimension is anomalous, while the other dimensions are stable or rising, which is more consistent with a single fluctuation caused by question sampling.
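The cross-dimension check described above can be written out directly from the four pairs of scores reported in this article. The dictionary keys are illustrative labels, not official YZ Index field names.

```python
# (yesterday, today) scores as reported in the article.
scores = {
    "code_execution": (83.30, 44.50),
    "material_constraint": (89.10, 95.20),
    "engineering_judgment": (100.00, 100.00),
    "task_expression": (95.60, 100.00),
}

# Day-over-day change per dimension, rounded to avoid float noise.
deltas = {
    name: round(today - yesterday, 2)
    for name, (yesterday, today) in scores.items()
}

# Dimensions that fell; broad degradation would show several here.
declined = [name for name, delta in deltas.items() if delta < 0]

print(deltas)
print("declined:", declined)
```

Only one of the four dimensions falls, which is the pattern the article reads as sampling noise rather than a broad regression.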
Should We Continue to Monitor?
A single-day fluctuation in Smoke Evaluation does not equate to a permanent decline in model capability. It is recommended to observe the same dimension’s score for 3–5 consecutive days. If the code execution dimension consistently falls below 60 and the standard deviation widens, then assess whether there is a consistency issue. At present, this is only a single anomalous record and does not constitute a signal requiring focused attention.
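One possible way to turn that rule of thumb into a check is sketched below. The function name, the choice of sample standard deviation, and the reading of "widens" as "larger than in an earlier baseline window" are all assumptions. Apart from 83.30 and 44.50, the scores in the usage lines are placeholders, not real YZ Index data.

```python
from statistics import stdev


def needs_review(baseline, recent, floor=60.0, min_days=3):
    """Return True when `recent` covers at least `min_days` days, every
    score in it is below `floor`, and its spread exceeds the baseline's."""
    if len(recent) < min_days or len(baseline) < 2:
        return False  # not enough observations to judge
    all_below = all(score < floor for score in recent)
    spread_wider = stdev(recent) > stdev(baseline)
    return all_below and spread_wider


# Today's situation: a single low reading is not enough to flag.
print(needs_review([83.30, 82.0, 84.0], [44.50]))

# Placeholder follow-up: three low, widely scattered days would flag.
print(needs_review([83.30, 82.0, 84.0], [44.50, 58.0, 30.0]))
```

The early return encodes the article's point: one anomalous day never triggers a review on its own.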
Doubao Pro’s integrity rating remains at pass, and no access threshold warnings have been triggered.
A 38.8-point drop in code execution is more likely the cost of a two-question draw than a collapse of the model itself.
Data source: YZ Index | Run #206
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.