豆包 Pro Smoke Evaluation Main Ranking Drops 13.8 Points, Code Execution Falls from 100 to 75

In the June 2026 YZ Index real-world evaluation of 11 models, 豆包 Pro's main ranking score dropped from 98.61 points yesterday to 84.77 points today, a decrease of 13.8 points.

Core Dimension Breakdown

Code execution fell from 100.00 to 75.00 points, a drop of 25 points, and accounts for nearly all of the main ranking decline. Material constraint slipped from 96.90 to 96.70 points, a drop of 0.2 points. Among the side dimensions, engineering judgment fell from 97.20 to 89.60 points (down 7.6) and task expression fell from 100.00 to 99.40 points (down 0.6). The integrity rating remained "pass".

The Smoke evaluation uses only 2 questions per dimension per day, so the daily sample is extremely small. The 25-point drop in code execution is therefore very likely due to a change in the difficulty or type of the questions drawn, rather than an overall degradation of model capability. Material constraint held at 96.70 points, which suggests no systematic change in the model's ability to adhere to given materials.
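To make the small-sample effect concrete, here is a minimal sketch. It assumes, hypothetically, that a dimension score is the plain average of per-question scores on a 0-100 scale; the source does not say how YZ Index aggregates questions, and the function name and per-question values are illustrative only.

```python
from statistics import mean

def dimension_score(question_scores):
    """Average per-question scores (assumed aggregation, not confirmed by the source)."""
    return mean(question_scores)

# With only 2 questions, each question carries half of the dimension score.
yesterday = dimension_score([100.0, 100.0])  # both questions fully solved
today = dimension_score([100.0, 50.0])       # one question half-solved (hypothetical)

swing = yesterday - today
print(f"yesterday={yesterday:.2f} today={today:.2f} swing={swing:.2f}")

# How far the dimension score moves if a single question goes from 100 to 0,
# for different numbers of questions per dimension.
for n_questions in (2, 10, 50):
    print(f"{n_questions} questions: {100 / n_questions:.1f} points")
```

Under this assumption, losing half the credit on one of two questions is enough to reproduce the observed move from 100 to 75, whereas the same slip on a 50-question set would move the score by only 1 point.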

Assessing the Nature of the Volatility

The two side dimensions, engineering judgment and task expression, declined by 7.6 and 0.6 points respectively, but the main ranking is calculated only from the two auditable dimensions, code execution and material constraint. The 13.8-point drop in the main ranking is therefore almost entirely determined by code execution alone. This matches the typical behavior of small-sample rapid testing: individual high-difficulty or edge-case questions can cause large score swings.
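The published figures are consistent with a weighted average of the two main dimensions. The sketch below assumes weights of 0.55 for code execution and 0.45 for material constraint. These weights are inferred from the scores quoted in this report and are not stated by YZ Index, so treat them as a hypothesis. Under that assumption, the drop can be split into each dimension's contribution.

```python
# Assumed weights (inferred from the reported scores, not published by YZ Index).
W_CODE = 0.55
W_MATERIAL = 0.45

def main_score(code_execution, material_constraint):
    """Hypothetical main ranking score: weighted average of the two main dimensions."""
    return W_CODE * code_execution + W_MATERIAL * material_constraint

yesterday = main_score(100.00, 96.90)  # reported main score: 98.61
today = main_score(75.00, 96.70)       # reported main score: 84.77

# Points of the main-score drop attributable to each dimension.
code_contribution = W_CODE * (100.00 - 75.00)
material_contribution = W_MATERIAL * (96.90 - 96.70)

print(f"yesterday ~ {yesterday:.3f}, today ~ {today:.3f}")
print(f"drop from code execution:      {code_contribution:.2f}")
print(f"drop from material constraint: {material_contribution:.2f}")
```

With these assumed weights, code execution explains about 13.75 of the roughly 13.84 points lost, and material constraint about 0.09.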

Genuine degradation would typically affect both main ranking dimensions, material constraint and code execution, at the same time. Today, however, material constraint dropped only 0.2 points, which suggests the model's underlying capabilities have not deteriorated. The 25-point drop in code execution looks more like a random shock from the question draw.

Whether Continuous Monitoring Is Needed

A single day of fluctuation in Smoke data is within the normal range, and there is no immediate need to conclude that model capability has degraded. The recommendation is to keep observing the same dimensions over the next 3–5 days. If code execution stays below 85 points and material constraint trends lower at the same time, an in-depth evaluation should be considered. The current data shows a single anomalous draw and does not signal a model stability risk.
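The monitoring rule above can be sketched as a simple check. The function name, the 3-day window, and the definition of "trending lower" (each day strictly below the previous one) are assumptions made for illustration; the source only gives the 85-point threshold and a 3–5 day observation period.

```python
def needs_deep_evaluation(code_scores, material_scores, threshold=85.0, window=3):
    """Return True if the last `window` days show sustained weakness.

    Both conditions must hold:
    - code execution was below `threshold` on every day in the window
    - material constraint fell strictly from day to day within the window
    """
    if len(code_scores) < window or len(material_scores) < window:
        return False  # not enough history to judge
    recent_code = code_scores[-window:]
    recent_material = material_scores[-window:]
    code_low = all(score < threshold for score in recent_code)
    material_falling = all(
        later < earlier
        for earlier, later in zip(recent_material, recent_material[1:])
    )
    return code_low and material_falling

# The two days reported here: too little history, so no escalation.
print(needs_deep_evaluation([100.00, 75.00], [96.90, 96.70]))

# Hypothetical continuation: three low days with material constraint sliding.
print(needs_deep_evaluation([75.0, 80.0, 70.0], [96.7, 95.9, 94.8]))
```

Requiring both conditions mirrors the reasoning in the text: a low code execution score on its own is treated as draw noise unless material constraint weakens alongside it.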

豆包 Pro appears sensitive to specific question types in the code execution dimension, and small-sample rapid testing amplifies that sensitivity. The main ranking score of 84.77 points still exceeds the baseline of most models, and the model's core capabilities remain intact.

A 25-point drop from a single draw does not equate to model degradation; only three consecutive days of low scores would warrant real concern.

Data source: YZ Index (赢政指数) | Run #203