豆包Pro Main Ranking Drops 9.9 Points in Smoke Evaluation as Code Execution Halves from 100 to 50

In the YZ Index June 2026 test of 11 models, the main ranking score of 豆包Pro dropped from 82.36 yesterday to 72.50 today, a decrease of 9.86 points (about 9.9). The main cause is the code execution dimension, which fell from 100.00 to 50.00. The material constraint dimension rose from 60.80 to 100.00 over the same period, but because the main ranking combines only these two dimensions, that gain was not enough to offset the loss.

Direct Impact of Code Execution Halving

Smoke Evaluation includes only 2 code execution questions per day. 豆包Pro's code execution score of 50.00 today means that at least one question's execution result fell short of full marks, in stark contrast to yesterday's 100.00. Material constraint moved in the opposite direction, rising 39.2 points, which shows that the model followed constraints better today. However, the main ranking draws only on code execution and material constraint, and the 39.2-point gain could not offset the 50-point loss in code execution.
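
The arithmetic behind this can be checked with a short sketch. The source does not publish the main ranking formula, so the weighted-mean form below is an assumption, and the function names are hypothetical. Under that assumption, both days' reported scores solve to the same weight of about 0.55 for code execution (0.45 for material constraint), whereas an unweighted mean would give 80.40 and 75.00, which match neither reported main score.

```python
# Scores reported in the article: (code execution, material constraint, main ranking).
YESTERDAY = (100.00, 60.80, 82.36)
TODAY = (50.00, 100.00, 72.50)


def solve_code_weight(code, material, main):
    """Solve main = w * code + (1 - w) * material for w.

    ASSUMPTION: the main ranking is a weighted mean of the two dimensions.
    The source does not state the formula; this is an inference.
    """
    return (main - material) / (code - material)


def main_score(code, material, w_code):
    """Weighted mean of the two dimensions under the assumed formula."""
    return w_code * code + (1.0 - w_code) * material


w_yesterday = solve_code_weight(*YESTERDAY)
w_today = solve_code_weight(*TODAY)
print(round(w_yesterday, 4), round(w_today, 4))

# Reproduce today's main score from the inferred weight.
print(round(main_score(TODAY[0], TODAY[1], w_yesterday), 2))
```

If the inferred weighting holds, code execution carries slightly more than half of the main ranking, which is why a 50-point loss there outweighs a 39.2-point gain in material constraint.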

Question Draw Fluctuation or True Degradation

Smoke Evaluation has 10 questions per day, so some day-to-day score fluctuation from the random question draw is normal. The gap between today's code execution score of 50.00 and yesterday's 100.00 could simply reflect drawing two high-difficulty or poorly matched questions. Engineering judgment rose from 56.50 to 100.00 and task expression rose from 94.00 to 100.00, which suggests that the model has not declined systematically in other capabilities.
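
To illustrate how coarse a two-question dimension is, here is a minimal sketch of the daily score distribution. It rests on assumptions the source does not confirm: each question is scored all-or-nothing, questions are drawn independently, and the model has a fixed per-question pass rate. The 0.9 pass rate is a hypothetical value chosen only for illustration.

```python
from math import comb


def score_distribution(pass_rate, questions=2):
    """Probability of each possible daily score.

    ASSUMPTIONS (not from the source): all-or-nothing scoring per question,
    independent draws, and a constant per-question pass rate.
    """
    dist = {}
    for passed in range(questions + 1):
        prob = (
            comb(questions, passed)
            * pass_rate ** passed
            * (1.0 - pass_rate) ** (questions - passed)
        )
        dist[100.0 * passed / questions] = prob
    return dist


# Hypothetical model that passes 90% of code execution questions.
dist = score_distribution(0.9)
p_half_or_worse = dist[0.0] + dist[50.0]
print(round(p_half_or_worse, 2))
```

Under these assumptions, even a model that passes nine questions in ten would score 50 or lower on roughly one day in five, so a single 50.00 is weak evidence of degradation.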

If code execution remains low for several consecutive days, true model degradation should be considered. With only one day of data, there is not enough evidence to conclude that the model has degraded. The integrity rating remains "pass", and no violation signals were triggered.

Whether Special Attention is Needed

A single-day drop of 9.9 points in the main ranking is not extreme for Smoke Evaluation, but the halving of the code execution dimension is worth noting. We recommend tracking the standard deviation of that dimension's scores over 3-5 consecutive days. If the standard deviation keeps widening, the stability score will come under further pressure. One Smoke Evaluation data point alone does not justify downgrading the overall capability assessment of 豆包Pro.
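
The tracking suggested above could be sketched as follows. The helper names are hypothetical, and using the sample standard deviation over a sliding window is one reasonable choice, not necessarily how YZ Index computes its stability score. Only the two code execution scores reported so far are used as input; later days would be appended as they arrive.

```python
from statistics import stdev


def rolling_std(scores, window=3):
    """Sample standard deviation over each run of `window` consecutive days."""
    return [
        stdev(scores[i - window + 1 : i + 1])
        for i in range(window - 1, len(scores))
    ]


def is_widening(stds):
    """True if every rolling standard deviation exceeds the previous one."""
    return len(stds) >= 2 and all(b > a for a, b in zip(stds, stds[1:]))


# The only two code execution scores reported so far (yesterday, today).
code_execution = [100.00, 50.00]

# With two days of data, only a 2-day window can be computed.
print(rolling_std(code_execution, window=2))

# A 3-5 day window needs more data; until then the result is empty.
print(rolling_std(code_execution, window=3))
```

With fewer days than the window size the function returns an empty list, which matches the point that one or two days of data cannot establish a trend.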

Scoring 50 in code execution and 100 in material constraint on the same day shows how Smoke Evaluation, with its small daily question set, amplifies extreme swings in a single dimension.

The main ranking score of 72.50 is still higher than that of some comparable models, but only continued tracking of the code execution score can reliably show whether the model has entered an adjustment period.


Data source: YZ Index | Run #182