Doubao Pro's main index in today's Smoke evaluation fell from 96.06 to 77.64, a single-day decline of 18.4 points. The code execution dimension dropped sharply from 97.50 to 66.70, a fall of 30.8 points, while material constraints decreased by only 3.3 points. Swings of this size are uncommon in the daily 10-question quick test.
Small-sample noise or real capability fluctuation?
Smoke evaluation draws only 2 questions per dimension each day, so the standard deviation of any single day's score is inherently large. The 30.8-point drop in code execution is very likely explained by random variation in the difficulty of the sampled questions. For example, if today's two code questions involve complex multi-step reasoning or rarely used API calls, a single wrong step by the model is enough to produce a low score.
However, the integrity rating also moved from pass to warn, which luck alone cannot fully explain. A warn rating usually indicates audit-trail issues in answer consistency or format compliance, and it warrants continued monitoring.
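The small-sample point above can be made concrete with the standard error of a mean. The sketch below assumes that a dimension score is the average of independently scored questions and that a single question's score has a standard deviation of 30 points. Both are illustrative assumptions, not figures published for Smoke, and the function name is hypothetical.

```python
import math


def standard_error(per_question_sd: float, n_questions: int) -> float:
    """Standard error of an average over n independent questions."""
    if n_questions < 1:
        raise ValueError("n_questions must be at least 1")
    return per_question_sd / math.sqrt(n_questions)


# Hypothetical per-question standard deviation, chosen only for illustration.
ASSUMED_SD = 30.0

for n in (2, 10, 50):
    print(f"{n:>2} questions -> standard error {standard_error(ASSUMED_SD, n):.1f} points")
```

Under these assumptions, a two-question average carries a standard error of about 21 points, so a swing of 30 points in one dimension is not far outside ordinary day-to-day noise.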
Recent industry developments and model iteration background
ByteDance recently positioned Doubao Pro as its flagship enterprise model, with a focus on strengthening code and tool-calling capabilities. Internal benchmarks released last week show that there is still room for improvement on code completion tasks, though this has not yet shown up in public evaluations. Read alongside the Smoke results, the sharp drop in code execution may reflect insufficient robustness of the latest version in specific scenarios.
Meanwhile, on the side index, engineering judgment rose from 30.00 to 58.40 and task expression rose from 10.00 to 30.00, indicating that the model is still improving on non-code tasks. This is consistent with the decline being concentrated in the single dimension of code execution rather than reflecting a collapse of overall capability.
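For readers who want to check the arithmetic, the before and after scores quoted in this article can be tabulated in a few lines. This is a minimal sketch: the dictionary layout and names are illustrative, and material constraints is omitted because only its change (3.3 points) is reported, not its raw scores.

```python
# (previous score, today's score) as quoted in this article.
scores = {
    "main index": (96.06, 77.64),
    "code execution": (97.50, 66.70),
    "engineering judgment": (30.00, 58.40),
    "task expression": (10.00, 30.00),
}

# Positive values are gains, negative values are losses.
deltas = {
    name: round(after - before, 2)
    for name, (before, after) in scores.items()
}

for name, change in deltas.items():
    print(f"{name:<22}{change:+.2f}")
```

The two declines (main index and code execution) sit next to two gains of 20 points or more, which is the pattern the paragraph above describes.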
Does this need close attention?
A single day of Smoke data is not statistically significant on its own. We recommend tracking the same dimension over the next 3-5 days. Only if code execution stays below 75 points for two consecutive days, taken together with the stability dimension (which currently shows high volatility), should this be treated as a genuine degradation signal. For now, it is more likely a short-term effect of sampling combined with version fine-tuning.
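The tracking rule above can be sketched as a small check. The 75-point threshold and the two-consecutive-day condition come from the text. How the stability dimension should be judged is not specified here, so the sketch takes it as a boolean supplied by the caller, which is an assumption. The function names and the follow-up scores in the example are hypothetical.

```python
from typing import Sequence


def has_consecutive_lows(
    scores: Sequence[float], threshold: float = 75.0, run_length: int = 2
) -> bool:
    """True if `run_length` consecutive daily scores fall below `threshold`."""
    streak = 0
    for score in scores:
        streak = streak + 1 if score < threshold else 0
        if streak >= run_length:
            return True
    return False


def is_degradation_signal(
    code_scores: Sequence[float], stability_flagged: bool
) -> bool:
    """Both conditions must hold: a low streak and a flagged stability dimension.

    `stability_flagged` is an assumed input; the article does not define
    how volatility in the stability dimension is measured.
    """
    return stability_flagged and has_consecutive_lows(code_scores)


# Today's code execution score followed by hypothetical future readings.
print(is_degradation_signal([66.70, 82.0, 71.0], stability_flagged=True))
print(is_degradation_signal([66.70, 70.0, 88.0], stability_flagged=True))
```

In the first series the low days are separated by a recovery, so the rule does not fire. In the second, two low days in a row together with the stability flag meet the stated condition.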
Doubao Pro is still in a phase of rapid iteration, and an anomaly in a single quick test does not amount to a long-term trend. For developers who rely on its coding capabilities, however, the continuous evaluation results over the next two weeks will be a more reliable basis for decisions.
A sharp drop in a quick test often reveals not the model's limit, but our overinterpretation of small-sample fluctuations.
Data source: YZ Index | Run #126
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original article when reposting