豆包Pro showed a significant anomaly in today's Smoke test: its Material Constraint score dropped from 95 yesterday to 79.8, a single-day decline of 15.2 points, and its overall main-ranking score fell from 97.75 to 90.91. A swing of this size is uncommon even for a quick test of only 10 questions per day and deserves special attention.
Source of Volatility: Question Sampling or Capability Degradation
The Smoke test randomly selects 2 questions each day to test Material Constraint. With a sample this small, some day-to-day fluctuation is normal, but a drop of 15.2 points exceeds the historical average fluctuation range. Yesterday's Material Constraint score of 95 reflected strong citation accuracy and resistance to hallucination. Today's 79.8 indicates that, on tasks with provided materials, the model more often failed to answer from the given material or extrapolated beyond it.
Another possibility is short-term degradation of the model itself. ByteDance has recently put the 豆包 series through multiple rounds of iteration focused on multimodal and long-text capabilities. If adjustments to the underlying alignment strategy lowered the priority given to material adherence, the effect could show up as a drop in Material Constraint (grounding) scores within a short period.
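To make the small-sample point concrete, here is a minimal sketch of how much a two-question mean can move between days. It rests on assumptions that the source does not confirm: that the dimension score is a simple mean of per-question scores on a 0-100 scale, and that questions are drawn independently each day. The function names are hypothetical and are not part of the YZ Index.

```python
import math


def day_to_day_sd(per_question_sd: float, n_questions: int) -> float:
    """SD of the difference between two independent daily means.

    Assumes each daily score is the mean of n_questions independent
    per-question scores sharing the same standard deviation.
    """
    return per_question_sd * math.sqrt(2.0 / n_questions)


def implied_total_drop(mean_drop: float, n_questions: int) -> float:
    """Total per-question score loss needed to move the mean by mean_drop."""
    return mean_drop * n_questions


observed_drop = round(95 - 79.8, 1)  # 15.2 points, as reported above

# With n = 2 the day-to-day SD equals the per-question SD, so a 15.2-point
# swing is a one-sigma event whenever per-question scores vary by about
# 15 points. With n = 10 the same per-question spread gives a much
# smaller typical swing.
print(day_to_day_sd(observed_drop, 2))
print(day_to_day_sd(observed_drop, 10))
print(implied_total_drop(observed_drop, 2))
```

Under these assumptions, a single weak answer on one of the two sampled questions is enough to produce a drop of this size, which is why a one-day reading carries little weight by itself.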
Supporting Data from Side Dimensions
Notably, in the same test, Engineering Judgment rose from 50 to 66.7 points, and Task Expression rose from 30 to 50 points. The improvement in these two side dimensions suggests that the model's reasoning chains and organization of expression have not declined overall. Code Execution held at a perfect score of 100, which makes a large-scale capability collapse unlikely.
Overall, the drop is more likely sampling randomness amplified by the small question set than sustained degradation. However, if similar Material Constraint fluctuations occur on two consecutive days, closer scrutiny is warranted.
Whether Special Attention Is Needed
A single day of data is not enough to conclude that the model has entered a degradation path. It is advisable to track the same dimension over 3-5 consecutive days. If Material Constraint stays below 85 points and its standard deviation widens, an in-depth retest should be initiated. In the short term, users who rely on 豆包Pro for question answering over provided materials may want to add a manual verification step.
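The monitoring rule above can be sketched as a small check. This is one possible reading, not the YZ Index's actual procedure: "standard deviation expands" is interpreted here as the sample standard deviation of the recent window exceeding that of an earlier baseline window, and the function name, parameters, and all scores other than 79.8 are hypothetical.

```python
from statistics import stdev


def needs_retest(recent, baseline, threshold=85.0, min_days=3):
    """Return True when the recent window warrants an in-depth retest.

    recent:   Material Constraint scores for the latest consecutive days
    baseline: scores from an earlier reference window (an assumption;
              the source does not say what the spread is compared against)
    """
    if len(recent) < min_days or len(baseline) < 2:
        return False  # not enough data to judge a trend
    stays_low = all(score < threshold for score in recent)
    sd_expanded = stdev(recent) > stdev(baseline)
    return stays_low and sd_expanded


# Hypothetical scores for illustration only; just 79.8 comes from the report.
print(needs_retest([79.8], [95.0, 94.0, 96.0]))              # one day: no decision
print(needs_retest([79.8, 84.0, 70.0], [95.0, 94.0, 96.0]))  # low and more volatile
```

Requiring both conditions keeps a single bad draw from triggering a retest while still catching a sustained, increasingly erratic decline.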
A single-day 15-point fluctuation does not amount to capability collapse, but fluctuations on consecutive days are a warning sign.
Data source: YZ Index | Run #123
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article.