In the June 2026 Smoke evaluation of 11 models by the YZ Index, Qwen3 Max scored 68.80 in material constraint today, down 26.7 points from yesterday's 95.50, while its code execution score rose to 100.00.
Single-Day Volatility Magnitude and Dimension Breakdown
The Smoke evaluation uses only 10 questions per day, with 2 questions per dimension, so the single-day standard deviation is naturally large. Qwen3 Max's code execution rose 31.2 points, task expression 25 points, and engineering judgment 18.7 points. These gains more than offset the material constraint decline, so the main leaderboard score still rose a net 5.1 points to 85.96. The integrity rating remains a pass, with no threshold triggered.
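As a quick arithmetic check, the previous-day values can be backed out of the quoted deltas. This is a minimal sketch: the dictionary keys and the helper name `previous_scores` are illustrative and are not part of any YZ Index data format.

```python
# Today's scores and day-over-day deltas, as quoted in the article.
# Key names are hypothetical labels chosen for this example.
today = {
    "code_execution": 100.00,
    "material_constraint": 68.80,
    "main_leaderboard": 85.96,
}
delta = {
    "code_execution": 31.2,
    "material_constraint": -26.7,
    "main_leaderboard": 5.1,
}


def previous_scores(today, delta):
    """Return yesterday's score for each entry: today minus the change."""
    return {name: round(today[name] - delta[name], 2) for name in today}


if __name__ == "__main__":
    for name, score in previous_scores(today, delta).items():
        print(f"{name}: {score:.2f}")
```

Under these figures, yesterday's code execution score works out to 68.80 and yesterday's main leaderboard score to 80.86.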
The material constraint dimension dropped 26.7 points on the strength of just 2 questions, which indicates that at least one question triggered a clear material boundary violation or format error. The full score in code execution shows that the model strictly followed instructions and produced correct code on a different pair of questions.
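To see why two questions make the dimension so sensitive, the sketch below assumes the dimension score is the simple mean of two per-question scores on a 0-100 scale. The article does not state the actual scoring formula, so both helper functions are hypothetical.

```python
def question_total(dimension_score, n_questions=2):
    """Sum of per-question scores implied by a mean-based dimension score.

    Assumption: the dimension score is a plain average of its questions.
    """
    return dimension_score * n_questions


def other_question_score(dimension_score, known_score):
    """For a two-question dimension, infer one score from the other."""
    return question_total(dimension_score, 2) - known_score


if __name__ == "__main__":
    drop_in_total = question_total(95.50) - question_total(68.80)
    print(f"Combined per-question drop: {drop_in_total:.1f} points")
    # If one question still scored a full 100 today, the other scored:
    print(f"Other question: {other_question_score(68.80, 100.0):.1f}")
```

Under this assumption, a 26.7-point fall in the average corresponds to 53.4 points lost across the two questions, which a single badly failed question is enough to produce.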
Random Variance or True Degradation?
If the drop comes down to the luck of the question draw, the probability of hitting a high-difficulty constraint violation across the 2 material constraint questions is about 30%-40%, which is within the normal range. Confirming true degradation would require persistently low scores in the same dimension over consecutive days, or repeated violations on similar questions. The current single-day data is insufficient to confirm degradation.
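The 30%-40% figure can be related to a per-question violation rate with a short calculation. The sketch below assumes the two questions are independent and share the same violation probability `p`; the article does not say how its estimate was derived, so this is one plausible reading rather than the Index's method.

```python
def p_at_least_one(p, n=2):
    """Probability that at least one of n independent questions is violated."""
    return 1.0 - (1.0 - p) ** n


def per_question_rate(p_any, n=2):
    """Invert p_at_least_one: the per-question rate implied by p_any."""
    return 1.0 - (1.0 - p_any) ** (1.0 / n)


if __name__ == "__main__":
    for p_any in (0.30, 0.40):
        rate = per_question_rate(p_any)
        print(f"P(any violation) = {p_any:.0%} -> per-question rate = {rate:.1%}")
```

Under the independence assumption, a 30%-40% chance of at least one violation in two questions corresponds to a per-question violation rate of roughly 16% to 23%.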
The rise in the main leaderboard score indicates no systemic decline in the model's overall output ability. The simultaneous improvement in engineering judgment and task expression, both sidebar indicators, also suggests that the model has held steady or slightly improved in instruction following and structured output.
Should Continuous Monitoring Be Required?
A single-day material constraint decline of 26.7 points falls within the common fluctuation range of Smoke testing and does not warrant an immediate alert. We recommend observing scores in the same dimension over three consecutive evaluation days. If material constraint stays below 75 points for two consecutive days and the standard deviation remains higher than the current day's level, an in-depth retest should be initiated.
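The retest rule above can be written as a small check. This is a sketch under stated assumptions: the function name `needs_retest`, its parameters, and the choice to require both conditions on the same days are illustrative, and the per-day standard deviations are supplied by the caller because the article does not define how they are computed.

```python
def needs_retest(daily_scores, daily_stdevs, baseline_stdev,
                 threshold=75.0, run_length=2):
    """Return True if the retest rule fires within the observation window.

    A day counts toward the streak when its score is below `threshold`
    and its standard deviation exceeds `baseline_stdev` (the level on
    the day of the drop). The rule fires after `run_length` such days
    in a row.
    """
    if len(daily_scores) != len(daily_stdevs):
        raise ValueError("scores and stdevs must cover the same days")

    streak = 0
    for score, stdev in zip(daily_scores, daily_stdevs):
        if score < threshold and stdev > baseline_stdev:
            streak += 1
            if streak >= run_length:
                return True
        else:
            streak = 0  # a normal day resets the count
    return False


if __name__ == "__main__":
    # Hypothetical three-day window; these numbers are not from the article.
    print(needs_retest([70.0, 72.0, 90.0], [30.0, 31.0, 5.0], baseline_stdev=25.0))
```

Resetting the streak on any normal day keeps the rule aligned with the "two consecutive days" wording, so two low days separated by a recovery do not trigger a retest.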
Current data does not support the conclusion of "model degradation." Qwen3 Max remains in the upper-middle range of the main leaderboard overall.
A 26.7-point drop in a single Smoke test is more likely the luck of the question draw than a collapse of the model itself.
Data source: YZ Index | Run #191
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article