In the YZ Index's June 2026 test of 11 models, Qwen3 Max's main score fell from 100.00 points yesterday to 80.82 points today, a decrease of 19.18 points.
Core Dimension Breakdown
The code execution dimension dropped from 100.00 to 68.80 points, a decrease of 31.2 points, and accounts for most of the decline in the main score. The material constraint dimension dropped from 100.00 to 95.50 points, a decrease of only 4.5 points. The main score is a weighted combination of code execution and material constraints, so the sharp decline on the execution side pulled the overall score down directly.
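The report does not state the weights used in the main score. As a rough check, the sketch below assumes a two-term weighted average whose weights sum to 1 and solves for the execution weight implied by today's three figures. Under that assumption the implied split is about 0.55 for execution and 0.45 for material constraints; this is an inference from the published numbers, not a documented YZ Index formula.

```python
# Scores reported for today's run.
execution = 68.80
material = 95.50
main_score = 80.82

# ASSUMPTION (not stated in the report):
#   main = w * execution + (1 - w) * material, with weights summing to 1.
w_exec = (material - main_score) / (material - execution)
w_material = 1 - w_exec

# Yesterday both dimensions scored 100.00, so each dimension's contribution
# to the main-score drop is its weight times its own drop.
exec_contrib = w_exec * (100.00 - execution)
material_contrib = w_material * (100.00 - material)

print(f"implied execution weight: {w_exec:.2f}")
print(f"implied material weight:  {w_material:.2f}")
print(f"drop from execution: {exec_contrib:.2f} points")
print(f"drop from material:  {material_contrib:.2f} points")
```

With these implied weights, roughly 17 of the 19.18 lost points trace back to the execution dimension.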
Engineering judgment dropped from 66.70 to 44.50 points (down 22.2), and task expression dropped from 97.50 to 62.50 points (down 35.0). Both are side scores produced by AI-assisted evaluation. They declined significantly, but they are not included in the main score ranking.
Cause Analysis of Fluctuation
The Smoke evaluation has only 10 questions per day, 2 per dimension. With a sample this small, the particular questions drawn on a given day can swing the score widely. The code execution dimension lost 31.2 points in a single day, far more than the 4.5 points lost on material constraints, which suggests that today's draw may have included questions that are especially hard for Qwen3 Max.
If similar declines recur over several consecutive days, real model degradation should be considered. A single day of data, however, is not enough to conclude that capability has declined.
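To give a feel for how noisy a 2-question average can be, here is a minimal sketch using the standard error of a mean. It assumes questions are scored independently and equally weighted; the per-question standard deviation of 30 points is a purely hypothetical value chosen for illustration, not a YZ Index figure.

```python
import math

def standard_error(per_question_sd: float, n_questions: int) -> float:
    """Standard error of the mean score over n independent questions."""
    return per_question_sd / math.sqrt(n_questions)

# HYPOTHETICAL spread of per-question scores, for illustration only.
ASSUMED_SD = 30.0

for n in (2, 10, 50):
    se = standard_error(ASSUMED_SD, n)
    print(f"{n:>2} questions -> standard error ~ {se:.1f} points")
```

Under this assumption, a 2-question dimension has a standard error of about 21 points, so a one-day swing of 31.2 points is not far outside ordinary sampling noise.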
Is Continued Monitoring Needed?
Because the execution drop is much larger than the decline in material constraints, we recommend tracking this dimension closely in the Smoke evaluation over the next 3-5 days. If the execution score stays below 80 points throughout that period, it may reflect stability issues in specific code scenarios.
The integrity rating remains "pass", indicating that the model has not refused to answer or shown obvious boundary-crossing behavior.
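The tracking rule above can be sketched as a small check. The function name, the default of 3 days, and the sample score lists are hypothetical; only the 80-point threshold, the 3-5 day window, and today's 68.80 come from this report.

```python
from typing import Sequence

def sustained_decline(scores: Sequence[float],
                      threshold: float = 80.0,
                      min_days: int = 3) -> bool:
    """Return True if the latest `min_days` daily scores are all below `threshold`."""
    if len(scores) < min_days:
        return False  # not enough history to judge
    return all(score < threshold for score in scores[-min_days:])

# Made-up follow-up scores, starting from today's 68.80.
rebound = [68.80, 100.00, 92.00]
stuck_low = [68.80, 72.00, 75.50]

print(sustained_decline(rebound))    # False: the score recovered
print(sustained_decline(stuck_low))  # True: three days below 80
```

A single low day never triggers the check, which matches the report's position that one run is not evidence of degradation.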
The current data only supports the conclusion of "abnormal single-day execution fluctuation", and it has not yet reached a level that requires adjusting the long-term ranking.
A 31.2-point execution drop in a single Smoke test is more like a lottery draw than a signal of model degradation.
Data source: YZ Index | Run #190
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article