In the June 2026 YZ Index evaluation of 11 models, Qwen3 Max's code execution score fell from 100.00 yesterday to 50.00 today, a single-day decline of 50 points.
What Actually Makes Up the Slight Drop in the Main Ranking
The main ranking score dropped only from 74.00 to 72.50, a decline of 1.5 points. The main ranking is calculated solely as the average of two dimensions, code execution and material adherence, and material adherence rose from 95.70 to 100.00, partly offsetting the sharp decline in code execution.
Engineering judgment rose from 48.40 to 63.20, task expression rose from 68.80 to 96.30, and the integrity rating changed from fail to pass. These improvements in side-ranking metrics were not included in the main ranking.
Is a 50-Point Fluctuation Draw Variance or Degradation?
The Smoke test includes only 10 questions per day, 2 per dimension. If the day's code execution questions happen to concentrate on complex multi-step reasoning or edge cases, a single-day swing of around 50 points is within the normal range. Yesterday's 100.00 means all questions were answered correctly, while today's 50.00 may mean only half were completed.
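To make the sampling argument concrete, here is a minimal sketch of how coarse a 2-question dimension score can be. It assumes each question is graded pass/fail and that questions are independent, neither of which the YZ Index is confirmed to do. The function name `score_distribution` and the 0.8 pass rate are hypothetical values chosen only for illustration.

```python
from math import comb


def score_distribution(n_questions: int, pass_rate: float) -> dict[float, float]:
    """Map each reachable daily score to its probability.

    Assumption (not from the YZ Index): every question is pass/fail and
    independent, so the number of passes follows a binomial distribution.
    """
    dist = {}
    for passed in range(n_questions + 1):
        score = 100.0 * passed / n_questions
        prob = (
            comb(n_questions, passed)
            * pass_rate**passed
            * (1 - pass_rate) ** (n_questions - passed)
        )
        dist[score] = prob
    return dist


# Hypothetical model that solves 80% of code execution questions.
dist = score_distribution(n_questions=2, pass_rate=0.8)
for score, prob in dist.items():
    print(f"{score:6.2f} -> {prob:.2f}")
```

Under these assumptions only three scores are reachable (0, 50, and 100), so any change between two days is a multiple of 50 points. Even a model with an unchanged 80% pass rate would land on 50.00 on roughly a third of days.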
Material adherence reached a perfect score on the same day, indicating no systematic decline in the model's constraint-following ability. The opposite movements in the two core dimensions are more consistent with random fluctuation caused by the question draw than with an overall degradation of the model's capabilities.
Is Continued Monitoring Needed?
Single-day swings of around 50 points have occurred multiple times in Smoke quick tests. If the code execution score remains below 70 points for the following three days, a change in the model's true capability should be considered. For now, with only one day of data, degradation cannot be confirmed.
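The follow-up rule above can be sketched as a small check. This is only an illustration: the function name `needs_review` and the sample score lists are hypothetical, and only the 70-point threshold and three-day window come from the text.

```python
def needs_review(following_scores: list[float],
                 threshold: float = 70.0,
                 days: int = 3) -> bool:
    """Return True if the first `days` scores after the drop are all below `threshold`.

    Returns False while fewer than `days` follow-up scores are available,
    since the rule cannot be evaluated yet.
    """
    if len(following_scores) < days:
        return False
    return all(score < threshold for score in following_scores[:days])


# Hypothetical follow-up sequences, not real YZ Index data.
print(needs_review([50.0]))               # False: too few days to judge
print(needs_review([50.0, 100.0, 50.0]))  # False: one day recovered
print(needs_review([50.0, 50.0, 0.0]))    # True: three days below 70
```

Because a 2-question dimension may only be able to score 0, 50, or 100, a 70-point threshold in practice asks whether the model reaches a perfect daily score at least once in the window.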
The integrity rating changed from fail to pass, indicating that the model showed no obvious hallucinations or boundary-crossing responses in this quick test. This contrasts with the sharp drop in the code execution score and further suggests that the fluctuation stems mainly from question difficulty rather than from the model itself.
A single-day swing of around 50 points in Smoke quick tests reflects draw variance more than model degradation.
Data source: YZ Index | Run #195
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article