Qwen3 Max Code Execution Plunges 50 Points, Main Ranking Only Drops 1.5 Points

Jun 24, 2026 - Source: Winzheng Index

Qwen3 Max, Code Execution, Smoke Test, Main Ranking Fluctuation, Material Constraints

In the June 2026 YZ Index evaluation of 11 models, Qwen3 Max's code execution score fell from 100.00 yesterday to 50.00 today, a single-day decline of 50 points.

What Makes Up the Small Drop in the Main Ranking

The main ranking score dropped only from 74.00 to 72.50, a decline of 1.5 points. The main ranking is calculated solely as the average of two dimensions, code execution and material adherence, and material adherence rose from 95.70 to 100.00, partly offsetting the sharp decline in code execution.
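The arithmetic can be checked with a short sketch. It assumes the main score is a plain unweighted mean of the two dimensions, as the paragraph above describes; the function name `main_score` is hypothetical. Under that assumption the quoted dimension scores give 97.85 and 75.00 rather than the published 74.00 and 72.50, so the index presumably applies some weighting or scaling that is not spelled out here.

```python
def main_score(code_execution: float, material_adherence: float) -> float:
    """Unweighted mean of the two main-ranking dimensions (assumed formula)."""
    return (code_execution + material_adherence) / 2


# Dimension scores quoted in the article.
yesterday = main_score(100.00, 95.70)
today = main_score(50.00, 100.00)

print(f"plain mean, yesterday: {yesterday:.2f}")
print(f"plain mean, today:     {today:.2f}")
print(f"plain mean, change:    {today - yesterday:+.2f}")
print(f"published change:      {72.50 - 74.00:+.2f}")
```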

Engineering judgment rose from 48.40 to 63.20, task expression rose from 68.80 to 96.30, and the integrity rating changed from fail to pass. These improvements in side-ranking metrics were not included in the main ranking.

Is the 50-Point Swing Draw Variance or Degradation?

The Smoke test includes only 10 questions per day, 2 per dimension. With so few questions, each one carries half of a dimension's score, so if the day's code execution questions happen to land on complex multi-step reasoning or edge cases, a 50-point swing in a single day is within the normal range. Yesterday's 100.00 means every question was answered correctly, while today's 50.00 may mean only half were completed.
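To see why two questions per dimension produce such coarse scores, the sketch below enumerates the possible daily scores and their probabilities. It assumes each question is scored pass/fail, questions are independent, and the model has a fixed per-question pass rate `p`; the value 0.9 is purely illustrative, not a measured figure, and `score_distribution` is a hypothetical helper.

```python
from math import comb


def score_distribution(n_questions: int, p: float) -> dict[float, float]:
    """Map each possible dimension score to its probability (binomial model)."""
    dist = {}
    for k in range(n_questions + 1):
        score = 100.0 * k / n_questions  # k correct answers out of n_questions
        dist[score] = comb(n_questions, k) * p**k * (1 - p) ** (n_questions - k)
    return dist


dist = score_distribution(n_questions=2, p=0.9)  # p is an illustrative assumption
for score, prob in sorted(dist.items()):
    print(f"score {score:6.2f}: probability {prob:.2f}")
```

With two questions the only possible scores are 0, 50, and 100, so a single missed question moves the score by 50 points. Under these assumptions, a model that passes 90% of questions would still score 50 or lower on roughly one day in five.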

Material adherence reached a perfect score on the same day, indicating no systematic decline in the model's constraint-following ability. The opposite movements in the two core dimensions are more consistent with random fluctuation from the question draw than with an overall degradation of the model's capabilities.

Is Continued Monitoring Needed?

Single-day swings of around 50 points have occurred multiple times in Smoke quick tests. If the code execution score stays below 70 for the next three days, a change in the model's true capability should be considered. For now, with only one day of data, degradation cannot be confirmed.
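The follow-up rule can be expressed as a simple check. This is a sketch using the thresholds stated above (below 70 for three consecutive days); the function name `needs_review` and the sample scores are hypothetical.

```python
def needs_review(scores_after_drop: list[float],
                 threshold: float = 70.0,
                 days: int = 3) -> bool:
    """Return True if the first `days` follow-up scores are all below `threshold`."""
    if len(scores_after_drop) < days:
        return False  # not enough data yet to apply the rule
    return all(score < threshold for score in scores_after_drop[:days])


# Hypothetical follow-up scores, not real YZ Index data.
print(needs_review([50.0, 50.0, 0.0]))     # stays below 70 for three days
print(needs_review([100.0, 50.0, 100.0]))  # recovers on the first day
print(needs_review([50.0]))                # only one day of data so far
```

Returning False when fewer than three days are available mirrors the article's position that one day of data is not enough to confirm degradation.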

The integrity rating changed from fail to pass, indicating that the model showed no obvious hallucinations or boundary-crossing responses in this quick test. That contrasts with the sharp drop in the code execution score and further supports the view that the fluctuation stems mainly from question difficulty rather than from the model itself.

A 50-point single-day swing in the Smoke quick test reflects draw variance more than model degradation.

Data source: YZ Index | Run #195

Related Reviews

Winzheng Index Qwen3 Max Plunges 19.2 Points on Main Leaderboard; Four Models Score Perfect in Execution and Constraint

Winzheng Index Qwen3 Max Material Constraint Plummets 28.9 Points, Today's Smoke 11 Model Main Leaderboard Reshuffles

Winzheng Index 4 Models' Execution Scores Plunge to 50, ERNIE Bot Drops 34.1 Points on Main Leaderboard

Winzheng Index Qwen3 Max Smoke Evaluation Main Score Plummets 12 Points, Integrity Rating Changes from Pass to Fail