Qwen3 Max Main Score Plummets 19.2 Points, Code Execution Drops 31.2 Points in a Single Day

Jun 21, 2026 | Source: Winzheng Index

Tags: Qwen3 Max, Code Execution, Smoke Test, Model Fluctuations, Main Leaderboard Ranking

In the YZ Index's June 2026 test of 11 models, Qwen3 Max's main score fell from 100 points yesterday to 80.82 points today, a drop of 19.18 points (about 19.2).

Core Dimension Breakdown

Code execution fell from 100.00 to 68.80 points, a drop of 31.2 points and the main driver of the decline in the main score. Material constraint fell from 100.00 to 95.50 points, a drop of only 4.5 points. Because the main score is a weighted combination of these two dimensions, the sharp fall on the execution side pulled the overall score down directly.
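The article does not publish the weights behind the main score, but they can be backed out from the three reported numbers. The sketch below is an inference, not the index's documented formula: it assumes the main score is a simple two-term weighted average, and the 0.55 / 0.45 split is a rounded guess derived from today's scores.

```python
# Scores reported in the article for today's run (both were 100.00 yesterday).
EXEC_TODAY = 68.80       # code execution
MATERIAL_TODAY = 95.50   # material constraint
MAIN_TODAY = 80.82       # main score


def implied_weight(main: float, a: float, b: float) -> float:
    """Solve main = w * a + (1 - w) * b for w, the weight on component a."""
    return (main - b) / (a - b)


# Inferred, not published: the weight on code execution.
w_exec = implied_weight(MAIN_TODAY, EXEC_TODAY, MATERIAL_TODAY)

# Assumption: round the inferred weight to a clean 0.55 / 0.45 split.
W_EXEC, W_MATERIAL = 0.55, 0.45

# Share of the main-score drop attributable to each dimension.
exec_contribution = W_EXEC * (100.00 - EXEC_TODAY)
material_contribution = W_MATERIAL * (100.00 - MATERIAL_TODAY)

print(f"implied execution weight: {w_exec:.3f}")
print(f"drop from code execution: {exec_contribution:.2f}")
print(f"drop from material constraint: {material_contribution:.2f}")
```

Under that assumed split, code execution accounts for roughly 17.2 of the 19.2 lost points and material constraint for about 2.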

Engineering judgment fell from 66.70 to 44.50 points (down 22.2), and task expression fell from 97.50 to 62.50 points (down 35.0). Both are side scores produced by AI-assisted evaluation. Both declined significantly, but neither counts toward the main score ranking.

Cause Analysis of Fluctuation

The Smoke evaluation uses only 10 questions per day, 2 per dimension. With a sample this small, the luck of the question draw alone can swing scores widely. Code execution lost 31.2 points in a single day, far more than the 4.5 points lost on material constraint, which suggests that this run may have drawn questions that are particularly hard for Qwen3 Max's current reasoning path.
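To see how much a 2-question draw can move a dimension score on its own, the sketch below enumerates every possible draw from a small question pool. The pool and its per-question scores are entirely hypothetical (the article does not publish per-question results), and `draw_means` is an illustrative helper, not part of the index's tooling.

```python
from itertools import combinations


def draw_means(pool, k):
    """Return the mean score of every possible k-question draw from pool."""
    return [sum(draw) / k for draw in combinations(pool, k)]


# Hypothetical per-question scores for one dimension (not from the article):
# four questions the model solves fully and two it handles poorly.
POOL = [100, 100, 100, 100, 60, 40]

two_question = draw_means(POOL, 2)   # Smoke-style draw: 2 questions
five_question = draw_means(POOL, 5)  # a larger draw, for comparison

print("2 questions:", min(two_question), "to", max(two_question))
print("5 questions:", min(five_question), "to", max(five_question))
```

In this toy pool, an unchanged model scores anywhere from 50 to 100 on a 2-question draw but only from 80 to 92 on a 5-question draw, so a large single-day swing is possible without any change in capability.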

If similar declines persist for several consecutive days, genuine model degradation should be considered. A single day of data, however, is not enough to conclude that capability has declined.

Is Continued Monitoring Needed?

Because the 31.2-point drop in code execution far exceeds the drop in material constraint, we recommend tracking this dimension closely in the Smoke evaluation over the next 3-5 days. If the execution score stays below 80 points, that may point to stability issues in specific code scenarios.
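One way to make that tracking rule concrete is a small check over the daily execution scores. This is a minimal sketch under assumptions: `stays_below` is a hypothetical helper, and reading "remains below 80" as "the latest 3 runs are all below 80" is one interpretation of the 3-5 day window, not a rule published by the index.

```python
def stays_below(scores, threshold=80.0, min_days=3):
    """Return True if the latest min_days scores are all below threshold."""
    if len(scores) < min_days:
        return False  # not enough history to call it sustained
    return all(score < threshold for score in scores[-min_days:])


# Only 68.80 comes from the article; the later values are made up for illustration.
single_day = [68.80]
sustained = [68.80, 72.00, 75.50]
rebound = [68.80, 100.00, 70.00]

print(stays_below(single_day))
print(stays_below(sustained))
print(stays_below(rebound))
```

With only today's score available the check does not fire, which matches the article's position that one day of data is not enough to call a decline.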

The integrity rating remains "pass", indicating that the model showed no refusals to answer and no obvious boundary-crossing behavior.

For now, the data supports only the conclusion of an "abnormal single-day execution fluctuation" and does not warrant adjusting the long-term ranking.

A 31.2-point execution drop in a single Smoke test looks more like the luck of the draw than a signal of model degradation.

Data source: YZ Index | Run #190

Related Reviews

Winzheng Index Doubao Pro (豆包Pro) Smoke Evaluation Main Ranking Plunges 9.9 Points, Code Execution Halved from 100 to 50

Winzheng Index Qwen3 Max Material Constraint Plummets 28.9 Points, Today's Smoke 11 Model Main Leaderboard Reshuffles

Winzheng Index Claude Sonnet 4.6 Code Execution Plunges from 100 to 50, Main Score Drops 6.9 Points

Winzheng Index Claude Opus 4.7 Scores 100 to Claim Crown, 9 Models See Code Execution Plummet by 50 Points