Qwen3 Max Main Index Plummets 10.9 Points as Code Execution Falls 25 Points in a Single Day

Qwen3 Max's main index fell 10.9 points in today's Smoke test, driven mainly by the code execution dimension dropping from a perfect 100 to 75. A single-day change of this size exceeds the normal range of fluctuation from the daily question draw and deserves serious attention.

Data Breakdown: Why Code Execution Dropped 25 Points

The Smoke evaluation includes only 2 code execution questions per day. Yesterday's perfect score means the model gave correct and efficient execution paths on both questions; today's 75 typically corresponds to one question showing significant errors or efficiency problems. The material constraint dimension actually rose from 75 to 81.3, which indicates that the model has not systematically degraded in constraint following and that the problem is concentrated in code generation and execution.
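The arithmetic behind that reading can be sketched as follows. This is only an illustration: it assumes the dimension score is the unweighted mean of per-question scores on a 0-100 scale, which the source does not state, and the function name is hypothetical.

```python
def implied_second_score(dimension_score: float,
                         known_score: float = 100.0,
                         n_questions: int = 2) -> float:
    """Score the remaining question must have received.

    Assumption (not confirmed by the source): the dimension score is the
    plain average of n_questions per-question scores, each in 0-100, and
    all other questions scored known_score.
    """
    total = dimension_score * n_questions
    return total - known_score * (n_questions - 1)


# With 2 questions and one answered perfectly, a dimension score of 75
# implies the other question scored 50.
print(implied_second_score(75.0))  # 50.0
```

Under this assumption, a dimension with only 2 questions moves by half of whatever a single question loses, so one partially failed question is enough to produce the whole 25-point drop.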

Question Fluctuation or Actual Degradation

Randomness in the daily 10-question quick test does exist, but a 25-point drop in the code execution dimension from one day to the next can no longer be attributed to question difficulty alone. Fluctuations of this scale within a single day have been relatively rare in previous Qwen series evaluations. More notably, the integrity rating switched from pass to warn, which typically means the model's answers contained verifiable logical or factual problems.

Engineering judgment (a side index, AI-assisted evaluation) rose from 30 to 50, while task expression remained unchanged at 30. Neither side index weakened, which further narrows the problem to the core capability of code execution.

Recent Industry Trends and Possible Triggers

Alibaba has recently conducted multiple rounds of alignment and safety reinforcement training on the Qwen3 series. Some developers have reported that the model has become more conservative when following complex instructions, with shorter generated code and fewer tool invocations. This conservatism may translate directly into errors in execution efficiency and boundary-condition handling in the Smoke evaluation.

  • The post-training model tends to output "safe but not sufficiently aggressive" code solutions
  • Some questions in the Smoke evaluation require aggressive optimization or edge-case handling, easily exposing weaknesses
  • The integrity rating of warn suggests possible hallucinations or logical leaps, further amplifying score deductions

Does This Warrant Close Attention?

Yes. Code execution is one of the two auditable dimensions of the main index. A single-day drop of 25 points, coupled with a downgrade in the integrity rating, is a clear signal. We recommend watching the Smoke data for 3 to 5 consecutive days. If the code execution dimension does not recover to above 90 in that window, the model's real capability should be treated as having degraded, at least for the time being.
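The watch rule above could be expressed as a small check. This is a minimal sketch, not part of the YZ Index tooling: the function name and return labels are hypothetical, and it assumes that any single day above the threshold counts as a recovery, which is one possible reading of the recommendation.

```python
from typing import Sequence


def assess_recovery(daily_scores: Sequence[float],
                    threshold: float = 90.0,
                    min_days: int = 3) -> str:
    """Classify a run of daily code execution scores after the drop.

    Assumptions (hypothetical, chosen for illustration):
    - fewer than min_days of data is not enough to judge;
    - any day scoring above threshold counts as a recovery.
    """
    if len(daily_scores) < min_days:
        return "keep observing"
    if any(score > threshold for score in daily_scores):
        return "recovered"
    return "possible degradation"


# Example: five days of observation with no score above 90.
print(assess_recovery([75, 75, 50, 75, 75]))  # possible degradation
```

A stricter variant would require the most recent days to stay above the threshold; with only 2 code execution questions per day, a single good day may simply reflect an easier draw.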

Specific values for the stability dimension have not yet been disclosed, but based on today's performance, the model's output consistency on similar questions may have declined. This would be consistent with "over-conservatism" caused by excessive alignment intensity during post-training.

If Qwen3 Max cannot restore its code execution level next week, developer community confidence in its positioning as the "strongest open-source code model" will be further shaken.


Data source: YZ Index | Run #121