Qwen3 Max Main Leaderboard Plummets 12.9 Points, Code Execution Drops 26.8 in a Single Day

In the June 2026 Smoke evaluation of the YZ Index, Qwen3 Max's main leaderboard score fell from 84.92 to 72.02, a drop of 12.9 points. The code execution dimension fell from 96.30 to 69.50.

Single-Day Data Breakdown

This Smoke evaluation contained only 10 questions, 2 of them in the code execution dimension. Qwen3 Max's code execution score fell by 26.8 points, while material constraint rose from 71.00 to 75.10, engineering judgment rose from 55.60 to 66.70, and task expression rose from 65.00 to 75.00. Because the main leaderboard score is a weighted combination of code execution and material constraint only, the sharp decline in code execution pulled the main score down despite the gain in material constraint.
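The arithmetic above can be checked with a short sketch. The scores are taken from this article; the weights of the main leaderboard are not stated in the source, so the function `implied_code_weight` is a hypothetical helper that assumes the main score is a simple two-term weighted average and solves for the code execution weight.

```python
# Scores reported in this article as (previous day, current day).
scores = {
    "code_execution": (96.30, 69.50),
    "material_constraint": (71.00, 75.10),
    "engineering_judgment": (55.60, 66.70),
    "task_expression": (65.00, 75.00),
}
main = (84.92, 72.02)

# Day-over-day change per dimension, rounded to the published precision.
deltas = {name: round(cur - prev, 2) for name, (prev, cur) in scores.items()}
main_delta = round(main[1] - main[0], 2)


def implied_code_weight(main_score, code, material):
    """Solve main = w * code + (1 - w) * material for w.

    ASSUMPTION: the main score is a two-term weighted average.
    The source does not publish the actual weights.
    """
    return (main_score - material) / (code - material)


w_prev = implied_code_weight(main[0], scores["code_execution"][0],
                             scores["material_constraint"][0])
w_cur = implied_code_weight(main[1], scores["code_execution"][1],
                            scores["material_constraint"][1])

print(deltas, main_delta)
print(round(w_prev, 3), round(w_cur, 3))
```

Under that assumption, the implied code execution weight comes out near 0.55 on both days, which would be consistent with the main score moving by roughly half of the code execution drop. This is an inference from the published numbers, not a confirmed property of the YZ Index.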

Fluctuation or Degradation

The Smoke evaluation draws different questions each day. With only 10 questions per day, random differences in question difficulty can move scores noticeably, and with only 2 questions in code execution, a single question's result can swing that dimension sharply. Code execution fell 26.8 points while material constraint rose 4.1 points and both side leaderboard dimensions also rose, so the decline was confined to code execution. The current data covers only two days, which is not enough to distinguish question-draw fluctuation from an actual change in model capability. Several consecutive days of testing on comparable questions are needed to determine whether systematic degradation has occurred.

Should Attention Be Paid?

A single-day anomaly falls within the normal range for small-sample quick tests. Even so, given the 26.8-point decline in code execution, it is recommended to add Qwen3 Max to the next day's Smoke retest list. A full long leaderboard retest should be initiated only if the code execution score stays below 75 points for two consecutive days. The engineering judgment and task expression side leaderboard scores rose by 11.1 points and 10.0 points respectively, indicating that the model's performance on non-code tasks did not decline at the same time.
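The escalation rule above can be written as a small check. This is a minimal sketch: the 75-point threshold and the two-day window come from the text, while the function name `needs_full_retest` and the list-of-daily-scores input are hypothetical choices made for illustration.

```python
CODE_EXEC_THRESHOLD = 75.0  # threshold stated in the article
CONSECUTIVE_DAYS = 2        # window stated in the article


def needs_full_retest(code_exec_scores,
                      threshold=CODE_EXEC_THRESHOLD,
                      days=CONSECUTIVE_DAYS):
    """Return True when the most recent `days` daily scores are all below `threshold`.

    `code_exec_scores` is a chronological list of daily code execution
    scores (hypothetical input format, oldest first).
    """
    recent = code_exec_scores[-days:]
    # Require a full window so a single low day never triggers escalation.
    return len(recent) == days and all(s < threshold for s in recent)


# The two days reported so far: only the latest day is below 75.
print(needs_full_retest([96.30, 69.50]))

# Hypothetical third day at 71.0 (not real data): two consecutive low days.
print(needs_full_retest([96.30, 69.50, 71.0]))
```

With the two days reported so far the check returns False, since only one day is below the threshold. A second low day, such as the made-up 71.0 in the second call, would return True and trigger the full retest.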

With only one day-over-day comparison available, it is not yet possible to confirm model degradation for Qwen3 Max; question-draw fluctuation remains the more likely explanation.

A 12.9-point main leaderboard drop, driven by a 26.8-point fall in the code execution dimension, which contained only 2 questions in this run.

Data source: YZ Index | Run #213