Qwen3 Max Smoke Evaluation Main Score Plummets 12 Points, Integrity Rating Changes from Pass to Fail

In today's YZ Index Smoke evaluation, Qwen3 Max's main score dropped from 85.96 to 74.00, a decrease of 11.96 points (roughly 12).

Dimensional Breakdown: Clear Polarization

The code execution dimension remained unchanged at 100.00, while the material constraint dimension rose sharply from 68.80 to 95.70, an increase of 26.9 points. The decline in the main score came primarily from two dimensions: sideboard engineering judgment, which dropped from 63.20 to 48.40 (down 14.8 points), and task expression, which dropped from 87.50 to 68.80 (down 18.7 points). At the same time, the integrity rating changed from pass to fail.
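The per-dimension changes can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the scores quoted in this report; the dictionary layout and the function name `deltas` are illustrative choices, not part of the YZ Index tooling.

```python
# Scores quoted in the report, as (previous run, current run).
scores = {
    "main score": (85.96, 74.00),
    "code execution": (100.00, 100.00),
    "material constraint": (68.80, 95.70),
    "engineering judgment": (63.20, 48.40),
    "task expression": (87.50, 68.80),
}


def deltas(table):
    """Return current minus previous for each entry, rounded to 2 decimals."""
    return {name: round(cur - prev, 2) for name, (prev, cur) in table.items()}


changes = deltas(scores)
for name, change in changes.items():
    # A leading sign makes gains and losses easy to tell apart.
    print(f"{name:22s} {change:+.2f}")
```

Rounding to two decimals matches the precision of the published scores and hides floating-point noise in the subtraction.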

Analysis of Volatility Sources

The Smoke evaluation has only 10 questions per day, with 2 questions per dimension, so each daily score depends heavily on which questions happen to be drawn. Qwen3 Max's two main score dimensions, code execution and material constraint, either held steady or rose, indicating no systematic degradation in the model's core auditable capabilities. The decline in engineering judgment and task expression is more likely a short-term fluctuation caused by which question types were selected that day.
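To get a feel for how noisy a 2-question dimension score can be, here is a rough sketch based on the standard error of a mean. It assumes that a dimension score is the plain average of its questions and that questions are drawn independently each day; neither assumption is confirmed by the report, and the per-question standard deviation of 25 points is a made-up input for illustration only.

```python
import math


def se_of_mean(per_question_sd, n_questions):
    """Standard error of an average over n independently drawn questions."""
    return per_question_sd / math.sqrt(n_questions)


def sd_of_daily_change(per_question_sd, n_questions):
    """Standard deviation of the difference between two independent daily averages."""
    return math.sqrt(2) * se_of_mean(per_question_sd, n_questions)


# Hypothetical input: per-question scores vary with a standard deviation of 25 points.
ASSUMED_QUESTION_SD = 25.0
QUESTIONS_PER_DIMENSION = 2  # figure quoted in the report

daily_se = se_of_mean(ASSUMED_QUESTION_SD, QUESTIONS_PER_DIMENSION)
change_sd = sd_of_daily_change(ASSUMED_QUESTION_SD, QUESTIONS_PER_DIMENSION)
print(f"standard error of one day's score: {daily_se:.1f}")
print(f"typical day-to-day swing (1 sd):   {change_sd:.1f}")
```

Under these assumptions, day-to-day swings in a single dimension as large as the per-question spread itself would be unremarkable, which is why a single day's drop in a 2-question dimension carries little weight.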

However, the integrity rating flipped outright from pass to fail, which goes beyond the normal range of random fluctuation. This rating serves as an access threshold: once a fail is triggered, it usually indicates a clear issue with the model's consistency or compliance, and it should be treated separately from simple score fluctuations.

Whether Continuous Monitoring Is Needed

A single day of Smoke data is not enough to establish real degradation of the model, but the change in integrity rating is already a clear signal. It is recommended to track the standard deviation of the model's scores in each dimension over the next 3-5 days. If the main score stays below 80 and the integrity rating stays at fail throughout, a formal weekly review retest should be initiated.
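The monitoring rule above can be sketched as a small check. The threshold of 80 and the 3-day minimum come from this report; the function names, the tuple layout, and every score after the first day are hypothetical placeholders, not real YZ Index results.

```python
from statistics import stdev

MAIN_SCORE_FLOOR = 80.0  # threshold quoted in the report
MIN_DAYS = 3             # lower end of the suggested 3-5 day window


def needs_weekly_retest(days):
    """days: list of (main_score, integrity_passed) tuples, one per observed day.

    Returns True only if enough days were observed, every main score is
    below the floor, and the integrity rating failed on every day.
    """
    if len(days) < MIN_DAYS:
        return False
    all_below = all(score < MAIN_SCORE_FLOOR for score, _ in days)
    all_failed = all(not passed for _, passed in days)
    return all_below and all_failed


def score_spread(days):
    """Sample standard deviation of the main score across the observed days."""
    return stdev([score for score, _ in days])


# Day 1 is the run reported here; the remaining days are invented for illustration.
observed = [(74.00, False), (76.50, False), (78.00, False)]
print(needs_weekly_retest(observed), round(score_spread(observed), 2))
```

Requiring both conditions on every observed day keeps a single bad draw from triggering a retest, in line with the caution about small daily samples.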

For now, Qwen3 Max has more likely run into an unfavorable draw of highly volatile questions than suffered a cliff-like drop in capability. Users calling it in production environments can still give priority to its code execution performance, which held steady at 100.00.


Data source: YZ Index | Run #194