Qwen3 Max Tops WDCD Commitment Ranking with 92.50; Doubao Pro Ranks Last at 62.50, 30 Points Behind

Qwen3 Max scored 92.50 to top the WDCD Commitment Ranking, leading second-place Claude Sonnet 4.6's 90.00 by 2.5 points. Meanwhile, Doubao Pro ranked last among the 11 evaluated models with a score of 62.50, trailing the champion by 30 points.

Ranking Landscape: Concentration at the Top, Stalemate in the Middle, Gap at the Bottom

This WDCD ranking shows a clear three-tier distribution. The top four—Qwen3 Max, Claude Sonnet 4.6, DeepSeek V4 Pro, and Claude Opus 4.7—all scored above 85 points, forming the first echelon. Among them, Qwen3 Max's R3 score of 1.90/2 was the highest across all models, indicating its ability to maintain a high constraint adherence rate even under direct pressure.

The fifth through ninth positions are clustered in the 77.5–82.5 point range: Wenxin Yiyan 4.5 and Grok 4 both at 82.50, Gemini 2.5 Pro and Gemini 3.1 Pro both at 80.00, and GPT-5.5 at 77.50. R2 scores in this range generally fall between 0.7 and 0.8, suggesting that these models already loosen their constraints to varying degrees when interrupted by irrelevant topics.

The bottom two, GPT-o3 at 70.00 and Doubao Pro at 62.50, sit clearly below the middle tier. Doubao Pro's R1 is only 0.70 and its R3 only 1.20/2, indicating that it failed to fully establish rule boundaries even during the initial constraint injection phase and weakened further under pressure.

Champion Analysis: How Did Qwen3 Max Achieve Its R3 Score of 1.90?

Across the three test rounds, Qwen3 Max scored R1=1.00, R2=0.80, and R3=1.90/2, each among the highest in the ranking. Notably, its R3 score is 0.60 points higher than that of ninth-place GPT-5.5 and 1.00 points higher than that of tenth-place GPT-o3. This indicates that Qwen3 Max keeps a high rate of constraint adherence even under direct pressure in business-rule and safety-compliance scenarios.

Last-Place Analysis: What Weaknesses Does Doubao Pro's 62.50 Reveal?

Doubao Pro's three-round scores are R1=0.70, R2=0.60, and R3=1.20/2, with R1 and R3 both ranking last. Its R1 score is below the average, indicating deficiencies already present during the initial constraint establishment phase. Its R3 score of only 1.20/2 is 0.70 points lower than Qwen3 Max, reflecting a greater tendency to break constraints under engineering specifications and resource limitation scenarios.

Gap Between Top and Bottom Tiers: The Composition of the 30-Point Difference

The four first-tier models average 88.75 points and the two bottom models average 66.25 points, a gap of 22.5 points. Comparing only the champion and the last-place model, the gap reaches 30 points. The difference is concentrated in the R3 dimension: Qwen3 Max scored R3=1.90 and Doubao Pro R3=1.20, a single-dimension gap of 0.70 points. Set against the combined 1.20-point gap across the three rounds (0.30 in R1, 0.20 in R2, 0.70 in R3), R3 accounts for about 58% of the total score difference.
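The arithmetic behind this gap can be checked with a short sketch. It assumes, without confirmation from the published methodology, that the overall score is the sum of the three round scores divided by a 4-point maximum (R1 and R2 out of 1, R3 out of 2) and scaled to 100. That formula reproduces both reported totals, 92.50 and 62.50, but it is an inference, and the function names are illustrative only.

```python
# Round scores as reported in the article.
ROUNDS = {
    "Qwen3 Max": {"R1": 1.00, "R2": 0.80, "R3": 1.90},
    "Doubao Pro": {"R1": 0.70, "R2": 0.60, "R3": 1.20},
}

# ASSUMPTION: R1 and R2 are scored out of 1 and R3 out of 2, so the
# raw maximum is 4. This is inferred from the reported totals, not
# taken from the WDCD methodology.
MAX_RAW = 4.0


def total_score(rounds):
    """Overall score on a 100-point scale under the assumed formula."""
    return round(sum(rounds.values()) / MAX_RAW * 100, 2)


def gap_shares(top, bottom):
    """Percentage of the raw gap contributed by each round."""
    gaps = {name: top[name] - bottom[name] for name in top}
    total_gap = sum(gaps.values())
    return {name: round(gap / total_gap * 100, 1) for name, gap in gaps.items()}


if __name__ == "__main__":
    for model, rounds in ROUNDS.items():
        print(model, total_score(rounds))  # 92.5 and 62.5
    # R1 contributes 25.0%, R2 16.7%, R3 58.3% of the gap
    print(gap_shares(ROUNDS["Qwen3 Max"], ROUNDS["Doubao Pro"]))
```

Under this assumption, R3 is the largest single contributor to the 30-point difference, with the R1 and R2 shortfalls making up the remainder.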

Global statistics show a full-score rate of 47.3% and an R3 collapse rate of 16.4%. This means that in the direct pressure round, roughly one in six test cases still exhibits a constraint violation, with the bottom-tier models contributing significantly to this outcome.

Comparison with the Previous Period: Claude Opus 4.7 Shows the Largest Increase

Claude Opus 4.7 improved by 15.5 points compared to the previous period, Claude Sonnet 4.6 by 14.2 points, and DeepSeek V4 Pro by 11.7 points. Qwen3 Max also rose by 8.1 points. The only model that declined was Doubao Pro, dropping 5.5 points, further widening the gap with the top tier.

The 30-point gap is driven mainly by the R3 dimension, and Qwen3 Max's 1.90 in the pressure round is currently the strongest evidence of constraint adherence in the ranking.

Data from this pilot phase indicate that models generally perform well when constraints are first established, but that differences widen rapidly under continuous interference and direct pressure. The R3 advantages of Qwen3 Max and Claude Sonnet 4.6 may stem from stricter internal alignment mechanisms.


Data source: YZ Index WDCD Commitment Ranking | Run #185 · Overall Ranking | Evaluation Methodology