Qwen3 Max tops WDCD Compliance Leaderboard with 84.38 points, GPT-o3 at bottom with 67.19 points, a gap of 17 points

Qwen3 Max tops the WDCD Compliance Leaderboard with 84.38 points, while GPT-o3 ranks last with 67.19 points, a difference of 17.19 points.

Ranking Landscape: Tight at the Top, Gap at the Bottom

This edition of the WDCD Leaderboard shows a clear tiered distribution. The top three (Qwen3 Max, Grok 4, and Gemini 3.1 Pro) scored 84.38, 82.03, and 79.69 respectively, each separated from the next by about 2.3 points. Fifth through seventh place (Claude Sonnet 4.6, DeepSeek V4 Pro, and GPT-5.5) are tied at 75.78, forming a plateau. Tenth-place Doubao Pro (67.97) and eleventh-place GPT-o3 (67.19) differ by only 0.78 points, yet trail the leader by 16.41 and 17.19 points respectively.
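The spacing between ranks can be recomputed from the totals quoted above. The following is a minimal sketch that uses only the scores named in this section; the dictionary and the `gap` helper are illustrative and not part of any WDCD tooling.

```python
# Totals quoted in this article (points out of 100).
scores = {
    "Qwen3 Max": 84.38,
    "Grok 4": 82.03,
    "Gemini 3.1 Pro": 79.69,
    "Doubao Pro": 67.97,
    "GPT-o3": 67.19,
}


def gap(higher: str, lower: str) -> float:
    """Point difference between two models, rounded to two decimals."""
    return round(scores[higher] - scores[lower], 2)


# Spacing inside the top three.
top_gaps = [gap("Qwen3 Max", "Grok 4"), gap("Grok 4", "Gemini 3.1 Pro")]

# Spacing between the last two places.
bottom_gap = gap("Doubao Pro", "GPT-o3")

# Distance of the last two places from the leader.
leader_gaps = {name: gap("Qwen3 Max", name) for name in ("Doubao Pro", "GPT-o3")}

print("top-three gaps:", top_gaps)
print("bottom gap:", bottom_gap)
print("gaps to leader:", leader_gaps)
```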

Winner Analysis: Qwen3 Max's R3 Score of 1.59 Seizes Advantage

Qwen3 Max scored 1.59 in the R3 phase, higher than Grok 4's 1.44 and Gemini 3.1 Pro's 1.47. It achieved a perfect 1.00 in R1 and 0.78 in R2, and its R3 score reached nearly 80% of the 2-point maximum, demonstrating the strongest overall constraint maintenance across the three stress tests. The overall gap between the top two is only 2.35 points, and Qwen3 Max's 0.15-point R3 lead over Grok 4 indicates better resistance to interference under direct pressure.

Bottom Model: GPT-o3's R3 of Only 0.84 Exposes Biggest Weakness

GPT-o3 scored 0.84 in R3, the lowest among the 11 models. It achieved 1.00 in R1 and 0.84 in R2, but collapsed in R3, finishing 0.16 points below the second-lowest R3 score, Claude Opus 4.7's 1.00. Global statistics show a 25% failure rate in R3, and GPT-o3 earned only 42% of the 2 points available in this phase, indicating the weakest constraint persistence in business rules and security compliance scenarios.
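The phase scores and totals quoted so far appear to be related by a simple rule: add the three phase scores and rescale from a 4-point maximum to 100. That rule is inferred from the published figures, not taken from the WDCD methodology. The 1-point maximum for R2 is likewise an assumption, chosen to be consistent with R3's 2 points being half of the total. A minimal sketch of the check:

```python
# Assumed phase maxima: the article gives 2 points for R3 and a "perfect 1.00"
# in R1; the 1-point maximum for R2 is an assumption.
MAX_POINTS = {"R1": 1.0, "R2": 1.0, "R3": 2.0}


def total_score(r1: float, r2: float, r3: float) -> float:
    """Hypothetical scoring rule: sum of phase scores scaled to 0-100."""
    return (r1 + r2 + r3) / sum(MAX_POINTS.values()) * 100


# Phase scores as quoted (already rounded to two decimals).
qwen_total = total_score(1.00, 0.78, 1.59)
o3_total = total_score(1.00, 0.84, 0.84)

# Compare against the published totals of 84.38 and 67.19.
print(round(qwen_total, 2), round(o3_total, 2))
```

Because the inputs are rounded to two decimals, the reconstructed totals differ from the published ones by a fraction of a point, which is the size of discrepancy expected from rounding alone.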

Gap Between Top and Bottom: R3 Weight Determines Final Ranking

The top three average 1.50 in R3, while the bottom three average only 1.06, a gap of 0.44 points. Because R3 is worth up to 2 points, half of the total, this phase amplifies overall score differences. The R3 difference between Qwen3 Max and GPT-o3 is 0.75 points. If the total is the sum of the three phase scores scaled from a 4-point maximum to 100, which the published figures match, that difference converts to 18.75 points, more than the entire 17.19-point gap between first and last; GPT-o3 recovers roughly 1.5 points through its higher R2 score. In R2, Grok 4 scored 0.84 (the same as GPT-o3) against Qwen3 Max's 0.78, indicating that Grok 4 held up better under irrelevant-topic distractions, but its lower R3 score left it behind Qwen3 Max.
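To see how much each phase contributes to the first-to-last gap, the difference can be split phase by phase. The sketch below assumes the total is the sum of the phase scores rescaled from a 4-point maximum to 100. That rule is inferred from the published numbers, not confirmed by the source, and the variable names are illustrative.

```python
# Phase scores quoted in this article (rounded to two decimals).
qwen = {"R1": 1.00, "R2": 0.78, "R3": 1.59}
gpt_o3 = {"R1": 1.00, "R2": 0.84, "R3": 0.84}

TOTAL_MAX = 4.0  # assumed: 1 (R1) + 1 (R2) + 2 (R3)


def contribution(phase: str) -> float:
    """Points (on the 0-100 scale) by which Qwen3 Max leads GPT-o3 in one phase."""
    return (qwen[phase] - gpt_o3[phase]) / TOTAL_MAX * 100


by_phase = {phase: round(contribution(phase), 2) for phase in ("R1", "R2", "R3")}
overall = round(sum(contribution(p) for p in ("R1", "R2", "R3")), 2)

# Average R3 score of the top three (Qwen3 Max, Grok 4, Gemini 3.1 Pro).
top_three_r3 = round((1.59 + 1.44 + 1.47) / 3, 2)

print(by_phase)
print("reconstructed gap:", overall)
print("top-three R3 average:", top_three_r3)
```

Under this assumption R3 alone is worth more than the whole gap, R2 works slightly in GPT-o3's favor, and R1 contributes nothing because both models scored 1.00. The reconstructed gap lands close to the published 17.19, with the small remainder attributable to rounding in the quoted phase scores.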

Comparison with Previous Edition: Qwen3 Max Up by 17.2 Points, Tops Improvement

All 11 models improved compared to the previous edition. The three largest gains were Qwen3 Max at 17.2 points, Claude Opus 4.7 at 16.4 points, and GPT-o3 at 15.6 points. Among these three, Qwen3 Max's R3 improvement was the most significant, directly propelling it from a possible mid-tier position to first place. Doubao Pro improved by only 5.5 points, the smallest increase, and its R1 score remains at 0.63, indicating a persistent weakness in the initial constraint injection phase.

The perfect score rate is 37.8%, so fewer than four in ten results earned full marks. No model maintained its constraints across all 32 questions, since even the leader scored 84.38, and most models exhibited varying degrees of violation in engineering specification or data boundary scenarios. R3 accounts for 50% of the total score, and its 25% failure rate further confirms that this phase is the core indicator for distinguishing model compliance ability.

Under three progressive stress levels, the R3 score has become the key variable determining WDCD rankings.

Data source: YZ Index WDCD Compliance Leaderboard | Run #171 · Overall Ranking | Evaluation Methodology