Qwen3 Max Dominates WDCD with 72.5 Points, ERNIE Bot 4.5 Trails at 45 Points with 60.9% R3 Breakdown Rate

The WDCD commitment test uses three rounds of dialogue to probe where large models' constraints give way. Qwen3 Max scored 72.50 points, 7.5 points ahead of second-place Claude Sonnet 4.6. ERNIE Bot 4.5 scored 45 points, the only model below 50, and the 60.9% breakdown rate in the R3 phase strips away the industry's fig leaf.

Three-Round Mechanism Exposes True Gaps

WDCD has a maximum score of 4 points, with R3 accounting for half the weight. Qwen3 Max still held 0.90 in R3, showing that it can refuse prohibited requests even after two rounds of unrelated distraction. ERNIE Bot 4.5, by contrast, scored only 0.30 in R3, effectively giving way under high pressure. The 60.9% R3 breakdown rate indicates that under realistic "build rapport, then apply pressure" attacks, most models' compliance largely collapses.
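The weighting described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual scoring code: it assumes each round is graded on a 0 to 1 scale, that R1 and R2 carry 1 point each (the text only states the 4-point maximum and the 2-point R3 weight), and the function name is hypothetical. How these weighted points map onto the 100-point headline scores is not described in the source, so the sketch does not attempt it.

```python
# Assumed weights: R3 carries 2 of the 4 points (stated in the article);
# R1 and R2 at 1 point each is an inference from the 4-point maximum.
ROUND_WEIGHTS = {"r1": 1.0, "r2": 1.0, "r3": 2.0}
MAX_SCORE = sum(ROUND_WEIGHTS.values())  # 4.0


def weighted_item_score(r1: float, r2: float, r3: float) -> float:
    """Weighted score for one test item; each round score must lie in [0, 1]."""
    rounds = {"r1": r1, "r2": r2, "r3": r3}
    for name, value in rounds.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return sum(ROUND_WEIGHTS[name] * value for name, value in rounds.items())


# Hypothetical item: the model holds in R1 and R2 but gives way in R3.
holds_then_breaks = weighted_item_score(1.0, 1.0, 0.0)
print(holds_then_breaks / MAX_SCORE)  # share of the maximum kept
```

Under these assumed weights, a model that is flawless in the first two rounds but fails the third keeps only half of the available points, which is why R3 dominates the outcome.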

Top Tier: Qwen3 Max Truly Maintains Compliance Across All Three Rounds

Qwen3 Max is the only model scoring near perfection in R1, R2, and R3. With R1 at 1 point, R2 at 1 point, and R3 at 0.90, it refuses consistently across data-boundary, resource-constraint, and safety-compliance scenarios. It ranked only mid-table in the previous period; this period it gained 15 points, tying DeepSeek V4 Pro for the largest increase, which suggests its ability to hold engineering-specification constraints is improving quickly.

Claude Sonnet 4.6 follows with 65 points, but its R3 dropped to 0.70, exposing degradation after consecutive distractions. DeepSeek V4 Pro edged into the top three with an R3 of 0.60 and a 15-point gain over the previous period, showing markedly better pressure resistance in safety-compliance scenarios.

Mid-Tier Divide: Gemini and GPT Stuck at 60 Points

Gemini 2.5 Pro and GPT-5.5 both scored 60 points, with R3 at 0.60 for both. Both models held at 1 point in R1 but began to waver after the R2 distractions. Claude Opus 4.7 is a more typical case: it performed decently in R1 and R2 but dropped to just 0.40 in R3, a 7.5-point fall from the previous period. The "polite, then forceful" three-round design exposes where these models' limits actually lie.

Bottom Reality: Severe Polarization Among Domestic Models

Doubao Pro and ERNIE Bot 4.5 occupy the last places. Doubao Pro scored only 0.60 in R1, indicating that it lost ground as early as the first round of constraint injection; ERNIE Bot 4.5 did slightly better at 0.70 in R1 but managed only 0.30 in R3. Both models were already at the bottom in the previous period and declined further this period, falling by 12.5 points and 7.5 points respectively. Both still trail Qwen3 Max, itself a domestic model, by roughly a generation in engineering-specification and safety-compliance constraints.

R3 Is the True Dividing Line

Sorting all models by their R3 score yields an almost identical order to the final ranking: Qwen3 Max at 0.90, Claude Sonnet 4.6 at 0.70, then DeepSeek V4 Pro, Gemini 2.5 Pro, and GPT-5.5 at 0.60, followed by a drop to 0.40 for Claude Opus 4.7 and 0.30 for ERNIE Bot 4.5. Giving R3 a weight of 2 points doubles the emphasis on the core capability of "whether adherence holds under high pressure," which makes the 11.8% perfect score rate all the more glaring.
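The claim that the R3 order tracks the final ranking can be checked on the figures quoted in this article. The sketch below is illustrative only: it covers just the five models for which the text gives both an overall score and an R3 score (DeepSeek V4 Pro, Claude Opus 4.7, and Doubao Pro are missing one or the other), and the `ranks` helper is a hypothetical name introduced here.

```python
# (model, overall score, R3 score), restricted to models where the
# article reports both numbers.
REPORTED = [
    ("Qwen3 Max", 72.5, 0.90),
    ("Claude Sonnet 4.6", 65.0, 0.70),
    ("Gemini 2.5 Pro", 60.0, 0.60),
    ("GPT-5.5", 60.0, 0.60),
    ("ERNIE Bot 4.5", 45.0, 0.30),
]


def ranks(values):
    """Competition ranks (1 = highest); tied values share the same rank."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


overall_ranks = ranks([overall for _, overall, _ in REPORTED])
r3_ranks = ranks([r3 for _, _, r3 in REPORTED])

for (model, _, _), by_overall, by_r3 in zip(REPORTED, overall_ranks, r3_ranks):
    print(f"{model}: overall rank {by_overall}, R3 rank {by_r3}")
```

On this subset the two rankings coincide, including the tie between Gemini 2.5 Pro and GPT-5.5. That is consistent with the "almost identical" claim but does not verify it for the full table.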

With a 60.9% breakdown rate when models are directly asked to break constraints in the third round, failure is not a rare event but the industry norm.

The WDCD pilot phase is not included in the main leaderboard, but with its simple three-round dialogue, it has produced the most brutal ranking of current large models' compliance abilities. Qwen3 Max leads with an R3 score of 0.90, while ERNIE Bot 4.5 demonstrates with an R3 of 0.30 just how fragile the bottom line can be. In the next phase, if R3 weight continues to increase or the number of test items grows, the gap between the top and bottom is likely to widen further.


Data source: YZ Index WDCD Compliance Ranking, Run #135