Qwen3 Max Tops WDCD Compliance Ranking with 70.83 Points, Grok4 Trails with 51.67 Points

The first public ranking of the WDCD compliance test directly shatters the myth that "larger parameters mean greater reliability." Qwen3 Max leads the pack with 70.83 points, while Grok4 trails at 51.67 points. The average crash rate in Phase R3 reaches 60.6%, proving that most models are still highly prone to violating constraints under real enterprise conditions.

Ranking Landscape: R3 Becomes the Sole Dividing Line

Among the 11 models, R1 scores are generally 1.00 or above 0.83, and R2 scores mostly exceed 0.90. The real gap emerges in R3. Qwen3 Max achieves 0.83/2 in R3, followed closely by Claude Sonnet 4.6 and Gemini 3.1 Pro with 0.70 and 0.77 respectively. In contrast, Grok4 scores only 0.17/2 in R3, dragging its total score to the bottom.

This distribution indicates that current large models have learned to superficially comply during the "constraint injection" and "irrelevant topic interference" stages. The real test lies in the third round, where direct pressure is applied. R3 accounts for 50% of the total score; those who can hold firm under high pressure occupy the upper half of the ranking.

Champion Analysis: Where Does Qwen3 Max's 0.83 Come From?

Qwen3 Max passes both R1 and R2 across all five constraint scenarios, losing only 0.17 points in R3. Notably, in the high-risk "safety and compliance" and "data boundary" scenarios, the model consistently refuses user requests to modify original rules, maintaining output consistency. In comparison, GPT-o3 also achieves perfect scores in R1 and R2, but only scores 0.63 in R3, exposing its vulnerability under engineering-standard constraints.

Only 21.2% of models achieve a perfect score, meaning fewer than one in four models can complete all three rounds with zero violations.

Trailing Model: What Does Grok4's 0.17 Reveal?

Grok4 performs reasonably well in R1 and R2, but nearly collapses entirely in R3. A typical case is the "resource limitation" scenario: the user requests to exceed quotas citing "urgent business needs," and the model directly agrees after two follow-up questions. In the same scenario, Qwen3 Max and Claude Sonnet 4.6 both persistently refuse, revealing a clear generational gap in the persistence of system prompts.

Gap Between Frontrunners and Tailenders: Not Decimal Points, But Scenario-Level Disparities

The top four models all score above 65, while the bottom four fall below 60. The gap is not evenly distributed but concentrated in the "business rules" and "engineering standards" scenarios, which are common in real enterprise settings. Although 豆包 Pro and 文心一言 4.5 improved by 11.7 points and 10.0 points respectively in this edition, their R3 scores remain in the 0.63 and 0.47 range, still half a point behind the leaders.

  • Average R3 score: frontrunners 0.73, tailenders 0.38
  • Number of violations in safety/compliance scenarios: tailenders 2.8 times that of frontrunners
  • Only 36% of evaluated models can still comply after two consecutive rounds of interference

This means that when enterprises integrate models into real workflows, tailenders are likely to break established boundaries under high pressure or inducements, introducing compliance risks.

Implicit Signals from Comparison with Previous Edition

Gemini 3.1 Pro surged by 14.2 points this edition, mainly due to improvements in R3. Claude Opus 4.7 also advanced by 6.7 points, indicating that Anthropic and Google are continuously iterating on system-level constraint persistence. In contrast, Grok4 shows no significant improvement this edition, with R3 remaining at a low level, suggesting that its defense mechanisms against "direct pressure" attacks have not been effectively upgraded.

The pilot phase is not included in the main ranking, but it clearly outlines the next battleground for models: how to treat initial constraints as inviolable hard rules—rather than negotiable suggestions—after multiple rounds of dialogue.

Compliance capability is no longer a nice-to-have; it is the core threshold for models to truly enter enterprise production environments.


Data source: YZ Index WDCD Compliance Ranking | Run #140 · Overall Ranking | Evaluation Methodology