Grok 4 Scores 91.20 to Top WDCD Compliance Rankings, Qwen3 Max Trails at 57.48 with 33.72-Point Gap

Jul 3, 2026 | Winzheng Index

WDCD Compliance Test | Model Commitment-Keeping Ability | Grok 4 | Qwen3 Max

Grok 4 ranks first on the WDCD Compliance Leaderboard with 91.20 points, while Qwen3 Max ranks last with 57.48 points, a gap of 33.72 points between the top and bottom.

Source of Grok 4's Compliance Resilience

Grok 4's score of 91.20 rests mainly on its stable performance on the v2 anchor questions: 1.00 in R1, 1.00 in R2, and 1.13/2 in R3, holding up across all three rounds. This indicates that Grok 4 retains most of its constraint memory under continuous pressure. By comparison, Gemini 3.1 Pro scored 79.12 on WDCD with an R3 of only 0.63/2, showing that its constraints had loosened by the third round of interference.

Breakdown Path of the Last-Place Qwen3 Max

Behind Qwen3 Max's 57.48 points, R1 scored 1.00, but R2 dropped to 0.88 and R3 reached only 0.38/2, indicating that forgetting had already begun in the second round of interference and became severe in the third. Under worst-of-3 sampling, the worst R3 collapse counts directly against the overall score. Also trailing, Gemini 2.5 Pro scored 59.52 points with an R3 of only 0.50/2, a gap of 2.04 points from Qwen3 Max. Models at the bottom are generally fragile in the R3 phase.
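The effect of worst-of-3 sampling described above can be illustrated with a minimal sketch. It assumes that each question is sampled three times and that the lowest sample is the one that counts; the article does not publish the exact procedure, so the function name `worst_of_n` and the sample values are hypothetical.

```python
from statistics import mean


def worst_of_n(samples: list[float]) -> float:
    """Return the lowest score among repeated samples of one question."""
    if not samples:
        raise ValueError("need at least one sample")
    return min(samples)


# Hypothetical R3 scores (0-2 scale) for one question sampled three times.
# These are illustrative values, not leaderboard data.
r3_samples = [1.0, 2.0, 0.0]

worst = worst_of_n(r3_samples)  # the single collapsed sample decides the score
average = mean(r3_samples)      # what plain averaging would have reported

print(f"worst-of-3: {worst}, mean: {average}")
```

Under this assumption a single collapse is enough to zero out the question, which is why one bad R3 run weighs so heavily on a model's total.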

Top Tier and Mid-Range Gap

The top three, Grok 4 (91.20), Gemini 3.1 Pro (79.12), and GPT-o3 (76.60), lead the table. GPT-o3's R2 is only 0.38 and its R3 only 0.25/2; these low scores in the v3 multi-round progressive pressure phase drag down its overall performance. The fourth to seventh positions, Claude Opus 4.7 (72.24), GLM-4.6 (71.84), Claude Sonnet 4.6 (70.00), and DeepSeek V4 Pro (67.76), are densely clustered within 4.48 points of one another and form the mid-range group.
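To compare the R3 figures quoted so far on a common scale, the sketch below converts each to a fraction of the maximum. It assumes that the "/2" notation means R3 is scored out of 2 points, which the article implies but does not state; the variable names are illustrative.

```python
# R3 scores as reported in the article.
R3_MAX = 2.0  # assumption: "/2" denotes a 2-point maximum

r3_reported = {
    "Grok 4": 1.13,
    "Gemini 3.1 Pro": 0.63,
    "Gemini 2.5 Pro": 0.50,
    "Qwen3 Max": 0.38,
    "GPT-o3": 0.25,
}

# Fraction of the maximum R3 score each model retained.
r3_fraction = {model: round(score / R3_MAX, 3) for model, score in r3_reported.items()}

# Models ordered from strongest to weakest R3 retention.
ranked = sorted(r3_fraction, key=r3_fraction.get, reverse=True)

for model in ranked:
    print(f"{model:15s} {r3_fraction[model]:.3f}")
```

On these numbers GPT-o3 has the lowest R3 of the five despite ranking third overall, so R3 alone does not determine the final ordering.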

Common Characteristics of the Four Bottom Models

The eighth to eleventh positions, GPT-5.5 (60.88), Doubao Pro (59.68), Gemini 2.5 Pro (59.52), and Qwen3 Max (57.48), all scored below 61 points. Their common trait is that R3 scores generally fall in the 0.25 to 0.50 range, with constraints proving hard to maintain after the third round of pressure. Global statistics show a 16% R3 collapse rate, with these four models contributing the majority of those collapses.
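The tier gaps quoted in this article can be checked from the eleven reported scores. The following is a small sketch for doing so; the helper `spread` and the tier boundaries by rank follow the grouping used in the text and are not part of any published WDCD tooling.

```python
# Overall WDCD scores as reported in the article.
scores = {
    "Grok 4": 91.20,
    "Gemini 3.1 Pro": 79.12,
    "GPT-o3": 76.60,
    "Claude Opus 4.7": 72.24,
    "GLM-4.6": 71.84,
    "Claude Sonnet 4.6": 70.00,
    "DeepSeek V4 Pro": 67.76,
    "GPT-5.5": 60.88,
    "Doubao Pro": 59.68,
    "Gemini 2.5 Pro": 59.52,
    "Qwen3 Max": 57.48,
}

ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def spread(first_rank: int, last_rank: int) -> float:
    """Score gap between two ranks (1-indexed, inclusive)."""
    group = [score for _, score in ranking[first_rank - 1:last_rank]]
    return round(max(group) - min(group), 2)


top_bottom_gap = spread(1, 11)  # first place vs. last place
mid_spread = spread(4, 7)       # mid-range group
bottom_spread = spread(8, 11)   # bottom four

print(top_bottom_gap, mid_spread, bottom_spread)
```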

Differentiation Under Five Constraint Scenarios

In data boundary and security compliance scenarios, top models achieve higher S_hold scores and breach their constraints later; resource limitation and engineering standard scenarios expose the weak S_kbv constraint memory of mid-to-bottom models. In the S_integrity dimension, any instance of breaking a constraint while falsely claiming compliance results in a score of 0, further widening the gap between Grok 4 and the other models.
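The S_integrity rule above behaves like a hard gate. The sketch below shows one way such a gate could be written; only the zero-score case comes from the article, while the function name, the boolean inputs, and the 1.0 returned in all other cases are assumptions made for illustration.

```python
def integrity_score(broke_constraint: bool, claimed_compliance: bool) -> float:
    """Hypothetical S_integrity gate.

    Returns 0.0 when the model broke a constraint and still claimed it had
    complied (the case described in the article). The 1.0 returned otherwise
    is an assumption; the real rubric may grade other cases differently.
    """
    if broke_constraint and claimed_compliance:
        return 0.0
    return 1.0


# A break that is falsely reported as compliant is zeroed out.
print(integrity_score(broke_constraint=True, claimed_compliance=True))
# A break that the model admits is not penalized by this gate.
print(integrity_score(broke_constraint=True, claimed_compliance=False))
```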

Across the 25-question pool of the WDCD compliance test, the score is the equally weighted average of the v3 multi-round progressive pressure questions and the v2 three-round anchor questions, a design intended to show how models hold up under realistic conversational pressure.
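The equal weighting described above amounts to a simple mean of two sub-scores. A minimal sketch follows, assuming both sub-scores are already expressed on a 0 to 100 scale; the function name `wdcd_overall` and the input values are hypothetical, since the article does not publish per-component scores.

```python
def wdcd_overall(v3_score: float, v2_score: float) -> float:
    """Equally weighted average of the v3 and v2 sub-scores.

    Assumes both inputs are on a 0-100 scale; the real normalization
    is not described in the article.
    """
    for name, value in (("v3_score", v3_score), ("v2_score", v2_score)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be between 0 and 100")
    return 0.5 * v3_score + 0.5 * v2_score


# Hypothetical sub-scores, not leaderboard data.
overall = wdcd_overall(v3_score=80.0, v2_score=90.0)
print(overall)
```

With equal weights, a weak v3 phase cannot be fully offset by strong anchor-question results, and vice versa.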

The results of this pilot phase suggest that compliance is not simply an alignment issue but a matter of sustaining constraints throughout multi-round interactions. Grok 4's ability to hold a score of 91.20 under the stringent worst-of-3 sampling suggests that its constraint handling is more resistant to pressure.

Data source: YZ Index WDCD Compliance Leaderboard | Run #211 · Overall Ranking | Evaluation Methodology
