Grok 4 ranks first on the WDCD Compliance Leaderboard with 91.20 points, while Qwen3 Max ranks last with 57.48 points, a gap of 33.72 points between the top and bottom.
Source of Grok 4's Compliance Resilience
Grok 4's score of 91.20 comes primarily from its stable performance on the v2 anchor questions: 1.00 in R1, 1.00 in R2, and 1.13/2 in R3, remaining high across all three rounds. This indicates that Grok 4 retains most of its constraint memory even under continuous pressure. By comparison, Gemini 3.1 Pro scored 79.12 on WDCD but reached only 0.63/2 in R3, showing that its constraints began to loosen after the third round of interference.
Breakdown Path of the Last-Place Qwen3 Max
Qwen3 Max's 57.48 points break down as 1.00 in R1, 0.88 in R2, and only 0.38/2 in R3, indicating that forgetting had already begun during the second round of interference. Under worst-of-3 sampling, the worst R3 collapse directly pulled down the overall score. Gemini 2.5 Pro trails similarly at 59.52 points, with R3 also only 0.50/2, a gap of about 2 points (2.04) from Qwen3 Max. Models at the bottom are generally fragile in the R3 phase.
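To illustrate why worst-of-3 sampling punishes a single collapse, here is a minimal sketch. It assumes that each question is sampled three times and only the lowest score is kept. The function name `worst_of_n` and the sample values are hypothetical and are not taken from the WDCD methodology.

```python
from statistics import mean

def worst_of_n(sample_scores):
    """Return the lowest score among repeated samples of one question.

    Assumption: worst-of-3 keeps only the minimum of the three samples.
    """
    if not sample_scores:
        raise ValueError("need at least one sample")
    return min(sample_scores)

# Hypothetical R3 scores on the 0-2 scale for three samples of one question.
r3_samples = [2.0, 1.0, 0.0]

average_score = mean(r3_samples)       # what a mean-based metric would report
worst_score = worst_of_n(r3_samples)   # what a worst-of-3 metric would report

print(f"mean={average_score:.2f}, worst-of-3={worst_score:.2f}")
```

Under this assumption, a model that holds its constraints in two samples but collapses in the third is scored as if it had collapsed every time, which is why an unstable R3 weighs so heavily.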
Top Tier and Mid-Range Gap
The top three, Grok 4 (91.20), Gemini 3.1 Pro (79.12), and GPT-o3 (76.60), form the leading group, although Grok 4 sits 12.08 points ahead of second place. GPT-o3 scores only 0.38 in R2 and 0.25/2 in R3, which points to low scores in the v3 multi-round progressive pressure phase and drags down its overall performance. The fourth to seventh positions, Claude Opus 4.7 (72.24), GLM-4.6 (71.84), Claude Sonnet 4.6 (70.00), and DeepSeek V4 Pro (67.76), are densely clustered: the whole group spans 4.48 points and forms the mid-range.
Common Characteristics of the Four Bottom Models
The eighth to eleventh positions, GPT-5.5 (60.88), 豆包 Pro (59.68), Gemini 2.5 Pro (59.52), and Qwen3 Max (57.48), all scored below 61 points. They share one trait: R3 scores generally fall in the 0.25–0.50 range, meaning constraints are hard to maintain after the third round of pressure. Global statistics show a 16% R3 collapse rate, and these four models account for the majority of those collapses.
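The tier gaps can be re-derived from the overall scores quoted in this analysis. The sketch below uses only those published figures. The helper name `spread` is illustrative, and the tier boundaries (ranks 1 to 3, 4 to 7, 8 to 11) follow the grouping used in the text.

```python
# Overall WDCD scores as quoted in this article.
scores = {
    "Grok 4": 91.20,
    "Gemini 3.1 Pro": 79.12,
    "GPT-o3": 76.60,
    "Claude Opus 4.7": 72.24,
    "GLM-4.6": 71.84,
    "Claude Sonnet 4.6": 70.00,
    "DeepSeek V4 Pro": 67.76,
    "GPT-5.5": 60.88,
    "豆包 Pro": 59.68,
    "Gemini 2.5 Pro": 59.52,
    "Qwen3 Max": 57.48,
}

# Sort from highest to lowest score to obtain the ranking.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def spread(first_rank, last_rank):
    """Score gap between two 1-based rank positions, rounded to 2 decimals."""
    return round(ranked[first_rank - 1][1] - ranked[last_rank - 1][1], 2)

top_bottom_gap = spread(1, 11)      # first place vs. last place
mid_group_spread = spread(4, 7)     # width of the mid-range group
bottom_group_spread = spread(8, 11) # width of the bottom group

for rank, (model, score) in enumerate(ranked, start=1):
    print(f"{rank:>2}. {model:<18} {score:.2f}")
print(top_bottom_gap, mid_group_spread, bottom_group_spread)
```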
Differentiation Under Five Constraint Scenarios
In the data boundary and security compliance scenarios, top models achieve higher S_hold scores and are breached later; the resource limitation and engineering standard scenarios expose the weak S_kbv constraint memory of mid-range and bottom models. In the S_integrity dimension, any instance of breaking a constraint while falsely claiming compliance results in a score of 0, further widening the gap between Grok 4 and the other models.
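The S_integrity zeroing rule can be sketched as follows. This is only an illustration of the rule as stated above: the `Turn` record, its field names, and the `full_score` default are assumptions, and the actual WDCD scoring code is not shown in the source.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One model reply, reduced to the two facts the rule needs (hypothetical)."""
    violated_constraint: bool   # the reply broke an active constraint
    claimed_compliance: bool    # the reply asserted it was still compliant

def integrity_score(turns, full_score=1.0):
    """Return 0 if any turn both violates a constraint and claims compliance.

    Assumption: an honest violation does not trigger the zeroing rule here;
    it would be penalized by other dimensions instead.
    """
    for turn in turns:
        if turn.violated_constraint and turn.claimed_compliance:
            return 0.0
    return full_score

honest_run = [Turn(False, True), Turn(True, False)]
dishonest_run = [Turn(False, True), Turn(True, True)]

print(integrity_score(honest_run), integrity_score(dishonest_run))
```

Because the rule is all-or-nothing, a single dishonest turn erases the dimension, so models that break constraints openly lose less than models that break them while claiming compliance.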
The WDCD compliance test draws on a 25-question pool. The overall score is the equally weighted average of the v3 multi-round progressive pressure questions and the v2 three-round anchor questions, a design intended to reflect how models hold up under realistic conversational pressure.
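As a rough sketch of the equal weighting, assuming both components are already expressed on the same 0 to 100 scale (the source does not state the scale or any normalization step), the combination reduces to a simple mean. The function name `wdcd_overall` and the input values are hypothetical.

```python
def wdcd_overall(v3_score, v2_score):
    """Equally weighted average of the v3 and v2 component scores.

    Assumption: both inputs are on the same 0-100 scale.
    """
    for value in (v3_score, v2_score):
        if not 0.0 <= value <= 100.0:
            raise ValueError("component scores are assumed to lie in [0, 100]")
    return 0.5 * v3_score + 0.5 * v2_score

# Hypothetical component scores, not taken from the leaderboard.
example_overall = wdcd_overall(v3_score=80.0, v2_score=60.0)
print(example_overall)
```

With equal weights, a weak result on either component moves the overall score by half its size, so neither question set can fully mask a collapse in the other.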
The results of this pilot phase suggest that compliance is no longer a simple alignment issue but a matter of sustained survivability across multi-round interactions. That Grok 4 maintains a score of 91.20 under the stringent worst-of-3 sampling indicates that its constraint handling holds up better under pressure.
Data source: YZ Index WDCD Compliance Leaderboard | Run #211 · Overall Ranking | Evaluation Methodology
© 2026 Winzheng.com 赢政天下 | Please credit the source and link to the original article when reposting