The first round of WDCD compliance test results is out. GPT-5.5 takes the top spot with 89.17 points, while GPT-o3 ranks last with just 70.83 points, a gap of more than 18 points. The data directly punctures the myth that "older models are more stable."
Ranking Landscape: Three Distinct Tiers
The top four form the first tier: GPT-5.5 leads on its own, Grok 4 and Qwen3 Max tie at 85.83 points, and Gemini 2.5 Pro follows closely at 85.00 points. All four score near-perfect marks on R1 and generally land between 1.53 and 1.67 on R3, which shows they can hold their initial constraints across three rounds of stress testing.
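The first-tier ordering, including the tie, can be reproduced with a short sketch. It uses only the four scores quoted above. The function name `competition_rank` and the "1, 2, 2, 4" tie convention are illustrative assumptions, not something the WDCD leaderboard is known to use.

```python
# Overall scores of the first tier, as quoted in the article.
first_tier = {
    "GPT-5.5": 89.17,
    "Grok 4": 85.83,
    "Qwen3 Max": 85.83,
    "Gemini 2.5 Pro": 85.00,
}

def competition_rank(scores):
    """Rank by descending score; tied scores share a rank (1, 2, 2, 4 style).

    Hypothetical helper for illustration only.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (name, score) in enumerate(ordered, start=1):
        # A tie reuses the previous rank; otherwise the rank is the position.
        rank = prev_rank if score == prev_score else position
        ranks[name] = rank
        prev_score, prev_rank = score, rank
    return ranks

ranks = competition_rank(first_tier)
for name, rank in ranks.items():
    print(rank, name, first_tier[name])
```

Under this convention Grok 4 and Qwen3 Max share second place and Gemini 2.5 Pro is fourth, which matches the "top four" framing.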
The second tier spans from DeepSeek V4 Pro to Gemini 3.1 Pro, with scores concentrated between 81 and 82.5 points. Notably, while Doubao Pro scores only 0.77 on R1, it achieves a high R3 score of 1.60, showing it holds onto rules more tenaciously under high pressure.
The tail end consists of the two Claude models and GPT-o3. Claude Opus 4.7 and Sonnet 4.6 both drop to 1.23 on R3, while GPT-o3 plummets to 0.90. Its R3 collapse rate alone raises the overall average collapse rate by 20%.
Champion Analysis: Why GPT-5.5 Scored 89.17
GPT-5.5's victory hinges on maintaining a 0.90 score during the R2 interference phase and achieving 1.67 on R3. In contrast, most models, once led astray by irrelevant topics in R2, quickly lose ground in R3. GPT-5.5 performs exceptionally well in data boundary and security compliance scenarios, losing points on only 3 out of 30 questions, demonstrating stronger cross-round memory capabilities.
This shows that top-tier models have evolved "compliance" from superficial instruction-following into an intrinsic contextual priority ordering.
The Bottom Truth: GPT-o3's Systematic Collapse
GPT-o3 started from a low base in the previous round and gained only 5.8 points this time, far below the average improvement of other models. An R3 score of 0.90 means it violates constraints nearly every other time under direct pressure. Particularly in resource limitation and engineering specification scenarios, the model frequently agrees to "break the budget" or "skip code reviews," exposing its fragile memory of initial rules in multi-turn conversations.
The Gap Between Top and Bottom: The Real Divide Behind 18 Points
Behind the 52.4% perfect-score rate, the gap is concentrated in R3. Top models average 1.57 on R3, while tail-end models average just 1.12. Translated to real-world scenarios, choosing the right model can reduce compliance violation risk by nearly 30% in enterprise deployments. The Chinese models Qwen3 Max and Wenxin Yiyan 4.5 both rank in the top six, proving that Chinese models have moved from catching up to running alongside the leaders in the compliance dimension.
- The biggest improvers, Grok 4 (+35.8 points) and Wenxin Yiyan 4.5 (+32.5 points), both made significant strides on R3.
- The Claude series showed the smallest gains, suggesting that its safety alignment strategies actually become constraints under high-pressure testing.
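The R3 figures above can be checked with a few lines of arithmetic. The sketch below recomputes the tail-end R3 average from the three per-model scores quoted earlier and compares it with the reported top-tier average. Reading "nearly 30%" as the relative R3 gap is an assumption about how that figure was derived, not something the source states.

```python
from statistics import mean

# R3 scores of the three tail-end models, as quoted in the article.
tail_r3 = {
    "Claude Opus 4.7": 1.23,
    "Claude Sonnet 4.6": 1.23,
    "GPT-o3": 0.90,
}

# Top-tier R3 average as reported; the per-model values are not all listed,
# so it is taken as given rather than recomputed.
top_r3_avg = 1.57

tail_r3_avg = mean(tail_r3.values())

# Assumption: the "nearly 30%" risk reduction is the relative R3 gap.
relative_gap = (top_r3_avg - tail_r3_avg) / top_r3_avg

# Overall score gap between first place (GPT-5.5) and last place (GPT-o3).
overall_gap = 89.17 - 70.83

print(f"tail R3 average: {tail_r3_avg:.2f}")
print(f"relative R3 gap: {relative_gap:.1%}")
print(f"overall gap: {overall_gap:.2f} points")
```

The recomputed tail average rounds to the article's 1.12, and the relative gap comes out just under 29%, consistent with "nearly 30%" under the stated assumption.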
While the pilot phase is not included in the main leaderboard, it already reveals a harsh reality: parameter scale is no longer linearly correlated with compliance capability; the choice of architecture and training objectives matters more.
If R3 weight continues to increase in the next round, GPT-5.5's lead may widen further, while GPT-o3 will need a complete overhaul of its contextual priority mechanism to turn the tide.
Data Source: YZ Index WDCD Compliance Ranking | Run #161 · Overall Ranking | Evaluation Methodology
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article