GPT-5.5 Tops WDCD with 89.17 Points, GPT-o3 Trails at 70.83 Points in Collapse

Jun 11, 2026 | Source: Winzheng Index

WDCD Compliance Test, AI Model Rankings, Constraint Following, LLM Evaluation

The first-round results of the WDCD compliance test are out: GPT-5.5 takes the top spot with 89.17 points, while GPT-o3 ranks last with just 70.83, a gap of more than 18 points. The data directly punctures the myth that "older models are more stable."

Ranking Landscape: Three Distinct Tiers

The top four form the first tier: GPT-5.5 leads alone, Grok 4 and Qwen3 Max are tied at 85.83 points, followed closely by Gemini 2.5 Pro at 85.00 points. All four score near-perfect marks on R1 and generally land between 1.53 and 1.67 on R3, demonstrating their ability to maintain initial constraints across three rounds of stress testing.

The second tier spans from DeepSeek V4 Pro to Gemini 3.1 Pro, with scores concentrated between 81 and 82.5 points. Notably, while Doubao Pro scores only 0.77 on R1, it achieves a high R3 score of 1.60, showing it holds onto rules more tenaciously under high pressure.

The tail end consists of just three models: the two Claude models and GPT-o3. Claude Opus 4.7 and Sonnet 4.6 both drop to 1.23 on R3, while GPT-o3 plummets to 0.90; its R3 collapse alone raises the overall average collapse rate by 20%.

Champion Analysis: Why GPT-5.5 Scored 89.17

GPT-5.5's victory hinges on maintaining a 0.90 score during the R2 interference phase and achieving 1.67 on R3. In contrast, most models, once led astray by irrelevant topics in R2, quickly lose ground in R3. GPT-5.5 performs exceptionally well in data boundary and security compliance scenarios, losing points on only 3 out of 30 questions, demonstrating stronger cross-round memory capabilities.
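The article does not publish the scoring formula, but the reported figures are consistent with one simple reading, sketched below. Everything marked ASSUMPTION is a guess made for illustration, not the index's documented method: that R1 and R2 are worth at most 1 point per question and R3 at most 2, that the overall score is points earned as a percentage of points possible, and that GPT-5.5's "near-perfect" R1 is a full 1.00.

```python
from fractions import Fraction

QUESTIONS = 30  # question count cited in the article

# ASSUMPTION: per-question maximum for each round (not stated in the article).
ROUND_MAX = {"R1": 1, "R2": 1, "R3": 2}


def overall_score(points_by_round):
    """Return points earned as a percentage of the assumed maximum."""
    earned = sum(points_by_round.values())
    possible = QUESTIONS * sum(ROUND_MAX.values())
    return float(Fraction(earned, possible) * 100)


# ASSUMPTION: raw point totals chosen so that the per-question averages
# round to the reported values:
#   R1 = 30/30 = 1.00, R2 = 27/30 = 0.90, R3 = 50/30 ~ 1.67
gpt_5_5 = {"R1": 30, "R2": 27, "R3": 50}

print(round(overall_score(gpt_5_5), 2))  # 89.17
```

Under these assumptions the reconstruction lands on the published 89.17, which suggests R3 carries half of the total weight; that would also explain why R3 performance separates the tiers so sharply.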

This shows that top-tier models have evolved "compliance" from superficial instruction-following into an intrinsic contextual priority ordering.

The Bottom Truth: GPT-o3's Systematic Collapse

GPT-o3 started from a low base in the previous round and gained only 5.8 points this time, far below the average improvement of other models. An R3 score of 0.90 means it violates constraints nearly every other time under direct pressure. Particularly in resource limitation and engineering specification scenarios, the model frequently agrees to "break the budget" or "skip code reviews," exposing its fragile memory of initial rules in multi-turn conversations.

The Gap Between Top and Bottom: The Real Divide Behind 18 Points

Behind the 52.4% perfect-score rate, the gap is concentrated in R3. Top models average 1.57 on R3, while tail-end models average just 1.12. Translated to real-world scenarios, choosing the right model can reduce compliance violation risk by nearly 30% for enterprise deployment. Chinese-language models Qwen3 Max and Wenxin Yiyan 4.5 both rank in the top six, proving that domestic models have transitioned from catching up to running alongside in the compliance dimension.
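The tail-end average can be checked directly from the three R3 scores quoted earlier (1.23, 1.23, and 0.90). The short sketch below does that and also computes the relative gap to the reported top-tier average of 1.57. Treating that relative gap as the basis for the "nearly 30%" figure is an assumption; the article does not say how the figure was derived.

```python
from statistics import mean

# R3 scores for the tail-end models, as quoted in the article.
tail_r3 = {
    "Claude Opus 4.7": 1.23,
    "Claude Sonnet 4.6": 1.23,
    "GPT-o3": 0.90,
}

TOP_R3_AVG = 1.57  # top-tier R3 average, as reported (not recomputed here)

tail_avg = mean(tail_r3.values())

# ASSUMPTION: "risk reduction" is read as the relative R3 gap between tiers.
relative_gap = (TOP_R3_AVG - tail_avg) / TOP_R3_AVG

print(f"tail-end R3 average: {tail_avg:.2f}")  # 1.12
print(f"relative gap: {relative_gap:.1%}")     # 28.7%
```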

The biggest improvers, Grok 4 (+35.8 points) and Wenxin Yiyan 4.5 (+32.5 points), both made significant strides on R3.
The Claude series showed the smallest gains, suggesting that its safety alignment strategies become a liability under high-pressure testing.

While the pilot phase is not included in the main leaderboard, it already reveals a harsh reality: parameter scale is no longer linearly correlated with compliance capability; the choice of architecture and training objectives matters more.

If R3 weight continues to increase in the next round, GPT-5.5's lead may widen further, while GPT-o3 will need a complete overhaul of its contextual priority mechanism to turn the tide.

Data Source: YZ Index WDCD Compliance Ranking | Run #161 · Overall Ranking | Evaluation Methodology

