GPT-5.5 Tops at 88.33 Points, GPT-o3 Trails at 61.67 Points, R3 Collapse Rate 22.1%

The three rounds of dialogue in the WDCD Compliance Test are precisely designed to hit the weakest points of models: R1 injects constraints, R2 introduces irrelevant distractions, and R3 applies direct pressure. The final results show GPT-5.5 topping the chart with 88.33 points (R3 1.67/2), while GPT-o3 trails at the bottom with 61.67 points (R3 only 0.73/2), creating a gap of 26.66 points between the leader and the tail. The overall R3 collapse rate of 22.1% exposes the true obedience capability of current large models under sustained pressure.

Top Three Landscape: R3 Score Decides the Winner

GPT-5.5, Gemini 3.1 Pro, and Claude Sonnet 4.6 form the first tier. All three achieve near-perfect scores in R1, so the differences among them come mainly from R2 and R3. GPT-5.5 scores 0.87 in R2 and 1.67 in R3, showing that it retains most constraints even after interference from irrelevant topics. Gemini 3.1 Pro scores slightly higher in R2 (0.90) but slightly lower in R3 (1.60). Claude Sonnet 4.6 scores only 0.97 in R1, which indicates occasional loosening as early as the initial constraint injection phase, yet it still holds 1.53 in R3, reflecting strong resistance to pressure.

The common characteristic of top models is that their R3 scores all exceed 1.5, which directly widens the gap from the mid-range.

Mid-Range Competition and Version Generation Gaps

DeepSeek V4 Pro, Grok 4, and Qwen3 Max all cluster in the 81-point range. All three achieve perfect scores in R1, but their R2 scores fall to 0.77, 0.80, and 0.73 respectively, showing that this group still has shortcomings in the anti-interference phase. ERNIE 4.5 (文心一言4.5) and Doubao Pro (豆包 Pro) score 77.5 and 75 points respectively, with R3 scores of 1.30 and 1.47, indicating that concessions are already appearing in the high-pressure phase.

The version comparisons are the most striking: Gemini 2.5 Pro dropped 11.7 points from the previous period and GPT-o3 dropped 9.2 points, while Gemini 3.1 Pro rose 5.8 points and Claude Sonnet 4.6 rose 6.7 points. This indicates that compliance does not improve linearly from one version to the next within a model series; instead, it fluctuates significantly between generations.

Tail Truth: R3 Score Below 1 Means Collapse

GPT-o3 has the lowest R3 score at 0.73. Claude Opus 4.7, at 0.97, also falls below 1 and sits in the danger zone. Taken together with the overall R3 collapse rate of 22.1%, this means that under direct third-round pressure the tail-end models retain only about half of the available R3 points or fewer (0.97/2 and 0.73/2), far below the stability required for real enterprise deployment.
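The thresholds used in this report can be written down as a small sketch. The cut-offs come from the text (below 1 counts as collapse, above 1.5 marks the top tier), and the R3 values are the ones quoted in this article. The zone labels and function names are hypothetical and are not part of the WDCD methodology.

```python
# R3 is scored out of 2 in this report (e.g. "R3 1.67/2").
R3_MAX = 2.0
COLLAPSE_BELOW = 1.0  # from the heading "R3 Score Below 1 Means Collapse"
TOP_TIER_ABOVE = 1.5  # the report notes that top models all exceed 1.5

# R3 scores quoted in this article only; other models are omitted.
r3_scores = {
    "GPT-5.5": 1.67,
    "Gemini 3.1 Pro": 1.60,
    "Claude Sonnet 4.6": 1.53,
    "Doubao Pro": 1.47,
    "ERNIE 4.5": 1.30,
    "Claude Opus 4.7": 0.97,
    "GPT-o3": 0.73,
}


def r3_zone(score: float) -> str:
    """Map an R3 score to a zone label (labels are illustrative)."""
    if score < COLLAPSE_BELOW:
        return "collapse"
    if score > TOP_TIER_ABOVE:
        return "top"
    return "middle"


def retained_share(score: float) -> float:
    """Fraction of the available R3 points a model kept."""
    return score / R3_MAX


for model, score in r3_scores.items():
    print(f"{model}: {r3_zone(score)}, {retained_share(score):.0%} of R3 retained")
```

Under these assumptions, both GPT-o3 and Claude Opus 4.7 land in the collapse zone, each retaining less than half of the R3 points.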

  • The full-score rate is only 43.6%, meaning over half of the models fail in at least one constraint scenario.
  • R3 accounts for 50% of the total score but contributes all major point losses.
  • Scenarios related to safety compliance and engineering standards have the highest collapse rates, far exceeding those related to data boundaries.
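The 50% weight of R3 can be illustrated with a minimal sketch. It assumes that R1 and R2 are each worth up to 1 point, that R3 is worth up to 2, and that the total is rescaled to 100. This weighting is inferred from the figures quoted in this article and is not confirmed by the published methodology. The function names are hypothetical, and GPT-5.5's R1 is taken as 1.0 because the text describes it only as near-perfect.

```python
# Assumed maximum points per round (inferred, not an official formula).
ROUND_MAX = {"r1": 1.0, "r2": 1.0, "r3": 2.0}


def composite_score(r1: float, r2: float, r3: float) -> float:
    """Rescale the sum of round scores to a 0-100 total."""
    return 100.0 * (r1 + r2 + r3) / sum(ROUND_MAX.values())


def r3_weight() -> float:
    """Share of the total score carried by R3 under the assumed maxima."""
    return ROUND_MAX["r3"] / sum(ROUND_MAX.values())


def loss_by_round(r1: float, r2: float, r3: float) -> dict:
    """Points lost in each round relative to its assumed maximum."""
    scores = {"r1": r1, "r2": r2, "r3": r3}
    return {name: ROUND_MAX[name] - scores[name] for name in ROUND_MAX}


# GPT-5.5: R2 = 0.87 and R3 = 1.67 are from the article; R1 = 1.0 is assumed.
print(round(composite_score(1.0, 0.87, 1.67), 2))
print(r3_weight())
print(loss_by_round(1.0, 0.87, 1.67))
```

With these inputs the composite comes out at roughly 88.5, close to the reported 88.33. The small difference is presumably due to rounding in the published per-round averages. Even for the top model, R3 accounts for the larger share of the points lost (about 0.33 against 0.13 in R2).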

Core Assessment

The WDCD test reveals that chasing context length or instruction-following benchmarks alone is no longer enough to measure real usability. GPT-5.5's lead stems from its ability to maintain constraints under continuous interference across R2 and R3, while GPT-o3's last place exposes how quickly it forgets constraints in multi-round adversarial scenarios. The 26.66-point gap between the top and bottom essentially reflects how differently models handle "sustained obedience", a core enterprise requirement.

Pilot phase results are not included in the main leaderboard, but they clearly outline the threshold that next-generation models must cross: an R3 score consistently above 1.6 is the minimum requirement for entering production environments.

Prediction: In the next round of testing, optimizing R3 will become the primary alignment goal for all vendors. Current bottom models that cannot raise their R3 score above 1.2 may face the risk of being phased out of mainstream enterprise scenarios.


Data Source: YZ Index WDCD Compliance Leaderboard | Run #164 · Overall Ranking | Methodology