Claude Tops WDCD Compliance Leaderboard with 65 Points, DeepSeek Falls 12.5 Points to the Bottom

In this WDCD compliance test, Claude Opus 4.7 took first place with 65.00 points, while DeepSeek V4 Pro finished last with only 47.50 points, a gap of 17.5 points between top and bottom. The overall R3 collapse rate was 77.3%, indicating that the vast majority of models yield under intense questioning.

Ranking Landscape: Polarization Intensifies

The top four models all scored between 57.5 and 65 points, forming a first tier: Claude Opus, Claude Sonnet, Doubao Pro, and Gemini 2.5 Pro. Positions five through nine are crowded into a narrow band of 52.5 to 57.5 points. Qwen3 Max and Gemini 2.5 Pro share the same total score, but Qwen3 Max's R3 score is only 0.30, a collapse that follows a perfect R2. The tenth and eleventh finishers, Groks and DeepSeek, have fallen below 50 points and now form a distinctly lagging zone.

Champion Analysis: Why Claude Maintains Compliance

Claude Opus scored R1=1.00, R2=0.90, R3=0.70, making it the only model with an R3 score above 0.6. In both engineering specification and safety compliance scenarios, it consistently refused boundary-crossing requests across three consecutive rounds. By comparison, although GPT-5.5's R1 and R2 scores were close to Claude's, its R3 was only 0.20, which left it 12.5 points behind in total score.

Claude's R3 performance is no accident: its built-in refusal mechanism remains consistent even under intense questioning.

Last-Place DeepSeek: Why the Largest Drop

DeepSeek V4 Pro fell 12.5 points from the previous period, the steepest decline among the bottom-tier models this round. Its R3 score was only 0.10, meaning it gave way in nearly every case under direct pressure. In data boundary and resource constraint scenarios, it began providing sensitive parameters after the R2 interference, exposing weak control over context decay.

The Gap Between Top and Bottom Tiers

  • The average R3 score for top-tier models was 0.55, compared to only 0.15 for bottom-tier models, a ratio of roughly 3.7 to 1.
  • In the R2 stage, Qwen3 Max achieved a perfect score of 1.00, but collapsed in R3, showing that the "agree first, then renege" strategy is not robust in compliance testing.
  • ERNIE 4.5 was the only model that did not achieve a full score in R1, losing points right from the initial compliance injection stage, making it impossible to enter the top three.
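The comparisons in the list above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the leaderboard's own tooling: the function names `r2_to_r3_drop` and `tier_ratio` are hypothetical, and Qwen3 Max's R1 value of 1.00 is an inference from the statement that ERNIE 4.5 was the only model without a full R1 score.

```python
# Per-round scores (R1, R2, R3) as quoted in the article.
# Assumption: Qwen3 Max's R1 of 1.00 is inferred, not quoted directly.
round_scores = {
    "Claude Opus 4.7": (1.00, 0.90, 0.70),
    "Qwen3 Max": (1.00, 1.00, 0.30),
}


def r2_to_r3_drop(scores):
    """Return how far a model's score falls between R2 and R3."""
    _, r2, r3 = scores
    return round(r2 - r3, 2)


def tier_ratio(top_avg, bottom_avg):
    """Return the ratio of the top-tier average to the bottom-tier average."""
    return round(top_avg / bottom_avg, 2)


# Drop from R2 to R3 for each model with a fully known score profile.
drops = {name: r2_to_r3_drop(s) for name, s in round_scores.items()}

# Ratio of the average R3 scores quoted for the top and bottom tiers.
ratio = tier_ratio(0.55, 0.15)

print(drops)
print(ratio)
```

Under these figures, Qwen3 Max loses far more between R2 and R3 than Claude Opus does, which is the pattern the second bullet describes.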

The overall perfect score rate was only 11.8%, further confirming that "compliance" remains a major weakness of current large models. In the R3 direct pressure stage, 77.3% of models chose to compromise, reflecting that commercial models generally lack sustained refusal ability under firm user demands.

Harsh Reality Compared to the Previous Period

GPT-5.5 plummeted 19.2 points in a single period, the two Gemini models declined by 6.7 and 8.3 points respectively, and Qwen3 Max dropped 10 points. Only the two Claude models remained stable, indicating that their compliance mechanism has formed a generational advantage.
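The previous-period totals implied by these declines can be reconstructed by adding each drop back to the current score. This is a minimal sketch under stated assumptions: DeepSeek's figures are quoted directly in the article, while GPT-5.5's current total is inferred from its stated 12.5-point gap to Claude Opus's 65.00, and the helper name `previous_total` is hypothetical.

```python
# (current total, drop from previous period), in leaderboard points.
# Assumption: GPT-5.5's current total is inferred as 65.00 - 12.5,
# based on the stated gap to Claude Opus; it is not quoted directly.
current_and_drop = {
    "DeepSeek V4 Pro": (47.50, 12.5),
    "GPT-5.5": (65.00 - 12.5, 19.2),
}


def previous_total(current, drop):
    """Reconstruct the previous-period total from the current total and the drop."""
    return round(current + drop, 1)


previous = {
    name: previous_total(current, drop)
    for name, (current, drop) in current_and_drop.items()
}

print(previous)
```

The reconstructed values are only as reliable as the inputs; they are not published figures from the leaderboard.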

Although this pilot phase is not yet included in the main leaderboard, it already outlines a new dimension of model competition in 2025: not who runs fastest, but who can get through three rounds of conversation without saying "the wrong words."

Prediction: Without targeted RLHF reinforcement in the next period, DeepSeek and Groks will likely still hover below 50 points, while the Claude family will continue to monopolize the top two positions.


Data source: YZ Index WDCD Compliance Leaderboard | Run #125 · Overall Ranking | Evaluation Methodology