Claude Tops WDCD Compliance Leaderboard with 65 Points, DeepSeek Falls 12.5 Points to the Bottom

May 20, 2026 · Winzheng Index

In this WDCD compliance test, Claude Opus 4.7 took first place with 65.00 points, while DeepSeek V4 Pro finished last with only 47.50 points, a gap of 17.5 points between top and bottom. The overall R3 collapse rate was 77.3%, indicating that the vast majority of models yield under intense questioning.

Ranking Landscape: Polarization Intensifies

The top four models all scored between 57.5 and 65 points, forming a first tier: Claude Opus, Claude Sonnet, Doubao Pro, and Gemini 2.5 Pro. Positions five through nine are crowded into a narrow band of 52.5 to 57.5 points. Qwen3 Max and Gemini 2.5 Pro share the same total score, but Qwen3 Max's R3 score is only 0.30, a collapse after its perfect R2. The tenth- and eleventh-place finishers, Grok and DeepSeek, fell below 50 points and now sit in a clearly lagging zone.

Champion Analysis: Why Claude Maintains Compliance

Claude Opus scored R1=1.00, R2=0.90, R3=0.70, making it the only model with an R3 score above 0.6. In both the engineering specification and safety compliance scenarios, it consistently refused boundary-crossing requests across three consecutive rounds. By comparison, GPT-5.5's R1 and R2 scores were close to Claude's, but its R3 was only 0.20, which left it 12.5 points behind in total score.

Claude's R3 performance is no accident: its built-in refusal mechanism remains consistent even under intense questioning.
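The scores quoted in this section are consistent with each round contributing up to 25 points, that is, total = 25 × (R1 + R2 + R3). The article does not state this formula; it is inferred here from the published numbers, so treat the sketch below as a plausibility check under that assumption rather than the official WDCD scoring method. The function name `total_score` and the constant `POINTS_PER_ROUND` are hypothetical.

```python
# Assumed weighting: each round contributes up to 25 points.
# This is inferred from the article's numbers, not a published formula.
POINTS_PER_ROUND = 25.0


def total_score(r1: float, r2: float, r3: float) -> float:
    """Combine the three round scores into a total (assumed formula)."""
    return round(POINTS_PER_ROUND * (r1 + r2 + r3), 2)


# Claude Opus, using the round scores quoted in the article.
claude_opus = total_score(1.00, 0.90, 0.70)

# Hypothetical comparison: same R1 and R2, but R3 lowered to 0.20.
# Holding R1 and R2 fixed isolates the effect of the R3 score.
same_but_r3_020 = total_score(1.00, 0.90, 0.20)

print(claude_opus)                    # 65.0
print(claude_opus - same_but_r3_020)  # 12.5
```

Under this assumed weighting, a 0.50 difference in R3 alone accounts for a 12.5-point gap, which matches the lag the article reports for GPT-5.5.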

Last-Place DeepSeek: Why the Steep Drop

DeepSeek V4 Pro fell 12.5 points from the previous period, one of the steepest declines this round. Its R3 score was only 0.10, meaning it gave way in almost every case under direct pressure. In the data boundary and resource constraint scenarios, it began providing sensitive parameters after the R2 interference, exposing weak control over context decay.

Top Tier vs. Bottom Gap

The average R3 score for top-tier models was 0.55, compared to only 0.15 for bottom-tier models: a gap of 0.40, with the top tier scoring roughly 3.7 times the bottom tier.
In the R2 stage, Qwen3 Max achieved a perfect score of 1.00, but collapsed in R3, showing that the "agree first, then renege" strategy is not robust in compliance testing.
ERNIE 4.5 was the only model that did not achieve a full score in R1, losing points right from the initial compliance injection stage, making it impossible to enter the top three.
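The tier comparison above can be checked with simple arithmetic. The sketch below uses only the two R3 averages quoted in the text; the variable names are illustrative.

```python
# Average R3 scores per tier, as quoted in the article.
top_tier_r3 = 0.55
bottom_tier_r3 = 0.15

ratio = top_tier_r3 / bottom_tier_r3   # how many times higher the top tier scores
gap = top_tier_r3 - bottom_tier_r3     # absolute difference in R3 score

print(f"ratio: {ratio:.2f}x, gap: {gap:.2f}")  # ratio: 3.67x, gap: 0.40
```

Read as a ratio, the top tier scores about 3.67 times the bottom tier. Read as a difference, the 0.40 gap is about 2.67 times the bottom-tier average.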

The overall perfect score rate was only 11.8%, further confirming that "compliance" remains a major weakness of current large models. In the R3 direct pressure stage, 77.3% of models chose to compromise, reflecting that commercial models generally lack sustained refusal ability under firm user demands.

Harsh Reality Compared to the Previous Period

GPT-5.5 plummeted 19.2 points in a single period, the two Gemini models declined by 6.7 and 8.3 points respectively, and Qwen3 Max also dropped 10 points. The only ones that remained stable were the Claude twins, indicating that their compliance mechanism has formed a generational advantage.
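The reported drops imply previous-period totals that the article does not state directly. As a rough sketch, they can be recovered by adding each drop back to the current total. GPT-5.5's current total is itself derived here, as Claude Opus's 65.00 minus the 12.5-point lag mentioned earlier, so the results are implied values rather than published figures.

```python
# Current totals and period-over-period drops as quoted in the article.
# GPT-5.5's current total is derived: 65.0 (Claude Opus) minus its 12.5-point lag.
current = {"DeepSeek V4 Pro": 47.5, "GPT-5.5": 65.0 - 12.5}
drop = {"DeepSeek V4 Pro": 12.5, "GPT-5.5": 19.2}

# Implied previous-period total = current total + reported drop.
previous = {name: round(current[name] + drop[name], 1) for name in current}

for name, score in previous.items():
    print(f"{name}: {score:.1f} -> {current[name]:.1f}")
# DeepSeek V4 Pro: 60.0 -> 47.5
# GPT-5.5: 71.7 -> 52.5
```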

Although this pilot phase is not yet included in the main leaderboard, it clearly outlines a new dimension of model competition in 2026: the question is not who runs faster, but who never says "the wrong words" across three rounds of conversation.

Prediction: Without targeted RLHF reinforcement in the next period, DeepSeek and Grok will likely still hover below 50 points, while the Claude family will continue to hold the top two positions.

Data source: YZ Index WDCD Compliance Leaderboard | Run #125 · Overall Ranking | Evaluation Methodology
