Grok 4 Leads with 74.22 Points, GPT-o3 Trails at 51.56 Points — WDCD Gap of 22.66

Grok 4 tops the WDCD compliance test with 74.22 points, while GPT-o3 finishes last at 51.56 points, a gap of 22.66 points.

The current WDCD leaderboard shows clear polarization. Grok 4 scores 1.22 points in R3, ahead of second-place Qwen3 Max at 1.09 points and far above last-place GPT-o3 at 0.25 points. R3 is worth up to 2 points, half of the maximum total, so it largely determines the final ranking. Gemini 2.5 Pro scored a perfect 1.00 in R1, but its R3 score of 0.97 points left it in third place.
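The weighting can be checked against the published totals. The sketch below assumes, based only on the statement that R3's 2 points are half of the total, that R1 and R2 are worth 1 point each and that the 100-point total is the summed round score rescaled from a 4-point maximum. The function name wdcd_total and the formula itself are illustrative assumptions, not the leaderboard's documented method.

```python
# Assumed maxima per round, inferred from the article: R1 and R2 are worth
# 1 point each, R3 is worth 2 points (half of the 4-point total).
ROUND_MAX = {"R1": 1.0, "R2": 1.0, "R3": 2.0}


def wdcd_total(r1: float, r2: float, r3: float) -> float:
    """Rescale the summed round scores to a 0-100 total (assumed formula)."""
    return (r1 + r2 + r3) / sum(ROUND_MAX.values()) * 100


# Round scores as quoted in the article (already rounded to two decimals).
grok4 = wdcd_total(0.97, 0.78, 1.22)   # published total: 74.22
gpt_o3 = wdcd_total(1.00, 0.81, 0.25)  # published total: 51.56

print(f"Grok 4: {grok4:.2f}, GPT-o3: {gpt_o3:.2f}, gap: {grok4 - gpt_o3:.2f}")
```

Under this assumption the reconstructed totals land within about 0.1 points of the published 74.22 and 51.56. The small difference is consistent with the per-round scores having been rounded before publication.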

Score Breakdown of the Champion and Last-Place Models

Grok 4 maintained a consistently high level across the three rounds: R1 0.97, R2 0.78, R3 1.22. Last-place GPT-o3 scored 1.00 in R1 and 0.81 in R2, but dropped to only 0.25 in R3, indicating that while it could maintain constraints in the first two rounds, it quickly failed after direct pressure was applied in the third round. Claude Opus 4.7 also had only 0.34 points in R3, placing it alongside GPT-o3 at the bottom.

The top three models (Grok 4, Qwen3 Max, and Gemini 2.5 Pro) have an average R3 score of 1.09 points, while the bottom two (Claude Opus 4.7 and GPT-o3) average only 0.295 points in R3. The top group's average is roughly 3.7 times that of the bottom group.
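The two averages and their ratio follow directly from the R3 scores quoted in this article. A minimal check, using only those figures:

```python
from statistics import mean

# R3 scores as quoted in the article.
top_three_r3 = {"Grok 4": 1.22, "Qwen3 Max": 1.09, "Gemini 2.5 Pro": 0.97}
bottom_two_r3 = {"Claude Opus 4.7": 0.34, "GPT-o3": 0.25}

top_r3_avg = mean(top_three_r3.values())
bottom_r3_avg = mean(bottom_two_r3.values())
r3_ratio = top_r3_avg / bottom_r3_avg

print(f"top: {top_r3_avg:.3f}, bottom: {bottom_r3_avg:.3f}, ratio: {r3_ratio:.2f}")
```

The ratio comes out at about 3.7, which is the basis for describing the gap as close to fourfold.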

R3 Collapse Rate and Overall Performance

Global statistics show an R3 collapse rate of 47.7% and a perfect score rate of only 19.3%. In other words, in nearly half of the evaluated cases the initial constraints did not survive direct pressure in the third round. Claude Sonnet 4.6 also scored a perfect 1.00 in R1, but only 0.69 in R3, finishing eighth overall. Strong early performance does not prevent weak stress resistance from dragging down the total score.

All 11 participating models saw a decline in scores compared to the previous period. Among them, GPT-5.5 dropped 23.5 points, Claude Sonnet 4.6 dropped 23.2 points, and Gemini 3.1 Pro dropped 22.7 points. Of the three models with the largest declines, two had R3 scores below 0.70, which suggests that performance under pressure is the main factor driving the score decreases.

Gap Between the Top Tier and the Bottom Tier

The WDCD score range for the top four models is 74.22 to 64.84, while the bottom four range from 60.16 to 51.56. In R2 the top models average 0.69 points, compared to 0.735 points for the bottom models, a small gap that slightly favors the bottom tier. In R3, however, the top models average 1.11 points while the bottom models average only 0.52 points, and the gap widens sharply.
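To put the two round gaps on the same scale as the leaderboard totals, the sketch below converts each tier gap into points on the 100-point total. It assumes the total is the summed round score rescaled from a 4-point maximum, which is inferred from R3's 2 points being described as half of the total and is not a documented formula. The variable names are illustrative.

```python
# Assumed 4-point maximum (R1 = 1, R2 = 1, R3 = 2), inferred from the article.
TOTAL_MAX = 4.0

# Tier averages as quoted in the article.
tier_top = {"R2": 0.69, "R3": 1.11}
tier_bottom = {"R2": 0.735, "R3": 0.52}

# Raw gap per round (top minus bottom); negative means the bottom tier leads.
tier_gap = {rnd: tier_top[rnd] - tier_bottom[rnd] for rnd in tier_top}

# The same gap expressed in points on the 100-point total (assumed scaling).
gap_on_total = {rnd: g / TOTAL_MAX * 100 for rnd, g in tier_gap.items()}

for rnd in tier_gap:
    print(f"{rnd}: raw gap {tier_gap[rnd]:+.3f}, on total {gap_on_total[rnd]:+.2f}")
```

Under that assumption, the R2 difference moves the total by about one point in the bottom tier's favor, while the R3 difference is worth nearly 15 points to the top tier.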

Ernie 4.5 and Gemini 3.1 Pro both score 64.84 points, but Ernie 4.5 achieves 1.16 points in R3, higher than Gemini 3.1 Pro's 0.97 points, indicating differences in stress resistance despite the same total score. Although Doubao Pro reached 0.72 points in R2, higher than many models


Data source: YZ Index WDCD Compliance Leaderboard | Run #169 · Overall Rankings | Evaluation Methodology