Grok 4 Scores Perfect 100 to Dominate WDCD Commitment Ranking, GPT-5.5 Trails with Only 62.5 Points

Jul 1, 2026 · Source: Winzheng Index

WDCD Compliance Test · Model Leaderboard · AI Compliance Constraint Maintenance Capability

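In this WDCD commitment test, Grok 4 ranked first with a score of 100.00 (R1=1.00, R2=1.00, R3=2.00/2), while GPT-5.5 ranked last with 62.50 points (R1=1.00, R2=0.50, R3=1.00/2). The overall perfect-score rate was only 61.8%, a figure that appears to be measured over individual model-question results rather than over the 11 models, since Grok 4 was the only model with a perfect total.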
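The phase scores quoted above are consistent with a simple aggregation in which the three phase scores are summed and divided by the 4 available points, with R3 counting double. The sketch below assumes that formula. It is inferred from the numbers in this article, not taken from the official evaluation methodology, and the function name `wdcd_score` is hypothetical.

```python
def wdcd_score(r1: float, r2: float, r3: float) -> float:
    """Assumed aggregation: R1 and R2 are worth 1 point each, R3 is worth 2."""
    max_points = 1.0 + 1.0 + 2.0
    return round(100.0 * (r1 + r2 + r3) / max_points, 2)


# Phase scores quoted in the article
print(wdcd_score(1.00, 1.00, 2.00))  # Grok 4   -> 100.0
print(wdcd_score(1.00, 0.50, 1.00))  # GPT-5.5  -> 62.5
print(wdcd_score(0.90, 0.90, 1.90))  # 豆包 Pro -> 92.5
```

All three results match the published totals, which supports the assumed formula but does not confirm it.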

Ranking Pattern: Perfect Score Monopoly and Multi-Level Disparity

The ranking falls into clear tiers. Grok 4 achieved perfect scores across all three rounds, making it the only model with a WDCD score of 100.00. 豆包 Pro followed closely with 92.50 points, scoring 1.90/2 in R3 and demonstrating strong constraint maintenance capability. The third- to sixth-ranked models (Claude Opus 4.7, Gemini 3.1 Pro, Claude Sonnet 4.6, and Qwen3 Max) all scored between 87.50 and 90.00 points, with R2 scores generally between 0.70 and 0.90, indicating that the interference phase has become a major source of lost points.

The seventh- to ninth-ranked models (Gemini 2.5 Pro, DeepSeek V4 Pro, and 文心一言 4.5) scored between 82.50 and 85.00, with R3 scores dropping to 1.50-1.70. The tenth and eleventh models, GPT-o3 and GPT-5.5, lagged significantly, with R2 scores of only 0.50 and R3 scores of 1.30 and 1.00 respectively, exposing clear weaknesses under sustained pressure.

Champion Analysis: Grok 4's Perfect Three-Round Performance

Grok 4 maintained perfect scores across R1 constraint injection, R2 irrelevant topic interference, and R3 direct pressure, demonstrating stable execution across five constraint scenarios (data boundaries, resource limits, business rules, security compliance, and engineering norms). In contrast, 豆包 Pro scored 0.90 in both R1 and R2 and 1.90/2 in R3, losing 0.10 points in each phase, which indicates that even top models loosen slightly, including under final pressure.

Reasons for Bottom Ranking: GPT Series Collapse in Both R2 and R3

GPT-5.5 and GPT-o3 share an R2 score of only 0.50, far below the 0.70-0.90 range typical of the mid-ranked models. In the R3 phase, GPT-5.5 scored only 1.00/2 and GPT-o3 scored 1.30/2, which is 1.00 and 0.70 points below Grok 4 respectively. The overall R3 collapse rate of 12.7% is consistent with the direct pressure phase being the biggest risk point for model commitment.

Gap Between Top and Bottom: A Real Difference of 37.5 Points

The 37.5-point gap between Grok 4 and GPT-5.5 comes entirely from R2 and R3, since both models scored 1.00 in R1. Top models lost an average of less than 0.30 points in the interference and pressure phases, while the two bottom models lost 1.20 points (GPT-o3) and 1.50 points (GPT-5.5). Claude Opus 4.7 improved by 25.0 points compared to the previous period, and 豆包 Pro improved by 20.0 points, suggesting that some models have made progress in R3 through targeted optimization. However, the GPT series has not shown a similar magnitude of recovery.
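To see where the gap arises, the phase scores of the two models can be compared one phase at a time. This is a minimal sketch that assumes the total is the sum of phase scores divided by 4 available points (an inference from the published numbers, not a documented formula). `gap_by_phase` and `MAX_POINTS` are illustrative names.

```python
# Phase scores quoted in the article: (R1, R2, R3), with R3 on a 0-2 scale
grok_4 = (1.00, 1.00, 2.00)
gpt_5_5 = (1.00, 0.50, 1.00)

MAX_POINTS = 4.0  # assumed: 1 (R1) + 1 (R2) + 2 (R3)


def gap_by_phase(top, bottom):
    """Return each phase's share of the score gap, in leaderboard points."""
    return [round(100.0 * (t - b) / MAX_POINTS, 2) for t, b in zip(top, bottom)]


contributions = gap_by_phase(grok_4, gpt_5_5)
print(contributions)       # [0.0, 12.5, 25.0]
print(sum(contributions))  # 37.5
```

Under this assumption, R1 contributes nothing, R2 contributes 12.5 points, and R3 contributes 25 points, two thirds of the total gap.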

This is a pilot phase and does not count toward the main ranking, but it already covers 10 real enterprise scenario questions. Scoring is rule-based, which limits subjective judgment in the results. Because R3 carries double weight, its 12.7% collapse rate has an outsized effect on the final ranking distribution.
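The article does not state the denominator behind its percentage figures. One plausible reading, assumed here and not confirmed by the source, is that rates are computed over all model-question pairs, that is 11 models times 10 questions. The sketch below checks whether the 61.8% perfect-score rate and the 12.7% R3 collapse rate correspond to whole numbers of runs under that assumption. The helper names are hypothetical.

```python
MODELS = 11
QUESTIONS = 10
RUNS = MODELS * QUESTIONS  # 110 model-question pairs (assumed denominator)


def implied_count(rate_percent: float, total: int = RUNS) -> int:
    """Nearest whole number of runs that would produce the reported rate."""
    return round(rate_percent / 100.0 * total)


def as_percent(count: int, total: int = RUNS) -> float:
    """Convert a run count back to a percentage with one decimal place."""
    return round(100.0 * count / total, 1)


perfect = implied_count(61.8)
collapsed = implied_count(12.7)
print(perfect, as_percent(perfect))      # 68 61.8
print(collapsed, as_percent(collapsed))  # 14 12.7
```

Both rates round-trip exactly, which fits 68 perfect runs and 14 R3 collapses out of 110, though other denominators cannot be ruled out from the article alone.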

Constraint maintenance capability is becoming a core metric for distinguishing next-generation models. Grok 4's perfect performance may herald a new standard in engineering specification scenarios.

Data source: YZ Index WDCD Commitment Ranking | Run #207 · Overall Ranking | Evaluation Methodology
