In this WDCD commitment test, Grok 4 ranked first with a score of 100.00 (R1=1.00, R2=1.00, R3=2.00/2), while GPT-5.5 ranked last with a score of 62.50 (R1=1.00, R2=0.50, R3=1.00/2). Across the 11 models tested, the overall perfect-score rate was only 61.8%.
Ranking Pattern: Perfect Score Monopoly and Multi-Level Disparity
The ranking falls into clear tiers. Grok 4 achieved perfect scores across all three rounds, making it the only model with a WDCD score of 100.00. 豆包 Pro (Doubao Pro) followed closely with 92.50 points, scoring 1.90/2 in R3 and demonstrating strong constraint maintenance. The third- to sixth-ranked models (Claude Opus 4.7, Gemini 3.1 Pro, Claude Sonnet 4.6, and Qwen3 Max) all scored between 87.50 and 90.00 points, with R2 scores generally between 0.70 and 0.90, indicating that the interference phase has become a major source of lost points.
Gemini 2.5 Pro, DeepSeek V4 Pro, and 文心一言 4.5 (ERNIE Bot 4.5), ranked seventh to ninth, scored between 82.50 and 85.00, with R3 scores dropping to 1.50-1.70. The tenth- and eleventh-ranked models, GPT-o3 and GPT-5.5, lagged significantly, with R2 scores of only 0.50 and R3 scores of 1.30 and 1.00 respectively, exposing clear weaknesses under sustained pressure.
Champion Analysis: Grok 4's Perfect Three-Round Performance
Grok 4 maintained perfect scores across all three phases (R1 constraint injection, R2 irrelevant topic interference, and R3 direct pressure), demonstrating stable execution across five constraint scenarios: data boundaries, resource limits, business rules, security compliance, and engineering norms. By contrast, 豆包 Pro scored 0.90 in both R1 and R2 and dropped a further 0.10 points in R3 (1.90/2), indicating that even top models loosen slightly under final pressure.
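The round scores and totals reported for Grok 4, 豆包 Pro, and GPT-5.5 are all consistent with a simple aggregation in which R1 and R2 are each scored out of 1, R3 out of 2, and the sum is rescaled to 100. The sketch below assumes that formula. It is inferred from the published numbers, not taken from the evaluation methodology, and the function name `wdcd_score` is hypothetical.

```python
def wdcd_score(r1: float, r2: float, r3: float) -> float:
    """Assumed aggregation: R1 and R2 are out of 1, R3 is out of 2 (double weight)."""
    max_total = 1.0 + 1.0 + 2.0
    return round(100.0 * (r1 + r2 + r3) / max_total, 2)


# Round scores (R1, R2, R3) as reported in the article.
reported = {
    "Grok 4": (1.00, 1.00, 2.00),
    "豆包 Pro": (0.90, 0.90, 1.90),
    "GPT-5.5": (1.00, 0.50, 1.00),
}

scores = {model: wdcd_score(*rounds) for model, rounds in reported.items()}
for model, score in scores.items():
    print(f"{model}: {score:.2f}")
```

Under this assumed formula the three computed totals match the published 100.00, 92.50, and 62.50.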
Reasons for Bottom Ranking: GPT Series Collapse in Both R2 and R3
GPT-5.5 and GPT-o3 share one trait: an R2 score of only 0.50, far below the 0.70-0.90 range of most other models. In R3, GPT-5.5 scored only 1.00/2 and GPT-o3 1.30/2, trailing Grok 4's 2.00/2 by 1.00 and 0.70 points respectively. The overall R3 collapse rate of 12.7% likewise points to the direct pressure phase as the biggest risk to model commitment.
Gap Between Top and Bottom: A Real Difference of 37.5 Points
The 37.5-point gap between Grok 4 and GPT-5.5 comes entirely from R2 and R3, since both models scored 1.00 in R1. Top models lost an average of less than 0.30 points across the interference and pressure phases, while the two bottom models lost 1.20 (GPT-o3) and 1.50 (GPT-5.5) points. Claude Opus 4.7 improved by 25.0 points compared to the previous period, and 豆包 Pro improved by 20.0 points, suggesting that some models have made progress in R3 through targeted optimization. The GPT series has not shown a recovery of similar magnitude.
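To see how the 37.5 points split between rounds, the gap can be broken down by round. This is a minimal sketch that assumes the total is 100 × (R1 + R2 + R3) / 4, with R3 scored out of 2. That formula is inferred from the reported scores rather than quoted from the methodology.

```python
# Assumed maximum raw total: R1 (1) + R2 (1) + R3 (2).
MAX_TOTAL = 4.0

# Round scores as reported for the top and bottom models.
top = {"R1": 1.00, "R2": 1.00, "R3": 2.00}     # Grok 4
bottom = {"R1": 1.00, "R2": 0.50, "R3": 1.00}  # GPT-5.5

# Contribution of each round to the gap, on the 100-point scale.
gap_by_round = {
    name: round(100.0 * (top[name] - bottom[name]) / MAX_TOTAL, 2)
    for name in top
}
total_gap = sum(gap_by_round.values())

for name, gap in gap_by_round.items():
    print(f"{name}: {gap:.1f} points")
print(f"Total: {total_gap:.1f} points")
```

On these assumptions R1 contributes nothing, R2 contributes 12.5 points, and R3 contributes 25.0 points, so the pressure phase accounts for two-thirds of the gap.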
This evaluation is in a pilot phase and does not count toward the main ranking, but it already covers 10 real enterprise scenario questions. Scoring is rule-based, which supports the objectivity of the results. Because R3 carries double weight, its 12.7% collapse rate weighs heavily on the final ranking distribution.
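The 61.8% perfect-score rate and the 12.7% R3 collapse rate cannot be whole-number shares of 11 models, but both fit whole-number counts if the rates are taken over individual model-question runs. The sketch below checks that reading, assuming one scored run per model per question (11 × 10 = 110 runs). The resulting counts are inferred for illustration and are not reported by the source. The helper `matching_counts` is hypothetical.

```python
def matching_counts(rate_percent: float, total: int) -> list[int]:
    """Return every whole-number count whose share of `total` rounds to rate_percent."""
    return [
        k for k in range(total + 1)
        if round(100.0 * k / total, 1) == rate_percent
    ]


MODELS, QUESTIONS = 11, 10
RUNS = MODELS * QUESTIONS  # assumption: one scored run per model per question

perfect_over_models = matching_counts(61.8, MODELS)
perfect_over_runs = matching_counts(61.8, RUNS)
collapse_over_runs = matching_counts(12.7, RUNS)

print("61.8% as a share of 11 models:", perfect_over_models)
print("61.8% as a share of 110 runs:", perfect_over_runs)
print("12.7% as a share of 110 runs:", collapse_over_runs)
```

No count of models out of 11 rounds to 61.8%, whereas 68 of 110 runs gives 61.8% and 14 of 110 runs gives 12.7%, which is why the per-run reading is the more plausible one.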
Constraint maintenance capability is becoming a core metric for distinguishing next-generation models. Grok 4's perfect performance may herald a new standard in engineering specification scenarios.
Data source: YZ Index WDCD Commitment Ranking | Run #207 · Overall Ranking | Evaluation Methodology
© 2026 Winzheng.com 赢政天下 | When republishing, please credit the source and include a link to the original article.