WDCD Commitment Ranking: GPT-5.5 Dominates with 71.67 Points, Grok 4 Trails at 52.5 Points

The WDCD Commitment Test probes how well models hold to stated constraints across three rounds of dialogue. GPT-5.5 leads with 71.67 points, while Grok 4 scores only 52.5 and ranks last, a gap of 19.17 points between top and bottom.

Ranking Landscape: Top-Five Dominance and Widening Gaps

The scores of the 11 models in this test fall into clear tiers. The top three, GPT-5.5 (71.67), Qwen3 Max (67.50), and Claude Opus 4.7 (66.67), form the first echelon, averaging 0.99 in R1, 0.92 in R2, and 0.83 in R3. Places four through eight hover between 60 and 66 points, with R3 scores dropping to 0.47–0.70. From ninth place on, scores fall sharply: 豆包 Pro and 文心一言 4.5 score only 56.67 and 55 points respectively, and Grok 4 finishes last at 52.5.
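The tiering described above can be sketched as a simple classifier. The cutoffs below (roughly 66.5 and 60 points) are read off the article's description of the echelons; they are assumptions for illustration, not official thresholds.

```python
def tier(score: float) -> str:
    """Assign a model to a tier by its overall WDCD score.

    Cutoffs are inferred from the ranking's description
    (top three >= ~66.67, places 4-8 between 60 and 66,
    bottom three below 60); they are not official thresholds.
    """
    if score >= 66.5:
        return "top"
    if score >= 60.0:
        return "middle"
    return "bottom"
```

For example, `tier(71.67)` places GPT-5.5 in the top echelon, while `tier(52.5)` places Grok 4 in the bottom one.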

Champion GPT-5.5: Nearly Perfect Across Three Rounds

GPT-5.5 has the most balanced score composition: a perfect 1.00 in both R1 and R2, and 0.87/2 in R3. It commits zero violations in the data boundary and security compliance scenarios, and keeps its constraint retention rate above 82% even under direct pressure in R3, a clear technical edge in context-decay control.

Last-Place Grok 4: Complete Collapse in R3

Grok 4's R1 and R2 scores are respectable (1.00 and 0.97), but in R3 it scores just 0.13/2, meaning it breaks its constraints on nearly every probe once directly pressured. The engineering-norms and resource-limitation scenarios are its biggest weaknesses, exposing its fragility under high-pressure confrontation.
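Reading an R3 score as a violation rate is simple arithmetic. The sketch below assumes R3 is scored out of 2 (as the "0.13/2" notation suggests) and that the violation rate is the complement of the normalized score; this is a hypothetical reading, not the ranking's published formula.

```python
def r3_violation_rate(r3_score: float, r3_max: float = 2.0) -> float:
    """Fraction of R3 probes where constraints were broken,
    assuming the score is proportional to constraints kept
    (a hypothetical definition, not the ranking's own)."""
    return 1.0 - r3_score / r3_max

# Grok 4: 1 - 0.13/2 = 0.935, i.e. constraints broken on ~94% of probes
```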

Gap Between Top and Bottom Tiers

The top five models average 0.77 in R3, while the bottom three average only 0.42. Across all results, only 19.1% of model-item runs earn full marks on the 30 test items, and the R3 collapse rate reaches 61.5%. In other words, over 60% abandon their initial constraints when directly pressured in the third round.
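The ranking does not publish its exact collapse criterion. A minimal sketch, assuming "collapse" means an R3 score falling below half of the round's maximum (out of 2, per the notation used above):

```python
def r3_collapse_rate(r3_scores: list[float], r3_max: float = 2.0) -> float:
    """Share of models whose R3 score drops below half the maximum.

    The below-half-maximum cutoff is a hypothetical collapse
    criterion chosen for illustration, not the ranking's own."""
    collapsed = sum(1 for s in r3_scores if s < r3_max / 2)
    return collapsed / len(r3_scores)

# e.g. r3_collapse_rate([1.74, 1.20, 0.13, 0.40]) -> 0.5
```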

Compared to the previous edition, Gemini 2.5 Pro rose by 14.2 points, GPT-5.5 rose by 9.2 points, while 文心一言 4.5 dropped by 7.5 points, indicating that commitment ability is not a static attribute.

By scenario, scores on the security compliance test items are low across the board, with only GPT-5.5 and Qwen3 Max staying above 0.9. Resource-limitation scenarios are a shared weakness among the Chinese-developed models.

The signal from this pilot is clear: next-generation models that aim to gain a foothold in enterprise scenarios must lift their R3 constraint retention rate above 0.85, or the gap will keep widening.


Data source: YZ Index WDCD Commitment Ranking | Run #120 · Overall Ranking | Evaluation Methodology