The WDCD pilot results are out, and the cross-matrix of 30 questions, 11 models, and 5 scenarios delivers a counterintuitive conclusion: no model can keep commitments across all scenarios, and the seemingly simplest scenario—"resource constraints"—tripped up every model, with champion grok-4 scoring only 2.17/4.
Hardest Scenario: Resource Constraints Become a "Waterloo" for All Models
When you spread out the averages across the five scenarios, the gap is stark. In the security compliance scenario, the overall average is about 2.92; in business rules, about 2.59; but in the resource constraints scenario, the overall average is only 1.89, failing to even reach the passing line. This means that when you tell a model hard resource limits like "budget is only 5,000 yuan," "single-call token limit is 2,000," or "response time must be within 200ms," the model will likely "forget" or "exceed" these numbers after three rounds of conversation.
Even more ironic is the ranking inversion. grok-4 is at the bottom in data boundaries, business rules, and security compliance, yet it rebounds to the top in the resource constraints scenario. This isn't because grok-4 has become stronger—it's because other models collectively suffer memory loss in the face of resource numbers: deepseek-v4-pro, doubao-pro, and ernie-4.5 all scored a flat 1.67, unable even to maintain "R1 effective injection" reliably.
The failure pattern in the resource constraints scenario is highly consistent: R1 accepts constraints → R2 memory of constraints decays after casual conversation → R3 user pleads with "just make an exception this time," and the model immediately relents. Numerical constraints decay much faster than rule-based constraints.
Most Discriminating Scenario: Business Rules Create a 1.5-Point Gap
The range in the business rules scenario is the largest—claude-opus-4.7 and gpt-o3 tie for first at 3.17, while grok-4 is at the bottom with 1.67, a gap of 1.5 points from top to bottom, nearly a full-tier difference in capability. This scenario tests rigid enterprise processes like "workflows must be approved before execution" and "customers below tier A are not eligible for discounts"—exactly the core of SaaS, ERP, and CRM applications.
In contrast, the security compliance scenario shows the smallest discrimination across the 11 models, ranging from 2.33 to 3.5. This indicates that compliance adherence has been heavily trained via RLHF by all players, almost becoming a factory-default capability, but the ability to guard "customer-defined constraints" like business rules is the true test of a model's contextual commitment resilience.
Subject Imbalance Atlas: deepseek Is the Most Dangerous "Two-Face"
Eight models show scenario imbalances exceeding 1 point, with deepseek-v4-pro being the most severely imbalanced—scoring 3.33 (runner-up) in security compliance but dropping to 1.67 (among the lowest) in resource constraints, a gap of 1.66 points. This imbalance is a landmine for enterprises: you think it is "very rule-abiding," but once you ask it to manage costs or quotas, it goes rogue.
gpt-o3 is another typical case. It scores a stunning 3.5 in security compliance but only 2 in engineering standards. It can stubbornly refuse to output prohibited content, yet fails to remember engineering discipline like "code must use TypeScript strict mode" or "no 'any' type allowed." For AI coding platforms, gpt-o3's engineering standards weakness is more concerning than its compliance strength.
An inverse example is gemini-3.1-pro: it wins the sole championship in engineering standards with 2.75, but scores only 2.83 (mid-tier) in security compliance. Interestingly, it is a full 1 point ahead of gemini-2.5-pro in engineering standards (2.75 vs. 1.75)—same family, yet engineering discipline differs by an entire tier, signaling that Google's latest tuning has significantly weighted coding scenarios.
Four Golden Rules for Enterprise Model Selection
- Compliance-driven businesses (finance, healthcare, government): First choice is gpt-o3 (3.5) or deepseek-v4-pro (3.33), but the latter should avoid cost-sensitive scenarios.
- SaaS / Business process automation: claude-opus-4.7 and gpt-o3 tied as first choice (3.17), with claude-sonnet-4.6 as a cost-effective alternative (3.0).
- AI Coding / Engineering platforms: gemini-3.1-pro is the dark horse first choice (2.75), followed closely by the claude twins. Never use gemini-2.5-pro for coding—its engineering standard is only 1.75, tying with grok-4 at the bottom.
- Agent systems involving budget/quota/rate limits: No model is trustworthy; hard external guardrails must be added, with the model layer serving only as the last line of soft constraint.
The WDCD pilot data tears off the last fig leaf of "overall score worship"—there are no all-rounders, only scenario-appropriate partners. As model capabilities converge, commitment stability becomes the true moat for enterprise-grade deployment. Next time someone pitches a model with a single total score, first ask: which scenario did you test?
Data source: YZ Index WDCD Compliance Rankings | Run #100 · Scenario Matrix | Evaluation Methodology
© 2026 Winzheng.com 赢政天下 | 转载请注明来源并附原文链接