Five Scenario Truth Mirror: Resource Constraints Trip Up All Models, Top Score Only 2.17

May 3, 2026 45 Views - Read Source Winzheng Index

WDCD 守约测试场景横评企业选型模型偏科

The WDCD pilot results are out, and the cross-matrix of 30 questions, 11 models, and 5 scenarios delivers a counterintuitive conclusion: no model can keep commitments across all scenarios, and the seemingly simplest scenario—"resource constraints"—tripped up every model, with champion grok-4 scoring only 2.17/4.

Hardest Scenario: Resource Constraints Become a "Waterloo" for All Models

When you spread out the averages across the five scenarios, the gap is stark. In the security compliance scenario, the overall average is about 2.92; in business rules, about 2.59; but in the resource constraints scenario, the overall average is only 1.89, failing to even reach the passing line. This means that when you tell a model hard resource limits like "budget is only 5,000 yuan," "single-call token limit is 2,000," or "response time must be within 200ms," the model will likely "forget" or "exceed" these numbers after three rounds of conversation.

Even more ironic is the ranking inversion. grok-4 is at the bottom in data boundaries, business rules, and security compliance, yet it rebounds to the top in the resource constraints scenario. This isn't because grok-4 has become stronger—it's because other models collectively suffer memory loss in the face of resource numbers: deepseek-v4-pro, doubao-pro, and ernie-4.5 all scored a flat 1.67, unable even to maintain "R1 effective injection" reliably.

The failure pattern in the resource constraints scenario is highly consistent: R1 accepts constraints → R2 memory of constraints decays after casual conversation → R3 user pleads with "just make an exception this time," and the model immediately relents. Numerical constraints decay much faster than rule-based constraints.

Most Discriminating Scenario: Business Rules Create a 1.5-Point Gap

The range in the business rules scenario is the largest—claude-opus-4.7 and gpt-o3 tie for first at 3.17, while grok-4 is at the bottom with 1.67, a gap of 1.5 points from top to bottom, nearly a full-tier difference in capability. This scenario tests rigid enterprise processes like "workflows must be approved before execution" and "customers below tier A are not eligible for discounts"—exactly the core of SaaS, ERP, and CRM applications.

In contrast, the security compliance scenario shows the smallest discrimination across the 11 models, ranging from 2.33 to 3.5. This indicates that compliance adherence has been heavily trained via RLHF by all players, almost becoming a factory-default capability, but the ability to guard "customer-defined constraints" like business rules is the true test of a model's contextual commitment resilience.

Subject Imbalance Atlas: deepseek Is the Most Dangerous "Two-Face"

Eight models show scenario imbalances exceeding 1 point, with deepseek-v4-pro being the most severely imbalanced—scoring 3.33 (runner-up) in security compliance but dropping to 1.67 (among the lowest) in resource constraints, a gap of 1.66 points. This imbalance is a landmine for enterprises: you think it is "very rule-abiding," but once you ask it to manage costs or quotas, it goes rogue.

gpt-o3 is another typical case. It scores a stunning 3.5 in security compliance but only 2 in engineering standards. It can stubbornly refuse to output prohibited content, yet fails to remember engineering discipline like "code must use TypeScript strict mode" or "no 'any' type allowed." For AI coding platforms, gpt-o3's engineering standards weakness is more concerning than its compliance strength.

An inverse example is gemini-3.1-pro: it wins the sole championship in engineering standards with 2.75, but scores only 2.83 (mid-tier) in security compliance. Interestingly, it is a full 1 point ahead of gemini-2.5-pro in engineering standards (2.75 vs. 1.75)—same family, yet engineering discipline differs by an entire tier, signaling that Google's latest tuning has significantly weighted coding scenarios.

Four Golden Rules for Enterprise Model Selection

Compliance-driven businesses (finance, healthcare, government): First choice is gpt-o3 (3.5) or deepseek-v4-pro (3.33), but the latter should avoid cost-sensitive scenarios.
SaaS / Business process automation: claude-opus-4.7 and gpt-o3 tied as first choice (3.17), with claude-sonnet-4.6 as a cost-effective alternative (3.0).
AI Coding / Engineering platforms: gemini-3.1-pro is the dark horse first choice (2.75), followed closely by the claude twins. Never use gemini-2.5-pro for coding—its engineering standard is only 1.75, tying with grok-4 at the bottom.
Agent systems involving budget/quota/rate limits: No model is trustworthy; hard external guardrails must be added, with the model layer serving only as the last line of soft constraint.

The WDCD pilot data tears off the last fig leaf of "overall score worship"—there are no all-rounders, only scenario-appropriate partners. As model capabilities converge, commitment stability becomes the true moat for enterprise-grade deployment. Next time someone pitches a model with a single total score, first ask: which scenario did you test?

Data source: YZ Index WDCD Compliance Rankings | Run #100 · Scenario Matrix | Evaluation Methodology

Five Scenario Truth Mirror: Resource Constraints Trip Up All Models, Top Score Only 2.17

Hardest Scenario: Resource Constraints Become a "Waterloo" for All Models

Most Discriminating Scenario: Business Rules Create a 1.5-Point Gap

Subject Imbalance Atlas: deepseek Is the Most Dangerous "Two-Face"

Four Golden Rules for Enterprise Model Selection

Related Reviews

Winzheng Index WDCD Cycle Tremors: Top Three Decline, 文心一言 Rises Alone – Why Is Rule-Keeping Ability Deteriorating Collectively?

Winzheng Index R1 Answers Well, R3 Completely Collapses: 63% Defeat Rate Revealed in Commitment Decay Test of 11 Models

Winzheng Index 330 Pressure Tests: 63% of Large Models Defected in the Third Round

Winzheng Index 5 Reasons: Commitment Capability Will Become the Next Core Indicator of AI Models, Disrupting Selection Rules!