Resource Limitation Scenario: All Models Collapse! WDCD Test Averages Only 1.95 Points Across 11 Models

The WDCD compliance test evaluates how stable a model remains under real enterprise constraints across three rounds of dialogue. The resource limitation scenario scored the lowest overall, averaging only 1.95 points, and proved a "stumbling block" for all 11 models.

Why Resource Limitation Is the Biggest Challenge

The resource limitation scenario requires models to strictly adhere to explicit quotas, concurrency limits, and cost budgets, with pressure peaking in the R3 phase. deepseek-v4-pro leads with 2.33 points, most of the remaining models score below 2 points, and doubao-pro sits at the bottom with only 1.33 points. This suggests that most models compromise when faced with "hard budgets," prioritizing the user's immediate request over long-term constraints.

Safety Compliance Scenario Shows the Highest Differentiation

The safety compliance scenario separates the models most clearly. gemini-3.1-pro and qwen3-max are tied at 3.5 points, while grok-4 scores only 2.33 points, a gap of 1.17 points. The gemini series maintains its compliance boundaries even during the R2 interference phase, which suggests more stable internal safety alignment. This scenario is suitable as a primary screening indicator for financial and healthcare enterprises that are sensitive to regulatory requirements.

Real Risks of Specialized Models

doubao-pro scored 3.17 points (tied for first) in business rules but plummeted to 1.33 points in resource limitations, a gap of 1.84 points between scenarios. qwen3-max scored 3.5 points in safety compliance but only 2 points in engineering standards, a gap of 1.5 points. gpt-o3 scored 3.17 points in business rules but 2 points in engineering standards, a gap of 1.17 points. Enterprises that look only at a single scenario leaderboard can easily choose the wrong model.
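The gaps quoted above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the score pairs named in this paragraph; the dictionary layout and the function names `scenario_gap` and `rank_by_gap` are illustrative assumptions, not part of the WDCD methodology. The models' other scenario scores are simply left out, so the true spread for a model may be wider than what is computed here.

```python
# Score pairs quoted in the paragraph above. Scenarios not mentioned there
# are omitted (not zero), so these gaps are lower bounds on the real spread.
SCORES = {
    "doubao-pro": {"business_rules": 3.17, "resource_limitation": 1.33},
    "qwen3-max": {"safety_compliance": 3.5, "engineering_standards": 2.0},
    "gpt-o3": {"business_rules": 3.17, "engineering_standards": 2.0},
}


def scenario_gap(scores):
    """Return (best scenario, worst scenario, gap) for one model."""
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    # Round to two decimals to match the precision of the published scores.
    return best, worst, round(scores[best] - scores[worst], 2)


def rank_by_gap(table):
    """Sort models from most to least specialized (largest gap first)."""
    gaps = {model: scenario_gap(s)[2] for model, s in table.items()}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for model, gap in rank_by_gap(SCORES):
        print(f"{model}: gap of {gap:.2f} points")
```

Sorting by gap rather than by peak score puts the most uneven model first, which is the risk a single-scenario leaderboard hides.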

Champion Model Profiles by Scenario

  • Data boundary: qwen3-max 3.13 points, suitable for strict data isolation scenarios
  • Business rules: doubao-pro, gpt-o3, qwen3-max tied at 3.17 points, strongest rule execution
  • Safety compliance: gemini-3.1-pro, qwen3-max 3.5 points, top choices for regulatory compliance
  • Engineering standards: claude-sonnet-4.6 3 points, outstanding performance in code and process constraints

Specific Recommendations for Enterprise Model Selection

For enterprises that need to handle multiple scenario constraints simultaneously, qwen3-max or gemini-3.1-pro are recommended first, as both rank in the top three for safety and data boundaries and have relatively low specialization bias. For SaaS or internal approval systems that purely pursue business rule implementation, doubao-pro can be considered, but it must be paired with a model stronger in resource limitations for secondary verification. claude-sonnet-4.6 is suitable for DevOps and code review scenarios with high engineering standards.
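One way to picture the "secondary verification" pairing is a two-stage pipeline in which the rule-focused model proposes an action and a second stage checks it against the hard limits before anything is returned. The sketch below is only an illustration under stated assumptions: `Proposal`, `Limits`, `violations`, and `verified_reply` are hypothetical names, the primary model is represented by a plain callable, and the verifier slot (where a model stronger in resource limitations would sit) is filled with a simple rule-based stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Proposal:
    """Hypothetical output of the primary model: a reply plus the resources it needs."""
    text: str
    requests: int      # calls counted against the quota
    concurrency: int   # parallel workers requested
    cost: float        # estimated spend, in budget units


@dataclass
class Limits:
    """Hypothetical hard limits: quota, concurrency cap, and cost budget."""
    quota: int
    max_concurrency: int
    budget: float


def violations(p: Proposal, limits: Limits) -> List[str]:
    """Rule-based stand-in for the verifier: list every hard limit that is exceeded."""
    found = []
    if p.requests > limits.quota:
        found.append("quota")
    if p.concurrency > limits.max_concurrency:
        found.append("concurrency")
    if p.cost > limits.budget:
        found.append("budget")
    return found


def verified_reply(primary: Callable[[str], Proposal],
                   verifier: Callable[[Proposal, Limits], List[str]],
                   prompt: str, limits: Limits) -> str:
    """Return the primary model's reply only if the verifier finds no violations."""
    proposal = primary(prompt)
    broken = verifier(proposal, limits)
    if broken:
        return "Rejected: exceeds " + ", ".join(broken)
    return proposal.text
```

For example, with `Limits(quota=100, max_concurrency=4, budget=20.0)`, a proposal asking for 8 parallel workers is rejected for exceeding the concurrency limit even though its quota and cost are within bounds.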

The low scores in resource limitations expose a systemic shortcoming of current large models when it comes to "saying no."

If future versions introduce dynamic budget adjustment tests in the resource limitation scenario, the current rankings of leading models could undergo a dramatic reshuffle.


Data source: YZ Index WDCD Compliance Ranking | Run #140 · Scenario Matrix | Evaluation Methodology