WDCD Review: Resource Constraints Are the Achilles' Heel of All 11 Models, With an Average Score of Only 1.7

The starkest finding of the WDCD compliance test is that every model struggled with resource constraints: the 11 models averaged only 1.7 out of 4 in this scenario, far below their averages in the other four scenarios.

Why Resource Constraints Became a Collective Waterloo

In the three-round dialogue design, most models gave in as soon as the third round (R3) pressured them directly to break a resource quota. gemini-2.5-pro, with a score of 2.5, was the only model to exceed 2 points, while the remaining ten models were all stuck in the 1–2 point range. doubao-pro and ernie-4.5 both dropped to 1 point, indicating almost no resistance to persistent requests such as "give a little more quota."

The Two Most Discriminating Scenarios

Resource constraints and data boundaries are the two scenarios that separate the models most clearly. In data boundaries, claude-opus-4.7 and claude-sonnet-4.6 scored 3 points, while the gemini series and ernie-4.5 scored only 1.5 points, a gap of 1.5 points. Resource constraints, meanwhile, pulled doubao-pro from the top of the business rules scenario straight to the bottom: it fell from 4 to 1, a drop of 3 points between the two scenarios.

Severely Uneven Performance Is Widespread

  • doubao-pro scored a perfect 4 in business rules but only 1 in resource constraints, a typical case of "good at reasoning but unable to hold the line."
  • claude-opus-4.7 scored 3.5 in security compliance and 3 in engineering standards, but only 1.5 in resource constraints, showing a clear shortfall in hard quota control.
  • deepseek-v4-pro scored 3.5 in security compliance but only 1.5 in data boundaries, indicating that it is easily talked across the line in sensitive data boundary scenarios.
  • gpt-5.5 and gpt-o3 both scored 4 in business rules, yet only 1.5 in resource constraints, also exhibiting the trait of "strong in business, weak in constraints."
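
One simple way to quantify the unevenness described above is the spread between a model's best and worst scenario score. The sketch below uses only the scores quoted in the list above; the remaining scenario scores are not given in this article, so each result is a lower bound on the true spread. The names `STATED_SCORES` and `spread` are illustrative and not part of the WDCD methodology.

```python
# Scores quoted in the list above (partial: not every scenario is stated).
STATED_SCORES = {
    "doubao-pro": {"business_rules": 4.0, "resource_constraints": 1.0},
    "claude-opus-4.7": {
        "security_compliance": 3.5,
        "engineering_standards": 3.0,
        "resource_constraints": 1.5,
    },
    "deepseek-v4-pro": {"security_compliance": 3.5, "data_boundaries": 1.5},
    "gpt-5.5": {"business_rules": 4.0, "resource_constraints": 1.5},
    "gpt-o3": {"business_rules": 4.0, "resource_constraints": 1.5},
}


def spread(scores: dict[str, float]) -> float:
    """Gap between the best and worst stated scenario score."""
    return max(scores.values()) - min(scores.values())


def rank_by_spread(table: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Models sorted from most uneven to least uneven (ties keep input order)."""
    pairs = [(name, spread(scores)) for name, scores in table.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, gap in rank_by_spread(STATED_SCORES):
        print(f"{name}: spread of at least {gap:.1f} points")
```

On the quoted figures, doubao-pro has the widest gap (3 points), followed by gpt-5.5 and gpt-o3 (2.5 points each).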

Specific Recommendations for Enterprise Model Selection

If your core scenarios are financial risk control or medical compliance, prioritize claude-opus-4.7 or ernie-4.5, as these two models have the highest and most stable scores in the security compliance scenario.

If the business mainly involves internal approval workflows, contract terms, and pricing rules, doubao-pro and gpt-5.5 are more reliable, as they achieved perfect scores in the business rules scenario.

For teams that need strict control over API quotas, concurrency, and storage limits, no model can currently be trusted to hold the line on its own. gemini-2.5-pro is the best of the group, but it still scored only 2.5 points. Add an external rate-limiting layer instead of relying on the model to enforce quotas.
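
An external rate-limiting layer of this kind could be as simple as a token bucket placed in front of the model call. The following is a minimal single-process sketch, not a production design: `TokenBucket`, `guarded_call`, and the `call_model` argument are hypothetical names introduced here for illustration, and nothing in it comes from the WDCD report. The point is that the quota check runs in ordinary code, so no amount of conversational pressure can talk it into granting more.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second` tokens."""

    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; return False when over quota."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False


def guarded_call(bucket: TokenBucket, call_model: Callable[[str], str], prompt: str) -> str:
    """Reject the request before it reaches the model when the quota is spent."""
    if not bucket.allow():
        raise RuntimeError("quota exhausted; request rejected before reaching the model")
    return call_model(prompt)
```

A real deployment would more likely enforce this at an API gateway or in a shared store so that the limit holds across processes; the in-memory version above only shows the shape of the check.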

Scores in the engineering standards scenario are high overall. Every model except qwen3-max and ernie-4.5 reaches 3 points, so for this scenario most models can stand in for one another.

No model passes all scenarios; model selection is essentially about accepting uneven performance.

The WDCD pilot phase has made it clear that resource constraints are a weakness shared by all major models today. If the weight of resource constraints is raised to 40% in the next phase, the rankings would be reshuffled drastically.
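
To see why a 40% weight could reshuffle the rankings, consider the arithmetic in the sketch below. It rests on several assumptions: this article does not state WDCD's current weights, so an equal 20% per scenario is assumed as the baseline, and the remaining 60% is assumed to be split evenly across the other four scenarios. `model_a` and `model_b` are invented profiles, not real WDCD results; they only mimic the "strong in business, weak in constraints" pattern and its opposite.

```python
SCENARIOS = (
    "business_rules",
    "security_compliance",
    "engineering_standards",
    "data_boundaries",
    "resource_constraints",
)

# Hypothetical profiles for illustration only; NOT actual WDCD scores.
EXAMPLE_MODELS = {
    "model_a": dict(zip(SCENARIOS, (4.0, 3.0, 3.0, 3.0, 1.0))),  # weak on quotas
    "model_b": dict(zip(SCENARIOS, (3.0, 3.0, 3.0, 2.0, 2.5))),  # steadier on quotas
}


def make_weights(resource_weight: float) -> dict[str, float]:
    """Give resource constraints `resource_weight`; split the rest evenly (assumed)."""
    others = [s for s in SCENARIOS if s != "resource_constraints"]
    share = (1.0 - resource_weight) / len(others)
    weights = {s: share for s in others}
    weights["resource_constraints"] = resource_weight
    return weights


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the five scenario scores."""
    return sum(scores[s] * weights[s] for s in SCENARIOS)


def rank(models: dict[str, dict[str, float]], weights: dict[str, float]) -> list[str]:
    """Model names ordered from highest to lowest weighted score."""
    return sorted(models, key=lambda name: weighted_score(models[name], weights), reverse=True)


if __name__ == "__main__":
    for resource_weight in (0.2, 0.4):
        weights = make_weights(resource_weight)
        print(f"resource weight {resource_weight:.0%}: {rank(EXAMPLE_MODELS, weights)}")
```

With equal weights the invented `model_a` comes out ahead (about 2.8 versus 2.7); at a 40% resource weight the order flips (about 2.35 versus 2.65). The same mechanism would penalize any real model whose resource constraints score sits well below its other scores.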


Data source: YZ Index WDCD Compliance Ranking | Run #146 · Scenario Matrix | Evaluation Methodology