WDCD Review: Business Rules Trip Up All 11 Models, While Security Compliance Scores Spread Across 2 Points

The clearest finding from the WDCD five-scenario review is that business rules are a shared weakness: the 11 models average only 2.05 points (out of 4) in this scenario, well below the 2.59 points for data boundaries.

Why Business Rules Are the Hardest Scenario

The top scorers in the business rules scenario, including claude-opus-4.7, reached only 3/4, while the lowest-ranked doubao-pro and ernie-4.5 dropped to 1/4. This indicates that when R3 applies pressure to violate internal enterprise approval processes or pricing strategies, most models give way. In contrast, gemini-2.5-pro, gpt-5.5, and qwen3-max all tied at 3.5 points in the security compliance scenario, which suggests that compliance constraints are more easily internalized by models.

Security Compliance Shows the Largest Differentiation

In the security compliance scenario, the highest score of 3.5 points and the lowest of 1.5 points are 2 points apart, a spread the review identifies as the widest among the five scenarios. gemini-2.5-pro achieved a near-perfect 3.5 here but only 1.5 points in the resource constraints scenario, revealing a clear imbalance of “guarding security but not costs.” gpt-5.5 is similarly unbalanced, with 3.5 points in security compliance but only 1.5 points in resource constraints.

In the data boundaries scenario, qwen3-max leads with 3.5 points, while its engineering norms score is only 2 points, a gap of 1.5 points. This indicates that it adheres well to the constraint of “cannot leak training data” but performs poorly on the engineering constraint of “cannot infinitely call tools.”

Imbalance Map of Each Model

claude-opus-4.7 scored 3 points in business rules but only 2 points in engineering norms, a gap of 1 point; grok-4 also scored 3 points in business rules but fell to 1.5 points in engineering norms, a gap of 1.5 points. deepseek-v4-pro is relatively balanced, with 3 points in security compliance and 2 points in resource constraints. doubao-pro and ernie-4.5 rank lowest, each scoring only 1 point in business rules.
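The gaps described above are simple differences between a model's strongest and weakest scenario. A minimal sketch of that arithmetic follows, using only the scores quoted in this article rather than the full Run #135 matrix. The table layout, the scenario key names, and the function names are illustrative assumptions, not part of the WDCD tooling.

```python
# Scores quoted in this article only (0-4 scale). This is NOT the full
# WDCD matrix; models and scenarios without a quoted score are omitted.
# The scenario key names are labels chosen for this sketch.
CITED_SCORES = {
    "gemini-2.5-pro": {"security_compliance": 3.5, "resource_constraints": 1.5},
    "gpt-5.5": {"security_compliance": 3.5, "resource_constraints": 1.5},
    "qwen3-max": {"security_compliance": 3.5, "data_boundaries": 3.5,
                  "engineering_norms": 2.0},
    "claude-opus-4.7": {"business_rules": 3.0, "engineering_norms": 2.0},
    "grok-4": {"business_rules": 3.0, "engineering_norms": 1.5},
    "deepseek-v4-pro": {"security_compliance": 3.0, "resource_constraints": 2.0},
}


def imbalance(scores):
    """Return (gap, strongest scenario, weakest scenario) for one model."""
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return scores[best] - scores[worst], best, worst


def rank_by_imbalance(table):
    """Sort models from most to least imbalanced by their cited scores."""
    rows = [(model, *imbalance(scores)) for model, scores in table.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for model, gap, best, worst in rank_by_imbalance(CITED_SCORES):
    print(f"{model:16} gap={gap:.1f}  strongest={best}  weakest={worst}")
```

Because only a subset of scores is quoted here, the computed gaps are lower bounds: a model's true spread across all five scenarios could be wider.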

Specific Recommendations for Enterprise Model Selection

  • For scenarios with strong business rules (e.g., finance, e-commerce), prioritize claude-opus-4.7 or claude-sonnet-4.6, both scoring 3 points in business rules;
  • For scenarios where security compliance is the top priority (e.g., healthcare, government), choose one of gemini-2.5-pro, gpt-5.5, or qwen3-max;
  • For SaaS companies that need to guard both data boundaries and resource constraints, qwen3-max is currently the strongest option;
  • For DevOps scenarios with strict engineering norms requirements, claude-sonnet-4.6 and deepseek-v4-pro are more stable.
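The selection logic behind these recommendations can be sketched as a threshold filter: state a minimum score for each scenario that matters and keep only the models that meet all of them. The example below is an illustration under stated assumptions. It uses only the scores quoted in this article, and the `shortlist` function, the scenario key names, and the choice to treat a score the article does not quote as a failure are all hypothetical.

```python
from typing import Dict, List

# Scores quoted in this article only (0-4 scale); not the full WDCD matrix.
SCORES: Dict[str, Dict[str, float]] = {
    "claude-opus-4.7": {"business_rules": 3.0, "engineering_norms": 2.0},
    "claude-sonnet-4.6": {"business_rules": 3.0},
    "grok-4": {"business_rules": 3.0, "engineering_norms": 1.5},
    "gemini-2.5-pro": {"security_compliance": 3.5, "resource_constraints": 1.5},
    "gpt-5.5": {"security_compliance": 3.5, "resource_constraints": 1.5},
    "qwen3-max": {"security_compliance": 3.5, "data_boundaries": 3.5,
                  "engineering_norms": 2.0},
    "deepseek-v4-pro": {"security_compliance": 3.0, "resource_constraints": 2.0},
    "doubao-pro": {"business_rules": 1.0},
    "ernie-4.5": {"business_rules": 1.0},
}


def shortlist(requirements: Dict[str, float],
              table: Dict[str, Dict[str, float]] = SCORES) -> List[str]:
    """Return models whose cited score meets every minimum.

    A scenario with no cited score counts as a failure (an assumption of
    this sketch), so models are never shortlisted on missing data.
    """
    picked = []
    for model, scores in table.items():
        if all(scores.get(scenario, float("-inf")) >= minimum
               for scenario, minimum in requirements.items()):
            picked.append(model)
    return sorted(picked)


# Single-constraint case: strong business rules.
print(shortlist({"business_rules": 3.0}))
# Combined case: security compliance plus resource constraints.
print(shortlist({"security_compliance": 3.0, "resource_constraints": 2.0}))
```

With the quoted scores, the combined requirement leaves only deepseek-v4-pro, since gemini-2.5-pro and gpt-5.5 fall short on resource constraints and no resource constraints score is quoted for qwen3-max.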

This review has clearly shown that no model leads across all five scenarios. Enterprises must abandon the illusion of “all-round” capability and instead match models to their core constraint scenarios; otherwise, they are likely to run into problems in real business operations.

If R3 pressure intensity continues to increase, the average score for the business rules scenario is likely to keep declining, making it a key indicator of whether next-generation models truly understand the “enterprise contract.”


Data Source: YZ Index WDCD Compliance Leaderboard | Run #135 · Scenario Matrix | Evaluation Methodology