WDCD Side-by-Side Review of 11 Models: Resource Constraints Scores Sink as Low as 1 Point, Business Rules Scores Range From 4 Down to 2

WDCD pilot data shows that Resource Constraints was the lowest-scoring scenario overall: the top scorer, gemini-3.1-pro, reached only 2.5 points, and doubao-pro finished last with 1 point. Business Rules was the biggest differentiator, with gemini-2.5-pro and gpt-o3 both earning the full 4 points while claude-opus-4.7 scored only 2.

Why Resource Constraints Collectively Failed

Across the three rounds of compliance testing, the R3 direct-pressure round did the most damage to resource constraint adherence. gemini-3.1-pro, at 2.5 points, was the only model to exceed 2 points; the other 10 models all scored 2 points or below. doubao-pro was distracted by an irrelevant topic in R2 and then abandoned the constraints outright in R3, finishing with only 1 point. claude-opus-4.7, the top scorer in the Data Boundary scenario, managed only 1.5 points here, which reveals insufficient sensitivity to dynamic limits such as "compute quotas" and "concurrency caps".

Business Rules Show the Highest Differentiation

The Business Rules scenario had the widest score range, from 4 points down to 2. After the constraint "do not bypass approval levels" was injected in R1, gemini-2.5-pro and gpt-o3 followed it strictly through R2 and R3 and earned full marks. Four models (claude-opus-4.7, ernie-4.5, gemini-3.1-pro, and grok-4) scored only 2 points, which reveals a clear weakness in adhering to internal corporate process rules.

Lopsided Specialists Exposed Across the Board

All 11 models showed a gap of at least 1 point between their strongest and weakest scenarios. claude-opus-4.7 scored 3.5 points in Data Boundary but 1.5 in Resource Constraints, a 2-point gap; gpt-o3 scored 4 points in Business Rules but 1.5 in Resource Constraints, a 2.5-point gap; doubao-pro scored 3 points in Business Rules but only 1 in Resource Constraints, also a 2-point gap. These models performed well in one scenario and failed quickly in another, which suggests that their compliance ability depends heavily on how well the training data covers each scenario.

  • claude-opus-4.7: Strong in Data Boundary and Engineering Standards, weak in Resource Constraints
  • gemini-2.5-pro: Full marks in Business Rules, only 2 points in Data Boundary
  • deepseek-v4-pro: 3 points in Business Rules, 1.5 points in Resource Constraints
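
The gap figures above are simple best-minus-worst differences per model. A minimal Python sketch of that arithmetic follows, using only the scores quoted in this article. The names QUOTED_SCORES, scenario_gap, and gap_report are illustrative assumptions, not part of the WDCD tooling, and because the full five-scenario matrix is not reproduced here, each computed gap is only a lower bound on the true gap.

```python
# Scores quoted in this article only; the full scenario matrix is NOT included,
# so each gap computed here is a lower bound on the model's true gap.
QUOTED_SCORES = {
    "claude-opus-4.7": {"Data Boundary": 3.5, "Business Rules": 2.0, "Resource Constraints": 1.5},
    "gpt-o3": {"Business Rules": 4.0, "Resource Constraints": 1.5},
    "doubao-pro": {"Business Rules": 3.0, "Resource Constraints": 1.0},
    "gemini-2.5-pro": {"Business Rules": 4.0, "Data Boundary": 2.0},
    "deepseek-v4-pro": {"Business Rules": 3.0, "Resource Constraints": 1.5},
}


def scenario_gap(scores):
    """Return (gap, best_scenario, worst_scenario) for one model's scores."""
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return scores[best] - scores[worst], best, worst


def gap_report(all_scores):
    """Return (model, gap, best, worst) rows sorted by gap, largest first."""
    rows = [(model, *scenario_gap(scores)) for model, scores in all_scores.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for model, gap, best, worst in gap_report(QUOTED_SCORES):
        print(f"{model}: gap {gap:.1f} ({best} vs {worst})")
```

Sorting by gap puts the most uneven model first, which on the quoted scores is gpt-o3 at 2.5 points.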

Specific Recommendations for Enterprise Model Selection

If the core scenario is Data Boundary and Engineering Standards, prioritize claude-opus-4.7. If business approval processes must be executed strictly, gemini-2.5-pro and gpt-o3 are more reliable. No model currently has a clear advantage in the Resource Constraints scenario; gemini-3.1-pro is relatively the most stable but still requires additional manual verification. In the Security & Compliance scenario, claude-sonnet-4.6 and qwen3-max are tied for the lead, which makes them options for compliance-sensitive businesses.
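
These recommendations amount to a lookup from constraint type to candidate models. The following Python sketch encodes them that way, under stated assumptions: the mapping is transcribed from the paragraph above rather than taken from the leaderboard data, and RECOMMENDED, NEEDS_MANUAL_CHECK, and shortlist are hypothetical names chosen for illustration.

```python
# Scenario -> models recommended in this article (transcribed from the text,
# not from the underlying leaderboard data).
RECOMMENDED = {
    "Data Boundary": {"claude-opus-4.7"},
    "Engineering Standards": {"claude-opus-4.7"},
    "Business Rules": {"gemini-2.5-pro", "gpt-o3"},
    "Resource Constraints": {"gemini-3.1-pro"},
    "Security & Compliance": {"claude-sonnet-4.6", "qwen3-max"},
}

# Scenarios where even the recommended model still needs manual verification.
NEEDS_MANUAL_CHECK = {"Resource Constraints"}


def shortlist(required_scenarios):
    """Return (per-scenario candidates, models recommended for every scenario)."""
    per_scenario = {}
    for scenario in required_scenarios:
        if scenario not in RECOMMENDED:
            raise KeyError(f"unknown scenario: {scenario}")
        per_scenario[scenario] = {
            "models": sorted(RECOMMENDED[scenario]),
            "manual_check": scenario in NEEDS_MANUAL_CHECK,
        }
    sets = [RECOMMENDED[s] for s in required_scenarios]
    covers_all = sorted(set.intersection(*sets)) if sets else []
    return per_scenario, covers_all


if __name__ == "__main__":
    picks, covers_all = shortlist(["Data Boundary", "Business Rules"])
    print(picks)
    print(covers_all)  # empty: no single recommended model covers both
```

An empty second result means the workload needs a different model per constraint type instead of a single pick.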

The pilot phase has clearly shown that no model maintains a lead across all five scenarios. Enterprises must abandon the "one-size-fits-all" selection approach and match models to the actual constraint type; otherwise, rules are very likely to be broken during the R3 pressure round.

Resource constraints will become the biggest bottleneck for model iteration in the next phase. Whoever breaks through first will gain a decisive advantage in enterprise-level compliance testing.

Data source: YZ Index WDCD Compliance Leaderboard | Run #157 · Scenario Matrix | Evaluation Methodology