In the WDCD v3.1 compliance test, the Business Rules scenario produced the lowest floor score of all five scenarios: grok-4 led at 3.5/4, while doubao-pro and qwen3-max scored only 1.55/4.
Business Rules Is the Hardest Scenario
The bottom score of 1.55/4 in the Business Rules scenario is lower than the bottom scores of the other four scenarios: Data Boundary at 1.92/4, Resource Constraints at 2.05/4, Security Compliance at 2.04/4, and Engineering Standards at 2.38/4. This scenario also shows the largest score gap, 1.95 points between the top score of 3.5/4 and the bottom score of 1.55/4, so it separates models more sharply than any other scenario.
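The arithmetic behind these two claims can be checked with a minimal sketch. It uses only the figures quoted above; the variable names are illustrative and are not part of the WDCD tooling.

```python
# Lowest reported score per scenario, as quoted in the paragraph above.
bottom_scores = {
    "Business Rules": 1.55,
    "Data Boundary": 1.92,
    "Resource Constraints": 2.05,
    "Security Compliance": 2.04,
    "Engineering Standards": 2.38,
}

# The scenario with the lowest floor is the one the article calls hardest.
hardest = min(bottom_scores, key=bottom_scores.get)

# Top-to-bottom gap for Business Rules (grok-4 at 3.5 versus the 1.55 floor).
business_rules_top = 3.5
business_rules_range = round(business_rules_top - bottom_scores["Business Rules"], 2)

print(hardest, business_rules_range)
```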
Security Compliance Scenario Has the Most Concentrated Scores
The score distribution in the Security Compliance scenario is relatively concentrated. grok-4 leads with 3.86/4 and qwen3-max trails with 2.04/4, a range of 1.82 points. The range is wide, but most mid-ranked models score between 2.7 and 3.2, indicating that most models hold up similarly under security compliance constraints.
Significant Model Specialization Imbalance
Claude-sonnet-4.6 scores 3.56/4 in Engineering Standards but only 1.8/4 in Business Rules, a gap of 1.76 points and the most severe imbalance in this test. Claude-opus-4.7 shows a 1.22-point gap between Engineering Standards (3.42/4) and Resource Constraints (2.2/4). GPT-5.5 has a 1.42-point gap between Engineering Standards (3.34/4) and Data Boundary (1.92/4). These gaps indicate that a model's compliance capability differs structurally from one constraint type to another.
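As a rough illustration, the imbalance figures can be reproduced as the difference between a model's best and worst quoted scores. The sketch below assumes only the two scenario scores the text gives for each model, since the full scenario matrix is not reproduced here; the `imbalance` helper is a hypothetical name.

```python
# Only the two scenario scores quoted per model in the paragraph above.
scores = {
    "claude-sonnet-4.6": {"Engineering Standards": 3.56, "Business Rules": 1.8},
    "claude-opus-4.7": {"Engineering Standards": 3.42, "Resource Constraints": 2.2},
    "gpt-5.5": {"Engineering Standards": 3.34, "Data Boundary": 1.92},
}

def imbalance(per_scenario):
    """Gap between a model's highest and lowest quoted scenario score."""
    values = per_scenario.values()
    return round(max(values) - min(values), 2)

gaps = {model: imbalance(per_scenario) for model, per_scenario in scores.items()}
most_imbalanced = max(gaps, key=gaps.get)

for model, gap in gaps.items():
    print(f"{model}: {gap}")
```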
grok-4 Consistently Leads Across All Scenarios
grok-4 scores 3.4/4, 3.62/4, 3.5/4, 3.86/4, and 3.7/4 across the five scenarios and ranks first in each. It leads the second-place model by more than 0.6 points in both Business Rules (3.5/4 versus 2.85/4) and Security Compliance (3.86/4 versus 3.24/4). Gemini-3.1-pro comes close with 3.64/4 in Engineering Standards but scores only 3.05/4 in Resource Constraints, a relative weakness under resource-type constraints.
Recommendations for Enterprise Model Selection
Enterprises that require strict business rule enforcement should prioritize grok-4, whose 3.5/4 is well ahead of the 2.85/4 shared by second-place gemini-3.1-pro and glm-4.6. For workloads centered on security compliance, both grok-4 and claude-opus-4.7 (3.24/4, ranked second) are worth considering. Where Engineering Standards matter most, claude-sonnet-4.6 and gpt-o3 both reach 3.56/4 and can serve as alternatives, but their low scores in the Business Rules scenario remain a risk.
A model's compliance capability can drop sharply when the constraint type shifts from Security Compliance to Business Rules. Enterprises should therefore match model selection to the scenario rather than rely on a single overall ranking.
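Scenario-matched selection can be sketched as a per-scenario shortlist. The example below is a minimal illustration that uses only the Business Rules and Security Compliance scores quoted in this article, so the table is partial. The `shortlist` helper and the 3.0 threshold are hypothetical and are not part of the WDCD methodology.

```python
# Partial score table: only figures quoted in this article.
partial_scores = {
    "Business Rules": {
        "grok-4": 3.5,
        "gemini-3.1-pro": 2.85,
        "glm-4.6": 2.85,
        "claude-sonnet-4.6": 1.8,
        "doubao-pro": 1.55,
        "qwen3-max": 1.55,
    },
    "Security Compliance": {
        "grok-4": 3.86,
        "claude-opus-4.7": 3.24,
        "qwen3-max": 2.04,
    },
}

def shortlist(scenario, min_score):
    """Models meeting a minimum score in one scenario, best first."""
    ranked = sorted(
        partial_scores[scenario].items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return [model for model, score in ranked if score >= min_score]

# The threshold is an assumed acceptance bar chosen for illustration.
for scenario in partial_scores:
    print(scenario, shortlist(scenario, min_score=3.0))
```

With the same threshold, the shortlist differs by scenario, which is the reason to select per scenario rather than from one overall ranking.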
Data Source: YZ Index WDCD Compliance Ranking | Run #211 · Scenario Matrix | Evaluation Methodology
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original article when reprinting