WDCD Review: Safety Compliance Becomes the Biggest Weakness, Highest Score Among 11 Models Only 3.57

Jun 28, 2026 · Source: Winzheng Index

WDCD Compliance Test · Safety Compliance · Scenario Comparison · Model Selection

In the WDCD Compliance Test, the safety compliance scenario scored the lowest on average across all models, with the highest score being only 3.57/4 for deepseek-v4-pro, while claude-sonnet-4.6 scored only 2.57/4.

Safety Compliance Becomes the Most Difficult Scenario

Safety compliance scores were generally low across the five scenarios. deepseek-v4-pro ranked first with 3.57/4, claude-opus-4.7 and qwen3-max both scored 3.43/4, and gemini-3.1-pro scored 3.29/4. The lowest scorer, claude-sonnet-4.6, reached only 2.57/4, a full point behind the leader. By contrast, gemini-3.1-pro achieved a perfect 4/4 in both the data boundary and resource limitation scenarios, which suggests that models hold up noticeably worse across the three-round dialogues under safety compliance constraints than in other dimensions.

Safety Compliance Still Shows the Largest Differentiation

The safety compliance scenario had not only the lowest average score but also the largest spread between models: scores ranged from 3.57/4 down to 2.57/4, a 1-point gap. In the engineering standards scenario, the highest score was doubao-pro at 3.8/4 and the lowest was qwen3-max at 2.8/4, also a 1-point gap, but the overall average was higher. In the business rules scenario, grok-4 scored 4/4 while ernie-4.5 and gpt-o3 both scored 3.14/4, a gap of 0.86 points, the next largest spread. In the data boundary and resource limitation scenarios, the score gaps were both less than 0.75 points, indicating relatively concentrated model performance.
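The gap figures quoted for these scenarios are plain highest-minus-lowest arithmetic. A minimal sketch, assuming only the highest and lowest scores stated in this section; the names `QUOTED_EXTREMES`, `score_gap`, and `GAPS` are illustrative and not part of any WDCD tooling:

```python
# Highest and lowest scores per scenario, as quoted in the article.
# Only scenarios whose extremes are stated explicitly are included.
QUOTED_EXTREMES = {
    "safety compliance": (3.57, 2.57),      # deepseek-v4-pro vs claude-sonnet-4.6
    "engineering standards": (3.8, 2.8),    # doubao-pro vs qwen3-max
    "business rules": (4.0, 3.14),          # grok-4 vs ernie-4.5 / gpt-o3
}


def score_gap(high: float, low: float) -> float:
    """Return the high-minus-low spread, rounded to two decimals."""
    return round(high - low, 2)


GAPS = {name: score_gap(high, low) for name, (high, low) in QUOTED_EXTREMES.items()}

# Print scenarios from widest to narrowest spread.
for name, gap in sorted(GAPS.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gap:.2f}")
```

Rounding to two decimals matches the precision of the published scores and hides floating-point noise in the subtraction.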

Obvious Imbalance Across Scenarios

claude-sonnet-4.6 scored 3.57/4 in business rules but only 2.57/4 in safety compliance, a 1-point gap between scenarios. gemini-3.1-pro achieved 4/4 in both data boundary and resource limitation, but scored only 3.29/4 in safety compliance and 3.6/4 in engineering standards, showing a clear weakness under safety-related constraints. grok-4 scored 4/4 in business rules and 3.8/4 in engineering standards, but only 3.29/4 in safety compliance. doubao-pro led in engineering standards with 3.8/4, but scored only 3/4 in data boundary and 2.88/4 in resource limitation, also showing a significant imbalance.
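The per-model imbalance described above can be checked the same way: take each model's best and worst quoted scenario and subtract. A minimal sketch, assuming only the scores mentioned in the preceding paragraph (the article does not quote every scenario for every model, so the result reflects partial data); `MODEL_SCORES` and `imbalance` are hypothetical names:

```python
# Scenario scores per model, limited to those quoted in the paragraph above.
MODEL_SCORES = {
    "claude-sonnet-4.6": {"business rules": 3.57, "safety compliance": 2.57},
    "gemini-3.1-pro": {
        "data boundary": 4.0,
        "resource limitation": 4.0,
        "safety compliance": 3.29,
        "engineering standards": 3.6,
    },
    "grok-4": {
        "business rules": 4.0,
        "engineering standards": 3.8,
        "safety compliance": 3.29,
    },
    "doubao-pro": {
        "engineering standards": 3.8,
        "data boundary": 3.0,
        "resource limitation": 2.88,
    },
}


def imbalance(scores: dict[str, float]) -> tuple[str, float]:
    """Return (weakest scenario, best-minus-worst gap) for one model."""
    weakest = min(scores, key=scores.get)
    gap = round(max(scores.values()) - scores[weakest], 2)
    return weakest, gap


for model, scores in MODEL_SCORES.items():
    weakest, gap = imbalance(scores)
    print(f"{model}: weakest in {weakest}, gap {gap:.2f}")
```

On these quoted figures, safety compliance is the weakest scenario for three of the four models, and doubao-pro is weakest in resource limitation.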

Specific Recommendations for Enterprise Model Selection

For enterprises that need strict data boundary and resource limitation enforcement, gemini-3.1-pro is currently the most stable choice, scoring 4/4 in both scenarios. Where business rule implementation matters most, grok-4 stands out with 4/4 and can be prioritized. Where engineering standards matter most, doubao-pro and grok-4 are tied at 3.8/4 and either can be considered. Where safety compliance requirements are high, no model currently scores above 3.57/4; pair the model with manual review rather than relying on a single model for now.

In the resource limitation scenario, gpt-o3 scored only 2.75/4, the lowest among the 11 models, and claude-sonnet-4.6 also fell below 3 with 2.88/4, indicating that some models readily exceed resource limits after multiple rounds of interference.

Safety compliance remains the biggest weakness in current models' compliance ability. Enterprises should evaluate this scenario with its own weight when selecting models.
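One way to weight safety compliance separately is a weighted mean in which that scenario carries a larger share than the others. The sketch below is an illustration under stated assumptions, not the WDCD methodology: the weights are hypothetical, and because the article does not quote a business rules score for gemini-3.1-pro, the function renormalizes over the scenarios that do have a score.

```python
# Hypothetical weights: safety compliance counts more than any other scenario.
WEIGHTS = {
    "safety compliance": 0.40,
    "business rules": 0.15,
    "engineering standards": 0.15,
    "data boundary": 0.15,
    "resource limitation": 0.15,
}

# gemini-3.1-pro scores quoted in the article (business rules is not quoted).
GEMINI_QUOTED = {
    "safety compliance": 3.29,
    "engineering standards": 3.6,
    "data boundary": 4.0,
    "resource limitation": 4.0,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted mean over the scenarios present in `scores`.

    Weights are renormalized so that missing scenarios do not drag the
    result toward zero.
    """
    known = {name: value for name, value in scores.items() if name in weights}
    total_weight = sum(weights[name] for name in known)
    if total_weight == 0:
        raise ValueError("no scored scenario has a weight")
    return sum(weights[name] * value for name, value in known.items()) / total_weight


plain_mean = sum(GEMINI_QUOTED.values()) / len(GEMINI_QUOTED)
print(f"plain mean: {plain_mean:.2f}, safety-weighted: {weighted_score(GEMINI_QUOTED):.2f}")
```

With these assumed weights, a model that is strong elsewhere but weaker in safety compliance scores lower than its unweighted average, which is the effect a separate weighting is meant to surface.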

Data source: YZ Index WDCD Compliance Leaderboard | Run #202 · Scenario Matrix | Evaluation Methodology
