WDCD Review: Safety Compliance Is the Biggest Weakness, With a Top Score of Only 3.57 Among 11 Models

In the WDCD Compliance Test, the safety compliance scenario had the lowest average score across all models: the top score was only 3.57/4 (deepseek-v4-pro), and claude-sonnet-4.6 scored just 2.57/4.

Safety Compliance Is the Most Difficult Scenario

Across the five scenarios, safety compliance scores were generally low. deepseek-v4-pro ranked first with 3.57/4, claude-opus-4.7 and qwen3-max both scored 3.43/4, and gemini-3.1-pro scored 3.29/4. claude-sonnet-4.6 scored lowest at 2.57/4, a full point behind the leader. By contrast, gemini-3.1-pro achieved a perfect 4/4 in both the data boundary and resource limitation scenarios, which suggests that models hold up markedly less well over three-round dialogues under safety compliance constraints than in the other dimensions.

Safety Compliance Still Shows the Largest Differentiation

The safety compliance scenario not only had the lowest average score but also one of the widest spreads between models: scores ranged from 3.57/4 to 2.57/4, a gap of 1 point. The engineering standards scenario showed the same 1-point gap, from 3.8/4 (doubao-pro) to 2.8/4 (qwen3-max), but its overall average was higher. In the business rules scenario, grok-4 scored 4/4 while ernie-4.5 and gpt-o3 both scored 3.14/4, a gap of 0.86 points and the next widest spread. In the data boundary and resource limitation scenarios, the score gaps were both less than 0.75 points, indicating relatively concentrated model performance.
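The gaps quoted above are simple differences between the highest and lowest score in each scenario. A minimal sketch that reproduces them from the figures cited in this review follows; the names `cited_extremes` and `score_gap` are illustrative and not part of the WDCD tooling, and only the three scenarios whose extremes are quoted here are included.

```python
# Highest and lowest scores per scenario, as quoted in this review (out of 4).
# Data boundary and resource limitation are omitted because their extremes
# are not quoted alongside the gap figures.
cited_extremes = {
    "safety compliance": (3.57, 2.57),
    "engineering standards": (3.8, 2.8),
    "business rules": (4.0, 3.14),
}


def score_gap(high, low):
    """Gap between the top and bottom score, rounded to two decimals."""
    return round(high - low, 2)


gaps = {name: score_gap(high, low) for name, (high, low) in cited_extremes.items()}

# List scenarios from the widest gap to the narrowest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gap:.2f}")
```

Rounding to two decimals matches the precision of the published scores and avoids floating-point noise in the subtraction.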

Clear Imbalance Across Scenarios

claude-sonnet-4.6 scored 3.57/4 in business rules but only 2.57/4 in safety compliance, a 1-point gap between scenarios. gemini-3.1-pro achieved 4/4 in both data boundary and resource limitation, but scored only 3.29/4 in safety compliance and 3.6/4 in engineering standards, a clear weakness under safety-related constraints. grok-4 scored 4/4 in business rules and 3.8/4 in engineering standards, but only 3.29/4 in safety compliance. doubao-pro led in engineering standards with 3.8/4 but scored only 3/4 in data boundary and 2.88/4 in resource limitation, which is also a significant imbalance.
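One way to quantify this imbalance is the spread between a model's best and worst scenario. The sketch below uses only the scores quoted in this section, so each model's spread covers just the scenarios mentioned here rather than all five; `cited_scores` and `scenario_spread` are illustrative names.

```python
# Scores quoted in this section (out of 4). Scenarios not quoted are omitted.
cited_scores = {
    "claude-sonnet-4.6": {"business rules": 3.57, "safety compliance": 2.57},
    "gemini-3.1-pro": {
        "data boundary": 4.0,
        "resource limitation": 4.0,
        "safety compliance": 3.29,
        "engineering standards": 3.6,
    },
    "grok-4": {
        "business rules": 4.0,
        "engineering standards": 3.8,
        "safety compliance": 3.29,
    },
    "doubao-pro": {
        "engineering standards": 3.8,
        "data boundary": 3.0,
        "resource limitation": 2.88,
    },
}


def scenario_spread(scores):
    """Return (spread, best scenario, worst scenario) for one model."""
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return round(scores[best] - scores[worst], 2), best, worst


spreads = {model: scenario_spread(scores) for model, scores in cited_scores.items()}

for model, (spread, best, worst) in spreads.items():
    print(f"{model}: spread {spread:.2f} (best: {best}, worst: {worst})")
```

When two scenarios tie for a model's best score, `max` returns the first one listed, so the reported best scenario for gemini-3.1-pro depends on dictionary order.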

Specific Recommendations for Enterprise Model Selection

For enterprises that need strict data boundary and resource limitation enforcement, gemini-3.1-pro is currently the most stable choice, scoring 4/4 in both scenarios. Where business rule implementation matters most, grok-4 stands out with 4/4 and can be prioritized. Where engineering standards requirements are high, doubao-pro and grok-4 are tied at 3.8/4 and either can be considered. Where safety compliance requirements are high, no model currently scores above 3.57/4; models should be used together with manual review, and enterprises should not rely on a single model for now.

In the resource limitation scenario, gpt-o3 scored only 2.75/4, the lowest among the 11 models, and claude-sonnet-4.6 scored only 2.88/4, indicating that some models easily exceed resource limits after multiple rounds of interference.

Safety compliance remains the biggest weakness in current models' compliance ability. Enterprises should weight this scenario separately when evaluating models for selection.
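As a rough illustration of what a separately weighted evaluation could look like, the sketch below compares a plain average with a safety-weighted average for gemini-3.1-pro, using the four scenario scores quoted in this review. The weights are hypothetical and chosen only for illustration; they are not part of the WDCD methodology, and `weighted_score` is an illustrative name.

```python
def weighted_score(scores, weights):
    """Weighted average over the scenarios present in `scores`.

    Weights are normalized over the scenarios actually scored, so a model
    with a missing scenario is still averaged on a 0-4 scale.
    """
    missing = set(scores) - set(weights)
    if missing:
        raise ValueError(f"no weight given for: {sorted(missing)}")
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# gemini-3.1-pro scores quoted in this review (out of 4). Its business rules
# score is not quoted, so that scenario is left out.
gemini_scores = {
    "safety compliance": 3.29,
    "data boundary": 4.0,
    "resource limitation": 4.0,
    "engineering standards": 3.6,
}

# Hypothetical weights: safety compliance counts twice as much as the others.
weights = {
    "safety compliance": 0.4,
    "data boundary": 0.2,
    "resource limitation": 0.2,
    "engineering standards": 0.2,
}

plain_mean = sum(gemini_scores.values()) / len(gemini_scores)
safety_weighted = weighted_score(gemini_scores, weights)

print(f"plain mean: {plain_mean:.2f}, safety-weighted: {safety_weighted:.2f}")
```

Because safety compliance is this model's weakest scenario, raising its weight pulls the overall score below the plain average, which is the effect a separate weighting is meant to expose.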

Data source: YZ Index WDCD Compliance Leaderboard | Run #202 · Scenario Matrix | Evaluation Methodology