The WDCD compliance test uses a three-round dialogue design to expose where models fail under realistic constraints. Pilot data shows that the business rules scenario is a shared weakness: the highest score any model reaches is only 2.5. The safety compliance scenario, by contrast, produces the widest gap between models.
Business Rules Become the Hardest Scenario, All Models Fail Collectively
In the business rules scenario, Doubao-pro, GPT-5.5, and GPT-o3 are tied for the highest score of 2.5, while the remaining 8 models are all stuck at 2 or 1.5. When R3 directly pressures models to violate pricing rules or approval processes, most models quickly compromise. In contrast, the top score in each of three other scenarios is 3: Claude-opus-4.7 leads the data boundary scenario with 3, the best result in the resource limitation scenario is also 3, and Gemini-2.5-pro scores 3 in engineering standards. The low scores in business rules indicate that current models are far less reliable when handling internal corporate process constraints than when dealing with external safety red lines.
Safety Compliance Shows Greatest Differentiation, Claude-sonnet Takes the Lead
The safety compliance scenario is the biggest watershed. Claude-sonnet-4.6 and Qwen3-max are tied at 3.5, while Ernie-4.5 is at the bottom with only 2, a gap of 1.5 points. If we calculate the variance of all model scores per scenario, the dispersion in safety compliance is significantly higher than in the other four scenarios. After R2 introduces irrelevant topics as a distraction, Claude-sonnet-4.6 still maintains its compliance boundaries in R3, while Ernie-4.5 repeatedly relents under pressure. This is consistent with the tendency of financial and healthcare enterprises to prefer Claude or Qwen.
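The dispersion comparison can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation's own tooling: it uses only the scores quoted in this article rather than the full score matrix, so its numbers will differ from a calculation over all models. The names `quoted_scores` and `scenario_dispersion` are hypothetical, and the choice of population variance is an assumption.

```python
from statistics import pvariance

# Only the scores quoted in this article; the full score matrix is not reproduced here.
quoted_scores = {
    "safety compliance": {
        "Claude-sonnet-4.6": 3.5,
        "Qwen3-max": 3.5,
        "Doubao-pro": 3.0,
        "Ernie-4.5": 2.0,
    },
    "business rules": {
        "Doubao-pro": 2.5,
        "GPT-5.5": 2.5,
        "GPT-o3": 2.5,
        "Claude-opus-4.7": 1.5,
        "Claude-sonnet-4.6": 1.5,
    },
}


def scenario_dispersion(scores):
    """Return {scenario: (max - min, population variance)} for each scenario."""
    result = {}
    for scenario, by_model in scores.items():
        values = list(by_model.values())
        result[scenario] = (max(values) - min(values), pvariance(values))
    return result


dispersion = scenario_dispersion(quoted_scores)
for scenario, (spread, variance) in dispersion.items():
    print(f"{scenario}: range={spread:.1f}, variance={variance:.3f}")
```

On this quoted subset, safety compliance shows both the larger range and the larger variance, which matches the direction of the claim above. The magnitudes should not be read as the benchmark's actual dispersion figures.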
The low scores in the business rules scenario expose the models' deficiency in understanding "implicit corporate contracts," rather than a mere lack of instruction-following ability.
Lopsided Performance Is Common, Six Models Show Scenario Score Gaps Exceeding 1 Point
Claude-opus-4.7 scored 3 in data boundary but only 1.5 in business rules, a gap of 1.5 points. Claude-sonnet-4.6 scored 3.5 in safety compliance compared to 1.5 in business rules, a gap of 2 points. Gemini-2.5-pro scored 3 in engineering standards but only 1.5 in data boundary, revealing a capability gap between code compliance and data privacy. Doubao-pro scored 3 in safety compliance but only 2 in engineering standards, suggesting more extensive training in compliance review than in adherence to engineering standards.
- Strong in data boundary but weak in business rules: Claude-opus-4.7
- Strongest in safety compliance, weakest in business rules: Claude-sonnet-4.6
- Outstanding in engineering standards, weak in data boundary: Gemini-2.5-pro
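The gap figures above are simply each model's best scenario score minus its worst. A minimal sketch, assuming only the scores quoted in this article (a model's true gap over all five scenarios could be larger); `quoted`, `lopsidedness`, and the 1-point threshold constant are illustrative names, not part of WDCD:

```python
# Scores quoted in this article, keyed by model, then scenario.
quoted = {
    "Claude-opus-4.7": {"data boundary": 3.0, "business rules": 1.5},
    "Claude-sonnet-4.6": {"safety compliance": 3.5, "business rules": 1.5},
    "Gemini-2.5-pro": {"engineering standards": 3.0, "data boundary": 1.5},
    "Doubao-pro": {
        "safety compliance": 3.0,
        "business rules": 2.5,
        "engineering standards": 2.0,
    },
}

GAP_THRESHOLD = 1.0  # "exceeding 1 point", as in the section heading


def lopsidedness(by_scenario):
    """Return (strongest scenario, weakest scenario, score gap) for one model."""
    best = max(by_scenario, key=by_scenario.get)
    worst = min(by_scenario, key=by_scenario.get)
    return best, worst, by_scenario[best] - by_scenario[worst]


gaps = {model: lopsidedness(scores) for model, scores in quoted.items()}
flagged = sorted(m for m, (_, _, gap) in gaps.items() if gap > GAP_THRESHOLD)

for model, (best, worst, gap) in gaps.items():
    print(f"{model}: strongest={best}, weakest={worst}, gap={gap:.1f}")
```

On the quoted scores alone, Doubao-pro's gap is exactly 1 point, so it is not flagged by a strict "greater than 1" test. The three models in the list above are.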
Enterprise Selection Advice: Match by Scenario Rather Than Overall Rankings
For enterprises that emphasize internal approval and pricing strategies, Doubao-pro or the GPT series are preferred: although they are not at the top in safety compliance, their business rules scores give them a clear advantage. For financial and government scenarios that require strict data boundary and safety compliance, Claude-sonnet-4.6 and Qwen3-max are more prudent choices. R&D teams with demanding engineering standards may want to focus on Gemini-2.5-pro.
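Matching by scenario amounts to looking up the top scorers for the constraint type that matters most, instead of reading an overall ranking. A minimal sketch under the same assumption as before: the table holds only the scores quoted in this article, and `scenario_scores` and `shortlist` are hypothetical names.

```python
# Scores quoted in this article, keyed by scenario, then model.
scenario_scores = {
    "business rules": {
        "Doubao-pro": 2.5,
        "GPT-5.5": 2.5,
        "GPT-o3": 2.5,
        "Claude-opus-4.7": 1.5,
        "Claude-sonnet-4.6": 1.5,
    },
    "safety compliance": {
        "Claude-sonnet-4.6": 3.5,
        "Qwen3-max": 3.5,
        "Doubao-pro": 3.0,
        "Ernie-4.5": 2.0,
    },
    "engineering standards": {"Gemini-2.5-pro": 3.0, "Doubao-pro": 2.0},
}


def shortlist(scenario, scores=scenario_scores):
    """Return the models tied for the top quoted score in one scenario."""
    by_model = scores[scenario]
    top = max(by_model.values())
    return sorted(model for model, score in by_model.items() if score == top)


for scenario in scenario_scores:
    print(f"{scenario}: {', '.join(shortlist(scenario))}")
```

A real selection process would weigh several scenarios at once and use the complete matrix. This sketch only shows why the shortlist changes with the scenario.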
Overall, no current model leads across all five scenarios. Enterprises should abandon the fantasy of finding "the strongest model" and instead make targeted procurement based on the types of core business constraints. The scoring mechanism of WDCD also reminds us that a model's "compliance" capability is becoming a core indicator for the next phase of selection.
If the main ranking incorporates WDCD weights in the future, models with severe lopsidedness may experience drastic ranking fluctuations.
Data source: YZ Index WDCD Compliance Rankings | Run #125 · Scenario Matrix | Evaluation Methodology
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article