Resource Constraints Prove the Hardest Scenario in WDCD; 豆包 Scores 3.5 Points in Business Rules, Surpassing GPT

The most striking result of the WDCD five-scenario evaluation is that resource constraints was the lowest-scoring scenario overall: even the champion, Claude Opus 4.7, managed only 2.67 points, while 豆包Pro fell all the way to 1.5. In other words, under the most common enterprise constraints (compute quotas, concurrency limits, timeout and retry policies), current large models collectively show their weakest rule-abiding capability.

Why Resource Constraints Are the Biggest Stumbling Block

The three-round dialogue design applies direct pressure in R3: the model must refuse an unauthorized scale-up even when resources are exhausted. Claude Opus 4.7 leads with 2.67 points, but its margin over second-place Claude Sonnet 4.6 (2.33) is only 0.34 points, so the scenario differentiates models poorly while dragging all of them down toward the passing line. GPT-5.5 and Qwen3-Max both score 2.17, and 豆包Pro bottoms out at 1.5, exposing how easily it is derailed into abandoning its constraints during multi-round resource negotiations.

Business Rules Scenario Shows the Greatest Differentiation

By contrast, 豆包Pro takes the lead in the business rules scenario with 3.5 points, followed closely by GPT-5.5 at 3.33, while Gemini 3.1 Pro and Grok-4 share the bottom at 2.33; the resulting 1.17-point range is the largest of the five scenarios. Even after the irrelevant-topic interference of R2, 豆包 still holds in R3 to the hard rule that only specific roles may modify the approval flow, suggesting more thorough training on enterprise process constraints.

豆包Pro's 1.5 points in resource constraints versus 3.5 in business rules is a 2-point spread, the most severe imbalance of any model in the evaluation.
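The gap and range figures above are easy to sanity-check. The sketch below uses only the scores quoted in this article (it is not the full WDCD matrix), and the helper name `scenario_range` is illustrative:

```python
# Per-scenario scores quoted in the article (points). Entries the
# article does not report are simply omitted, not treated as zero.
scores = {
    "resource_constraints": {
        "Claude Opus 4.7": 2.67,
        "Claude Sonnet 4.6": 2.33,
        "GPT-5.5": 2.17,
        "Qwen3-Max": 2.17,
        "豆包Pro": 1.5,
    },
    "business_rules": {
        "豆包Pro": 3.5,
        "GPT-5.5": 3.33,
        "Gemini 3.1 Pro": 2.33,
        "Grok-4": 2.33,
    },
}

def scenario_range(scenario: str) -> float:
    """Spread between the best and worst reported score in one scenario."""
    vals = scores[scenario].values()
    return round(max(vals) - min(vals), 2)

# Largest range of the five scenarios: 3.5 - 2.33 = 1.17 points.
print(scenario_range("business_rules"))

# 豆包Pro's cross-scenario imbalance: 3.5 - 1.5 = 2.0 points.
print(round(scores["business_rules"]["豆包Pro"]
            - scores["resource_constraints"]["豆包Pro"], 2))
```

The same pattern extends to any scenario once the remaining scores are filled in.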

Model Specialization Map

  • GPT-5.5: safety compliance 3.5 points (the top score), but resource constraints only 2.17; suited to finance and healthcare scenarios with the strictest compliance requirements.
  • Claude Opus 4.7: resource constraints 2.67 points plus engineering standards 2.75, a dual champion; suited to R&D teams that need strict compute control and code standards.
  • DeepSeek-V4-Pro: business rules 3 points (acceptable) but resource constraints only 2, a 1-point gap that points to weakness in long-context resource management.
  • Qwen3-Max: safety compliance 3.33 points (impressive), but engineering standards drop to 2, leaving its engineering-constraint ability clearly weaker than its safety side.

Specific Recommendations for Enterprise Selection

If the enterprise's core pain points are API quotas and concurrency control, Claude Opus 4.7 is the first choice; if business rules such as approval flows and permission matrices are the hardest constraints, 豆包Pro is currently the most stable performer; for safety-compliance scenarios, GPT-5.5 and GPT-o3 remain the top picks; for engineering standards, consider the two Claude models or Gemini 2.5 Pro.

Overall, no model leads in all five scenarios, so selection must be evaluated scenario by scenario. The collectively low scores in resource constraints are also a reminder to vendors that the next round of model iteration should focus on rule-abiding capability in multi-round resource negotiations.
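The "no single leader" claim can be made concrete by taking the best reported model per scenario. The sketch below again uses only the scores this article quotes, so it is illustrative rather than a reconstruction of the full leaderboard:

```python
# Scores quoted in this article, grouped by scenario (partial by design:
# only the models the article mentions for each scenario are listed).
reported = {
    "resource_constraints": {
        "Claude Opus 4.7": 2.67, "Claude Sonnet 4.6": 2.33,
        "GPT-5.5": 2.17, "Qwen3-Max": 2.17, "豆包Pro": 1.5,
    },
    "business_rules": {
        "豆包Pro": 3.5, "GPT-5.5": 3.33, "DeepSeek-V4-Pro": 3.0,
        "Gemini 3.1 Pro": 2.33, "Grok-4": 2.33,
    },
    "safety_compliance": {"GPT-5.5": 3.5, "Qwen3-Max": 3.33},
    "engineering_standards": {"Claude Opus 4.7": 2.75, "Qwen3-Max": 2.0},
}

# Pick the top-scoring model in each scenario.
best = {scenario: max(models, key=models.get)
        for scenario, models in reported.items()}

for scenario, model in best.items():
    print(f"{scenario}: {model}")
```

Three different models top the four scenarios listed here, which is exactly why selection has to be matched to the enterprise's dominant constraint type.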

Prediction: if the resource-constraints scenario's average score has not surpassed 3.0 points by Q3 2025, any model claiming "enterprise-grade reliability" will lack credibility.


Data source: YZ Index WDCD Rule-Abiding Ranking | Run #120 · Scenario Matrix | Evaluation Methodology