DCD · Dynamic Context Decay
After 5000 characters of distraction, does the AI still remember what you said three minutes ago?
DCD Leaderboard
| # | Model | WDCD | R1 Understanding | R2 Resistance | R3 Integrity | Main Score | Rank Change vs Main Leaderboard |
|---|---|---|---|---|---|---|---|
| #1 | Qwen3 Max | 65.0 | 100% | 90% | 35% | 77.2 | ↑7 |
| #2 | Claude Sonnet 4.6 | 62.5 | 100% | 100% | 25% | 83.5 | ↓1 |
| #3 | DeepSeek V4 Pro | 62.5 | 100% | 80% | 35% | 77.7 | ↑4 |
| #4 | 文心一言 4.5 | 62.5 | 80% | 90% | 40% | 78.2 | ↑2 |
| #5 | GPT-o3 | 62.5 | 100% | 90% | 30% | 75.7 | ↑4 |
| #6 | Claude Opus 4.7 | 60.0 | 100% | 80% | 30% | 81.1 | ↓3 |
| #7 | Gemini 2.5 Pro | 60.0 | 100% | 90% | 25% | 78.5 | ↓2 |
| #8 | Gemini 3.1 Pro | 60.0 | 100% | 100% | 20% | 79.2 | ↓4 |
| #9 | 豆包 Pro | 55.0 | 70% | 100% | 25% | 82.6 | ↓7 |
| #10 | GPT-5.5 | 55.0 | 100% | 80% | 20% | 73.2 | — |
| #11 | Grok 4 | 50.0 | 100% | 80% | 10% | 49.2 | — |
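The WDCD column appears to be a weighted sum of the three round scores. The sketch below is not the documented formula: the weights (0.25 for R1, 0.25 for R2, 0.50 for R3) are an assumption inferred from the published rows, which they happen to reproduce. The function name `wdcd_score` is hypothetical.

```python
# Assumed weights, inferred from the leaderboard rows; not an official formula.
WEIGHTS = (0.25, 0.25, 0.50)  # R1, R2, R3

def wdcd_score(r1: float, r2: float, r3: float) -> float:
    """Weighted sum of the three round scores, each given in percent."""
    w1, w2, w3 = WEIGHTS
    return w1 * r1 + w2 * r2 + w3 * r3

# (model, R1 %, R2 %, R3 %, published WDCD), copied from the leaderboard.
ROWS = [
    ("Qwen3 Max", 100, 90, 35, 65.0),
    ("Claude Sonnet 4.6", 100, 100, 25, 62.5),
    ("DeepSeek V4 Pro", 100, 80, 35, 62.5),
    ("文心一言 4.5", 80, 90, 40, 62.5),
    ("GPT-o3", 100, 90, 30, 62.5),
    ("Claude Opus 4.7", 100, 80, 30, 60.0),
    ("Gemini 2.5 Pro", 100, 90, 25, 60.0),
    ("Gemini 3.1 Pro", 100, 100, 20, 60.0),
    ("豆包 Pro", 70, 100, 25, 55.0),
    ("GPT-5.5", 100, 80, 20, 55.0),
    ("Grok 4", 100, 80, 10, 50.0),
]

for model, r1, r2, r3, published in ROWS:
    computed = wdcd_score(r1, r2, r3)
    status = "matches" if abs(computed - published) < 1e-9 else "differs"
    print(f"{model}: computed {computed:.1f}, published {published:.1f} ({status})")
```

If these weights are right, R3 counts as much as R1 and R2 combined, which would explain why models with perfect R1 and R2 scores can still rank below models that hold constraints better in the final round.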
Constraint Adherence Curve
Each row represents a model. The three colored bars show the model's scoring rate in R1 (Understanding), R2 (Resistance), and R3 (Integrity), respectively.
Performance Across Five Constraint Types
Which constraint types is each model most likely to fail?
| Model | Data Boundary | Resource Limit | Business Rule | Security | Engineering |
|---|---|---|---|---|---|
| Qwen3 Max | 88 | 50 | 75 | 63 | 50 |
| Claude Sonnet 4.6 | 63 | 63 | 75 | 63 | 50 |
| DeepSeek V4 Pro | 63 | 38 | 88 | 75 | 50 |
| 文心一言 4.5 | 63 | 63 | 63 | 63 | 63 |
| GPT-o3 | 63 | 50 | 63 | 88 | 50 |
| Claude Opus 4.7 | 63 | 75 | 63 | 38 | 63 |
| Gemini 2.5 Pro | 75 | 50 | 75 | 63 | 38 |
| Gemini 3.1 Pro | 63 | 50 | 75 | 63 | 50 |
| 豆包 Pro | 50 | 50 | 75 | 63 | 38 |
| GPT-5.5 | 63 | 50 | 38 | 75 | 50 |
| Grok 4 | 50 | 50 | 63 | 50 | 38 |
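One way to read the table above is to pick out each model's lowest-scoring constraint type. The sketch below does that with the published values; the helper name `weakest_categories` is hypothetical, and ties are reported together because the table gives no basis for breaking them.

```python
CATEGORIES = ["Data Boundary", "Resource Limit", "Business Rule", "Security", "Engineering"]

# Scores copied from the table above, in the same column order as CATEGORIES.
SCORES = {
    "Qwen3 Max": [88, 50, 75, 63, 50],
    "Claude Sonnet 4.6": [63, 63, 75, 63, 50],
    "DeepSeek V4 Pro": [63, 38, 88, 75, 50],
    "文心一言 4.5": [63, 63, 63, 63, 63],
    "GPT-o3": [63, 50, 63, 88, 50],
    "Claude Opus 4.7": [63, 75, 63, 38, 63],
    "Gemini 2.5 Pro": [75, 50, 75, 63, 38],
    "Gemini 3.1 Pro": [63, 50, 75, 63, 50],
    "豆包 Pro": [50, 50, 75, 63, 38],
    "GPT-5.5": [63, 50, 38, 75, 50],
    "Grok 4": [50, 50, 63, 50, 38],
}

def weakest_categories(scores: list[int]) -> list[str]:
    """Return every constraint type that shares the model's lowest score."""
    lowest = min(scores)
    return [name for name, score in zip(CATEGORIES, scores) if score == lowest]

for model, scores in SCORES.items():
    weakest = ", ".join(weakest_categories(scores))
    print(f"{model}: lowest score {min(scores)} in {weakest}")
```

A model with a flat profile, such as 文心一言 4.5, is listed under all five types, so the output is best read alongside the absolute scores rather than on its own.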
Notable Failure Cases
Cases where the model confirmed in R1 that it understood the constraint, then fully violated it in R3 (dialogues are shown with sensitive details redacted).
🏛 Why We Made WDCD
Design philosophy, differences from existing evaluations, and roadmap: the complete story of the world's first multi-round commitment evaluation framework.
📋 Methodology
How does WDCD test models? How are the three dialogue rounds designed? How is the scoring made 100% auditable?
📊 API
Get WDCD raw data for third-party research and visualization.
📰 All Cases
Complete list of all cases where R1 passed but R3 collapsed.