DCD · Dynamic Context Decay
After 5000 characters of distraction, does the AI still remember what you said three minutes ago?
DCD Leaderboard
| # | Model | WDCD | R1 Understanding | R2 Resistance | R3 Integrity | Main Score | Rank Change vs Main |
|---|---|---|---|---|---|---|---|
| #1 | Qwen3 Max | 92.5 | 100% | 80% | 95% | 93.1 | ↑1 |
| #2 | Claude Sonnet 4.6 | 90.0 | 100% | 80% | 90% | 91.2 | ↑3 |
| #3 | DeepSeek V4 Pro | 87.5 | 100% | 80% | 85% | 92.0 | ↑1 |
| #4 | Claude Opus 4.7 | 85.0 | 100% | 80% | 80% | 95.3 | ↓3 |
| #5 | ERNIE Bot 4.5 | 82.5 | 90% | 50% | 95% | 77.1 | ↑4 |
| #6 | Grok 4 | 82.5 | 100% | 80% | 75% | 88.0 | ↑1 |
| #7 | Gemini 2.5 Pro | 80.0 | 100% | 90% | 65% | 76.0 | ↑4 |
| #8 | Gemini 3.1 Pro | 80.0 | 100% | 70% | 75% | 76.3 | ↑2 |
| #9 | GPT-5.5 | 77.5 | 100% | 80% | 65% | 92.5 | ↓6 |
| #10 | GPT-o3 | 70.0 | 100% | 90% | 45% | 89.6 | ↓4 |
| #11 | Doubao Pro | 62.5 | 70% | 60% | 60% | 87.6 | ↓3 |
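The derived columns above can be checked directly from the table. The sketch below is a hedged reconstruction, not the documented scoring method: the weights 0.25, 0.25, and 0.5 are an assumption inferred from the published numbers (they reproduce every WDCD value in the table), and the rank change is assumed to be the model's rank by Main Score among the listed models minus its row position on this leaderboard. The names `ROWS`, `wdcd`, and `rank_changes` are illustrative only.

```python
# Rows copied from the leaderboard, in leaderboard order:
# (model, WDCD, R1 %, R2 %, R3 %, main score, published rank change)
ROWS = [
    ("Qwen3 Max", 92.5, 100, 80, 95, 93.1, +1),
    ("Claude Sonnet 4.6", 90.0, 100, 80, 90, 91.2, +3),
    ("DeepSeek V4 Pro", 87.5, 100, 80, 85, 92.0, +1),
    ("Claude Opus 4.7", 85.0, 100, 80, 80, 95.3, -3),
    ("ERNIE Bot 4.5", 82.5, 90, 50, 95, 77.1, +4),
    ("Grok 4", 82.5, 100, 80, 75, 88.0, +1),
    ("Gemini 2.5 Pro", 80.0, 100, 90, 65, 76.0, +4),
    ("Gemini 3.1 Pro", 80.0, 100, 70, 75, 76.3, +2),
    ("GPT-5.5", 77.5, 100, 80, 65, 92.5, -6),
    ("GPT-o3", 70.0, 100, 90, 45, 89.6, -4),
    ("Doubao Pro", 62.5, 70, 60, 60, 87.6, -3),
]


def wdcd(r1: float, r2: float, r3: float) -> float:
    """Weighted score with ASSUMED weights inferred from the table."""
    return 0.25 * r1 + 0.25 * r2 + 0.5 * r3


def rank_changes(rows):
    """Main-score rank minus leaderboard position (positive = moved up)."""
    by_main = sorted(rows, key=lambda row: row[5], reverse=True)
    main_rank = {row[0]: i + 1 for i, row in enumerate(by_main)}
    return {row[0]: main_rank[row[0]] - (i + 1) for i, row in enumerate(rows)}


# Models whose published WDCD differs from the assumed weighted sum.
mismatches = [row[0] for row in ROWS if abs(wdcd(row[2], row[3], row[4]) - row[1]) > 1e-9]

if __name__ == "__main__":
    print("WDCD mismatches:", mismatches)
    for model, delta in rank_changes(ROWS).items():
        print(f"{model}: {delta:+d}")
```

Under these assumptions R3 counts for half of the score, which is why a model with a perfect R1 can still rank low if it abandons the constraint in the final round.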
Constraint Adherence Curve
Each row represents a model. The three colored bars show the scoring rates for R1 (Understanding), R2 (Resistance), and R3 (Integrity), respectively.
Performance Across Five Constraint Types
Which model is most likely to fail under which type of constraint?
| Model | Data Boundary | Resource Limit | Business Rule | Security | Engineering |
|---|---|---|---|---|---|
| Qwen3 Max | 88 | 100 | 88 | 88 | 100 |
| Claude Sonnet 4.6 | 100 | 88 | 88 | 75 | 100 |
| DeepSeek V4 Pro | 88 | 88 | 75 | 88 | 100 |
| Claude Opus 4.7 | 88 | 100 | 75 | 75 | 88 |
| ERNIE Bot 4.5 | 88 | 75 | 88 | 75 | 88 |
| Grok 4 | 88 | 100 | 75 | 88 | 63 |
| Gemini 2.5 Pro | 50 | 75 | 88 | 88 | 100 |
| Gemini 3.1 Pro | 75 | 75 | 63 | 88 | 100 |
| GPT-5.5 | 75 | 100 | 63 | 63 | 88 |
| GPT-o3 | 63 | 63 | 100 | 50 | 75 |
| Doubao Pro | 88 | 63 | 38 | 75 | 50 |
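To answer the question above from the table, one option is to pick each model's lowest-scoring constraint type. This is a minimal sketch: the scores are copied from the table and assumed to be on a 0-100 scale, ties are all reported, and the names `TYPES`, `SCORES`, and `weakest_types` are hypothetical.

```python
# Column order matches the table header.
TYPES = ["Data Boundary", "Resource Limit", "Business Rule", "Security", "Engineering"]

# Scores copied from the table above.
SCORES = {
    "Qwen3 Max": [88, 100, 88, 88, 100],
    "Claude Sonnet 4.6": [100, 88, 88, 75, 100],
    "DeepSeek V4 Pro": [88, 88, 75, 88, 100],
    "Claude Opus 4.7": [88, 100, 75, 75, 88],
    "ERNIE Bot 4.5": [88, 75, 88, 75, 88],
    "Grok 4": [88, 100, 75, 88, 63],
    "Gemini 2.5 Pro": [50, 75, 88, 88, 100],
    "Gemini 3.1 Pro": [75, 75, 63, 88, 100],
    "GPT-5.5": [75, 100, 63, 63, 88],
    "GPT-o3": [63, 63, 100, 50, 75],
    "Doubao Pro": [88, 63, 38, 75, 50],
}


def weakest_types(scores):
    """Return every constraint type that shares the model's lowest score."""
    low = min(scores)
    return [t for t, s in zip(TYPES, scores) if s == low]


if __name__ == "__main__":
    for model, scores in SCORES.items():
        print(f"{model}: {', '.join(weakest_types(scores))} ({min(scores)})")
```

Reporting ties rather than a single winner avoids implying a ranking the data does not support, since several models have the same lowest score on two or three constraint types.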
Notable Failure Cases
Cases where the model confirmed in R1 that it understood the constraint, then fully violated it in R3 (dialogues are shown in redacted form).
🏛 Why We Made WDCD
Design philosophy, differences from existing evaluations, roadmap — the complete story of the world's first multi-round commitment evaluation framework.
📋 Methodology
How does WDCD run its tests? How are the three dialogue rounds designed? How is scoring made 100% auditable?
📊 API
Get WDCD raw data for third-party research and visualization.
📰 All Cases
Complete list of all cases where R1 passed but R3 collapsed.