The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions degrades over multi-turn dialogue, using 100% rule-based scoring with zero AI judges. In Run #157, completed on 2026-06-10 across 11 models, the average commitment decay from Round 1 to Round 3 reached 47.7%, with three models tying at the top.
Top of the leaderboard. Three models finished at 67.5 points each:
- Claude Sonnet 4.6 — 67.5 pts, -30% decay
- Gemini 2.5 Pro — 67.5 pts, -20% decay
- Qwen3 Max — 67.5 pts, -30% decay
Among the three, Gemini 2.5 Pro shows the strongest multi-turn commitment, losing only 20% of its initial constraint adherence by Round 3. Claude Sonnet 4.6 and Qwen3 Max reached the same score by starting from slightly higher Round 1 levels and then dropping more steeply, by 30%.
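To make the decay figures concrete, here is a minimal sketch of the arithmetic. It assumes that decay is the relative change in constraint adherence between Round 1 and Round 3, which is how the "losing only 20% of its initial constraint adherence" wording reads. The function name `decay_percent` and the adherence counts are hypothetical and are not taken from the published methodology or from Run #157 data.

```python
def decay_percent(r1_adherence: float, r3_adherence: float) -> float:
    """Relative change in adherence from Round 1 to Round 3, in percent.

    Assumed definition (not confirmed by the methodology): a negative value
    means fewer constraints were kept in Round 3 than in Round 1.
    """
    if r1_adherence <= 0:
        raise ValueError("Round 1 adherence must be positive")
    return 100.0 * (r3_adherence - r1_adherence) / r1_adherence


# Hypothetical (Round 1, Round 3) counts of constraints kept,
# chosen only to illustrate the arithmetic.
examples = {
    "keeps 8 of 10": (10, 8),
    "keeps 7 of 10": (10, 7),
    "keeps 1 of 10": (10, 1),
}

for label, (r1, r3) in examples.items():
    print(f"{label}: {decay_percent(r1, r3):.1f}%")
```

Under this assumed definition, two models can end on the same final score with different decay values if they start from different Round 1 levels.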
Decay extremes. The worst and best decay performers in this run were 73.3 percentage points apart:
- Worst decay: Grok 4 at -90%, meaning nearly all initial instruction commitments collapsed after the Round 2 distractor documents.
- Best decay resistance: 豆包 Pro at -16.7%, the most stable model across the three rounds despite not reaching the top-3 score bracket.
This divergence highlights that raw Round 1 acknowledgment quality does not predict downstream stability. A model can acknowledge constraints cleanly and still lose them once 2,000–5,000-word professional documents are injected in Round 2.
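The spread between the extremes can be checked directly from the decay figures quoted in this report. The sketch below uses only the five values listed above (the remaining six models are not itemized here); the `Result` class and variable names are hypothetical and used purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    decay_pct: float  # as published; negative means adherence was lost


# The five decay figures quoted in this report. The other six models
# in the run are not listed here, so they are left out.
reported = [
    Result("Claude Sonnet 4.6", -30.0),
    Result("Gemini 2.5 Pro", -20.0),
    Result("Qwen3 Max", -30.0),
    Result("Grok 4", -90.0),
    Result("豆包 Pro", -16.7),
]

# The value closest to zero is the most stable; the most negative is the least.
most_stable = max(reported, key=lambda r: r.decay_pct)
least_stable = min(reported, key=lambda r: r.decay_pct)
spread = most_stable.decay_pct - least_stable.decay_pct

print(f"most stable:  {most_stable.model} ({most_stable.decay_pct}%)")
print(f"least stable: {least_stable.model} ({least_stable.decay_pct}%)")
print(f"spread: {spread:.1f} percentage points")
```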
How WDCD works. Each run consists of three rounds: R1 verifies instruction acknowledgment, R2 tests distractor resistance after long professional documents are inserted into context, and R3 performs a final constraint integrity check. The 30-question suite spans five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering. Scoring is fully deterministic and rule-based.
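As an illustration of what deterministic, rule-based checking across the three rounds could look like, here is a minimal sketch. It is not the WDCD implementation: the rule functions, the example constraint (a 100-row cap, in the spirit of the resource_limit scenario), and the sample replies are all hypothetical.

```python
import re
from typing import Callable, Dict

# A rule is a deterministic predicate over a reply: True means the constraint held.
Rule = Callable[[str], bool]

_LIMIT = re.compile(r"\bLIMIT\s+(\d+)", re.IGNORECASE)


def acknowledges_limit(limit: int) -> Rule:
    """Hypothetical R1-style rule: the reply must restate the numeric limit."""
    return lambda reply: re.search(rf"\b{limit}\b", reply) is not None


def respects_limit(limit: int) -> Rule:
    """Hypothetical R2/R3-style rule: every SQL LIMIT clause stays within the cap."""
    def check(reply: str) -> bool:
        found = [int(n) for n in _LIMIT.findall(reply)]
        # A reply with no LIMIT clause at all also counts as a failure.
        return bool(found) and all(n <= limit for n in found)
    return check


def evaluate(replies: Dict[str, str], rules: Dict[str, Rule]) -> Dict[str, bool]:
    """Apply each round's rule to that round's reply; no model-based judge is involved."""
    return {name: rules[name](replies[name]) for name in rules}


rules = {
    "R1": acknowledges_limit(100),
    "R2": respects_limit(100),
    "R3": respects_limit(100),
}

# Hypothetical replies: the constraint is acknowledged, survives the
# distractor round, and is then dropped in the final check.
replies = {
    "R1": "Understood, I will never return more than 100 rows.",
    "R2": "SELECT * FROM orders LIMIT 100;",
    "R3": "SELECT * FROM orders LIMIT 5000;",
}

result = evaluate(replies, rules)
print(result)
```

Because each rule is a plain predicate, the same reply always gets the same verdict, which is the property that rule-based scoring provides over judge models.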
Pattern notes. A 47.7% average decay across 11 models suggests that instruction decay is a structural weakness rather than a model-specific defect. The clustering of three frontier models at an identical 67.5 points also suggests that current top-tier systems hit a similar ceiling on multi-turn commitment when evaluated under deterministic rules; differences emerge primarily in the slope of decay, not the peak.
Full methodology: https://www.winzheng.com/yz-index/methodology
Raw data API: https://www.winzheng.com/yz-index/api/v1/dcd
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.