The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions degrades across multi-turn dialogue. In Run #100, conducted on 2026-05-03, the average instruction decay across 11 tested models reached 39.1% from Round 1 to Round 3, confirming that multi-turn commitment remains an unsolved problem even at the frontier.
WDCD applies a three-round structure: R1 establishes instruction acknowledgment, R2 inserts 2,000–5,000 word professional documents as distractors, and R3 performs a final constraint integrity check. All 30 questions span five real-world scenarios — data_boundary, resource_limit, business_rule, security, and engineering — and are scored entirely by deterministic rules, with zero AI judges in the loop.
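To make "scored entirely by deterministic rules" concrete, here is a minimal sketch of what a rule-based constraint check could look like. The `Constraint` class, the two rule helpers, and the sample replies are hypothetical illustrations, not the benchmark's actual rules or data; only the scenario names come from the report.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Constraint:
    """A hypothetical R1 instruction paired with a rule that checks a reply."""
    name: str
    check: Callable[[str], bool]


def max_words(limit: int) -> Callable[[str], bool]:
    # Rule: the reply must not exceed `limit` whitespace-separated words.
    return lambda reply: len(reply.split()) <= limit


def forbids(pattern: str) -> Callable[[str], bool]:
    # Rule: the reply must not match the given regex (case-insensitive).
    rx = re.compile(pattern, re.IGNORECASE)
    return lambda reply: rx.search(reply) is None


def score_round(reply: str, constraints: list[Constraint]) -> float:
    """Percentage (0-100) of constraints the reply still satisfies."""
    if not constraints:
        return 0.0
    passed = sum(1 for c in constraints if c.check(reply))
    return 100.0 * passed / len(constraints)


# Illustrative constraints, loosely modelled on two of the five scenarios.
constraints = [
    Constraint("resource_limit: at most 20 words", max_words(20)),
    Constraint("data_boundary: never mention customer emails", forbids(r"\bemail")),
]

# Made-up replies: one right after the instruction (R1), one after the distractor (R3).
r1_reply = "Understood. I will keep answers short and omit contact data."
r3_reply = "Per the report, the customer email list shows churn rising."

print(score_round(r1_reply, constraints))  # 100.0
print(score_round(r3_reply, constraints))  # 50.0
```

Because each rule is a pure function of the reply text, the same reply always yields the same score, which is the property that removes the need for an AI judge.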
Top Performers
- Claude Opus 4.7 — 67.5 pts, decay −23%
- GPT-o3 — 66.7 pts, decay −33%
- Claude Sonnet 4.6 — 63.3 pts, decay −30%
Claude Opus 4.7 kept first place by combining the highest absolute score with one of the lowest decay rates in the run. GPT-o3 finished a close second on points but lost more ground between rounds (−33% versus −23%), which suggests strong initial compliance but weaker resilience under distractor load.
Decay Patterns
The spread between the best and worst decay resistance widened in this run. Doubao Pro (豆包 Pro) recorded the strongest resistance, decaying by only 18.2% and outperforming every Western frontier model on the decay-stability axis despite not topping the absolute score table. At the other end, Grok 4 decayed by 74% between R1 and R3, the worst figure in Run #100, indicating that initial instruction acknowledgment did not translate into sustained constraint adherence once long professional documents were introduced in R2.
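The arithmetic behind these figures can be sketched as follows. The report does not spell out the decay formula, so `decay_pct` assumes decay is the relative change from the R1 score to the R3 score; the exact definition is in the linked methodology. The 80 and 60 inputs are hypothetical, while the `reported` values are the decay percentages quoted in this report.

```python
def decay_pct(r1_score: float, r3_score: float) -> float:
    """Relative change from R1 to R3 in percent (negative means decay).

    Assumed definition for illustration; not confirmed by the report.
    """
    if r1_score <= 0:
        raise ValueError("r1_score must be positive")
    return 100.0 * (r3_score - r1_score) / r1_score


# Hypothetical scores, not taken from Run #100.
print(round(decay_pct(80.0, 60.0), 1))  # -25.0

# Decay figures quoted in this report, in percent.
reported = {
    "Doubao Pro": -18.2,
    "Claude Opus 4.7": -23.0,
    "Claude Sonnet 4.6": -30.0,
    "GPT-o3": -33.0,
    "Grok 4": -74.0,
}

best = max(reported, key=reported.get)   # least negative decay
worst = min(reported, key=reported.get)  # most negative decay
spread = round(reported[best] - reported[worst], 1)

print(best, worst, spread)  # Doubao Pro Grok 4 55.8
```

Among the models whose decay is quoted here, the gap between the most and least stable is therefore 55.8 percentage points.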
Across the field, instruction decay correlated more strongly with R2 distractor resistance than with R1 acknowledgment quality. Models that scored well in R1 but lacked stable context anchoring tended to drift on business_rule and data_boundary scenarios by R3, while models with lower headline scores but flatter decay curves (such as Doubao Pro) closed much of the gap by the final round.
Notable Movement
Compared with prior runs, the top three are still Claude- and GPT-family models, but the decay-resistance leaderboard continues to diverge from the absolute-score leaderboard. This is a signal that raw capability and multi-turn commitment are increasingly separable evaluation axes.
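The divergence between the two leaderboards is visible even within the three rows published under Top Performers. The short sketch below ranks those rows once by points and once by decay; the tuple layout and variable names are illustrative choices, and the numbers are the ones listed in this report.

```python
# (model, points, decay in percent) as listed under Top Performers.
rows = [
    ("Claude Opus 4.7", 67.5, -23.0),
    ("GPT-o3", 66.7, -33.0),
    ("Claude Sonnet 4.6", 63.3, -30.0),
]

# Absolute-score leaderboard: highest points first.
by_score = [name for name, _, _ in sorted(rows, key=lambda r: r[1], reverse=True)]

# Decay-resistance leaderboard: least negative decay first.
by_decay = [name for name, _, _ in sorted(rows, key=lambda r: r[2], reverse=True)]

# Models whose position differs between the two leaderboards.
moved = [name for name in by_score if by_score.index(name) != by_decay.index(name)]

print(by_score)  # ['Claude Opus 4.7', 'GPT-o3', 'Claude Sonnet 4.6']
print(by_decay)  # ['Claude Opus 4.7', 'Claude Sonnet 4.6', 'GPT-o3']
print(moved)     # ['GPT-o3', 'Claude Sonnet 4.6']
```

Second and third place swap depending on which axis is used, which is a small instance of the separation described above.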
Full methodology: https://www.winzheng.com/yz-index/methodology
Structured data API: https://www.winzheng.com/yz-index/api/v1/dcd
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original article when republishing.