The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how an AI model's commitment to user instructions degrades over a multi-turn dialogue. Run #164, completed on 2026-06-11, evaluated 11 models across three rounds and recorded an average instruction decay of -44.3% from Round 1 (R1) to Round 3 (R3).
WDCD's three-round protocol tests instruction acknowledgment (R1), distractor resistance after injecting 2,000–5,000-word professional documents (R2), and final constraint integrity (R3). All scoring is 100% rule-based, with zero AI judges in the loop. The 30-question set spans five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering.
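The three-round flow and a rule-based check can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the WDCD harness: the `Turn` type, `run_three_rounds`, the row-limit rule, the prompts, and both stub models are hypothetical, and only the R3 reply is scored here because the report does not specify how R1 and R2 are scored.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


Model = Callable[[List[Turn]], str]


def obeys_row_limit(reply: str, max_rows: int = 100) -> bool:
    """Hypothetical resource_limit rule: the reply must contain at least one
    SQL LIMIT clause, and no LIMIT value may exceed max_rows."""
    limits = [int(n) for n in re.findall(r"\bLIMIT\s+(\d+)", reply, flags=re.IGNORECASE)]
    return bool(limits) and all(n <= max_rows for n in limits)


def run_three_rounds(model: Model, instruction: str, distractor: str,
                     final_task: str, rule: Callable[[str], bool]) -> Dict[str, object]:
    """Send the R1 instruction, the R2 distractor, and the R3 task in order,
    then apply the rule to the R3 reply only (an assumption of this sketch)."""
    history: List[Turn] = []
    replies: List[str] = []
    for prompt in (instruction, distractor, final_task):  # R1, R2, R3
        history.append(Turn("user", prompt))
        reply = model(history)
        history.append(Turn("assistant", reply))
        replies.append(reply)
    return {"replies": replies, "r3_pass": rule(replies[2])}


def _reply_by_round(history: List[Turn], r3_reply: str) -> str:
    """Shared stub logic: pick a canned reply based on how many user turns exist."""
    user_turns = sum(1 for turn in history if turn.role == "user")
    if user_turns == 1:
        return "Understood. I will never return more than 100 rows."
    if user_turns == 2:
        return "Here is a summary of the document."
    return r3_reply


def forgetful_model(history: List[Turn]) -> str:
    """Stub that drops the constraint after the distractor."""
    return _reply_by_round(history, "SELECT * FROM orders LIMIT 5000;")


def steady_model(history: List[Turn]) -> str:
    """Stub that keeps the constraint through R3."""
    return _reply_by_round(history, "SELECT * FROM orders LIMIT 100;")


INSTRUCTION = "Never return more than 100 rows from any query."
DISTRACTOR = "lorem " * 3000  # stand-in for a 2,000-5,000-word document
FINAL_TASK = "Write a query that exports the orders table."
```

Calling `run_three_rounds(forgetful_model, INSTRUCTION, DISTRACTOR, FINAL_TASK, obeys_row_limit)` yields a failing `r3_pass`, while `steady_model` passes. Because the check is a plain function over the reply text, the same input always produces the same verdict, which is the property a rule-based scorer offers over an AI judge.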
Leaderboard — Top 3:
- GPT-5.5 — 88.3 pts (decay: -67%)
- Gemini 3.1 Pro — 87.5 pts (decay: -60%)
- Claude Sonnet 4.6 — 83.3 pts (decay: -57.7%)
The top three models by absolute score all decayed substantially between rounds: their reported decay figures (-57.7% to -67%) are each larger in magnitude than the -44.3% run-wide average. This matches a pattern seen in prior runs, in which peak capability does not guarantee multi-turn commitment. Higher initial compliance often leaves more room to fall once distractor documents are introduced in R2.
Decay extremes:
- Worst decay: GPT-o3 at -24.7%, indicating significant erosion of instruction adherence by R3 despite a stable R1 baseline.
- Best decay resistance: 豆包 Pro (Doubao Pro) at -110%, the strongest resistance figure recorded in this run, meaning its R3 constraint integrity actually exceeded its R1 baseline under the WDCD scoring formula.
The -44.3% run-wide average reinforces a structural finding from previous WDCD runs: instruction decay is not a tail-risk phenomenon but a baseline characteristic of current frontier models. When long professional documents are introduced mid-conversation, the majority of tested models partially or fully release earlier constraints — even when those constraints were explicitly acknowledged in R1.
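To make a percentage like this concrete, here is a minimal sketch that treats decay as the relative change from the R1 score to the R3 score and averages it across models. Both the formula and the scores are assumptions for illustration: the actual WDCD formula is defined in the methodology document and may use a different sign convention or normalization, so these numbers do not reproduce any figure in this report.

```python
from statistics import mean
from typing import Dict, Tuple


def relative_change_pct(r1_score: float, r3_score: float) -> float:
    """Percent change from the R1 score to the R3 score.

    Assumed definition: negative values mean the R3 score is lower than R1.
    """
    if r1_score == 0:
        raise ValueError("R1 score must be non-zero")
    return (r3_score - r1_score) / r1_score * 100.0


# Hypothetical (R1, R3) scores, invented for this example only.
scores: Dict[str, Tuple[float, float]] = {
    "model_a": (90.0, 45.0),
    "model_b": (80.0, 60.0),
    "model_c": (50.0, 50.0),
}

per_model = {name: relative_change_pct(r1, r3) for name, (r1, r3) in scores.items()}
run_average = mean(list(per_model.values()))
```

Under this assumed definition, a model that holds every constraint through R3 scores 0%, and the run-wide figure is the unweighted mean of the per-model values.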
Compared to prior runs, the relative ordering at the top remains consistent, but the gap between raw capability scores and decay-resistance scores continues to widen. This suggests that multi-turn commitment is emerging as a distinct evaluation axis, not a derivative of single-turn capability. Models optimized for benchmark accuracy do not automatically inherit constraint persistence.
For scenario-level breakdowns, scoring rubrics, and the full R1/R2/R3 protocol specification, see the WDCD methodology document: https://www.winzheng.com/yz-index/methodology
Raw run data is available via the WDCD data API: https://www.winzheng.com/yz-index/api/v1/dcd
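A minimal sketch for pulling the raw data with the Python standard library follows. It assumes the endpoint answers a plain GET with a UTF-8 JSON body and needs no authentication or query parameters. None of that is confirmed here, and the response schema is not documented in this report, so the function returns the decoded payload without interpreting it. The `fetch_run_data` name and its `opener` parameter are hypothetical.

```python
import json
import urllib.request
from typing import Any, Callable

WDCD_API_URL = "https://www.winzheng.com/yz-index/api/v1/dcd"


def fetch_run_data(url: str = WDCD_API_URL,
                   opener: Callable[..., Any] = urllib.request.urlopen,
                   timeout: float = 10.0) -> Any:
    """GET the endpoint and decode the body as JSON (assumed response format).

    `opener` is injectable so the function can be exercised without network access.
    """
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_run_data()` with no arguments issues the real request. Inspect the returned structure before relying on any field names, and expect an exception if the server returns an error status or a non-JSON body.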
© 2026 Winzheng.com 赢政天下 | Please credit the source and link to the original article when reposting.