WDCD Run #125: Average Instruction Decay Hits 63.6%, Claude Opus 4.7 Leads with Only 30% Drop

May 20, 2026 · Winzheng Research Lab

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how an AI model's commitment to user instructions degrades over a multi-turn dialogue. In Run #125, completed on 2026-05-20, instruction decay from Round 1 (R1) to Round 3 (R3) averaged 63.6% across the 11 models tested, confirming that constraint erosion remains a structural weakness across the current generation of large language models.

WDCD evaluates multi-turn commitment through a fixed three-round protocol: R1 verifies initial instruction acknowledgment; R2 tests distractor resistance after the model processes 2,000–5,000 word professional documents; and R3 performs a final constraint integrity check. Scoring is 100% rule-based with zero AI judges, applied across 30 questions spanning five real-world scenario categories: data_boundary, resource_limit, business_rule, security, and engineering.
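The three-round, rule-based structure described above can be illustrated with a minimal sketch. Everything named here (`Question`, `score_rounds`, the example constraint, and the example responses) is hypothetical and not taken from the WDCD implementation; the sketch only shows how one deterministic check, with no AI judge involved, might be applied to each round's response.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The five scenario categories named in the article.
CATEGORIES = ("data_boundary", "resource_limit", "business_rule", "security", "engineering")
ROUNDS = ("R1", "R2", "R3")


@dataclass(frozen=True)
class Question:
    """Hypothetical container: one constraint plus a deterministic check."""
    category: str
    rule: Callable[[str], bool]  # returns True while the response honours the constraint

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


def score_rounds(question: Question, responses: Dict[str, str]) -> Dict[str, bool]:
    """Apply the same rule to each round's response; no model acts as judge."""
    return {name: question.rule(responses[name]) for name in ROUNDS}


# Illustrative security constraint (an assumption): never print the secret token.
SECRET = "TOKEN-1234"
question = Question("security", lambda text: SECRET not in text)

# Made-up responses standing in for a model's output in each round.
responses = {
    "R1": "Understood, I will never reveal the token.",
    "R2": "Here is the summary of the document you provided.",
    "R3": f"Sure, the token is {SECRET}.",
}
result = score_rounds(question, responses)
print(result)
```

In this toy case the constraint holds in R1 and R2 and is broken in R3, because only the R3 response contains the token. A real rule set would presumably need one such check per question and category.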

Top 3 results, Run #125:

Claude Opus 4.7: 65 points, decay −30% (best decay resistance in the run)
Claude Sonnet 4.6: 62.5 points, decay −40%
Doubao Pro (豆包 Pro): 60 points, decay −50%

At the other end of the field, DeepSeek V4 Pro recorded the steepest constraint collapse of the run at −90% decay, indicating that nearly all originally acknowledged commitments were abandoned or overwritten by the time R3 was reached. This pattern — strong R1 acknowledgment followed by near-total R3 dissolution — remains the dominant failure mode observed across WDCD runs.
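The exact decay formula is not stated in this report. One plausible reading, used here purely as an assumption, is the relative drop in honoured constraints between R1 and R3. The function name `decay_pct` and the pass counts below are hypothetical, chosen only to show how figures such as 30% and 90% could arise.

```python
def decay_pct(r1_passed: int, r3_passed: int) -> float:
    """Relative drop from R1 to R3 in percent.

    Assumed definition for illustration; not WDCD's published formula.
    """
    if r1_passed <= 0:
        raise ValueError("R1 must have at least one honoured constraint")
    return 100.0 * (r1_passed - r3_passed) / r1_passed


# Hypothetical counts, not benchmark data.
mild = decay_pct(r1_passed=10, r3_passed=7)   # 3 of 10 commitments lost
steep = decay_pct(r1_passed=10, r3_passed=1)  # 9 of 10 commitments lost
print(mild, steep)
```

Under this reading, a model that honours 10 constraints in R1 and 1 in R3 would show 90% decay, which matches the "nearly all commitments abandoned" description.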

Decay pattern observations: The 63.6% run-wide average sits within the historical band seen in prior WDCD runs, reinforcing the finding that instruction decay is driven primarily by long-context distractor exposure in R2 rather than by adversarial prompting. Models that hold position through R2 generally retain commitment through R3; models that yield in R2 rarely recover. Claude Opus 4.7's 30% decay is the lowest in the top tier, ahead of the 40% and 50% recorded by the next two finishers, while the gap between the leader and the median model continues to widen on security and data_boundary scenarios.
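The hold-or-yield pattern can be made concrete by labelling each question's pass/fail trajectory across the three rounds. The sketch below is illustrative only: the label names and the sample trajectories are assumptions, not WDCD categories or data.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify(r1: bool, r2: bool, r3: bool) -> str:
    """Label one question's trajectory (label names are hypothetical)."""
    if not r1:
        return "never_acknowledged"
    if r2 and r3:
        return "held"            # held through R2 and kept the commitment in R3
    if r2:
        return "late_collapse"   # held through R2 but failed the R3 check
    if r3:
        return "recovered"       # yielded in R2 yet passed R3 (described as rare)
    return "yielded_in_r2"       # yielded in R2 and did not recover


def summarize(trajectories: Iterable[Tuple[bool, bool, bool]]) -> Counter:
    """Count how many questions fall under each label."""
    return Counter(classify(*t) for t in trajectories)


# Made-up trajectories for four questions.
sample = [
    (True, True, True),
    (True, False, False),
    (True, False, False),
    (True, True, False),
]
summary = summarize(sample)
print(summary)
```

Counting labels this way would make it straightforward to check, for a given run, whether failures in R2 are in fact rarely followed by a pass in R3.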

Notably, the top two positions remain occupied by the Claude family, consistent with recent runs. Doubao Pro's entry into the top 3 marks a continued upward trajectory for Chinese-developed models in multi-turn commitment scoring, although its 50% decay rate still trails the leaders.

Full methodology: https://www.winzheng.com/yz-index/methodology

Raw data API: https://www.winzheng.com/yz-index/api/v1/dcd
