WDCD Run #100: Average Instruction Decay Hits 39.1% Across 11 Models, Claude Opus 4.7 Leads

May 5, 2026 | Winzheng Research Lab

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions degrades across multi-turn dialogue. In Run #100, conducted on 2026-05-03, the average instruction decay across 11 tested models reached 39.1% from Round 1 to Round 3, confirming that multi-turn commitment remains an unsolved problem even at the frontier.

WDCD applies a three-round structure: R1 establishes instruction acknowledgment, R2 inserts 2,000–5,000 word professional documents as distractors, and R3 performs a final constraint integrity check. All 30 questions span five real-world scenarios — data_boundary, resource_limit, business_rule, security, and engineering — and are scored entirely by deterministic rules, with zero AI judges in the loop.
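To make "scored entirely by deterministic rules" concrete, here is a minimal Python sketch of what a rule-based check on one item could look like. It is not the WDCD harness. The constraint (a 100-row cap for a resource_limit item), the regular expression, and the function names are all hypothetical, chosen only to show a check that needs no AI judge.

```python
import re

# Hypothetical constraint: in R1 the model is told never to request
# more than 100 rows per query.
ROW_CAP = 100
LIMIT_RE = re.compile(r"\bLIMIT\s+(\d+)", re.IGNORECASE)


def respects_row_cap(reply: str) -> bool:
    """Rule: the reply must contain a LIMIT clause, and every one must be <= ROW_CAP."""
    limits = [int(n) for n in LIMIT_RE.findall(reply)]
    return bool(limits) and all(n <= ROW_CAP for n in limits)


def score_item(r1_reply: str, r3_reply: str) -> dict:
    """Check the same rule on the R1 reply and on the R3 reply
    (the latter produced after the R2 distractor document)."""
    return {"r1": respects_row_cap(r1_reply), "r3": respects_row_cap(r3_reply)}
```

Because the rule is a pure function of the reply text, the same transcript always receives the same result, which is the property a judge-free benchmark relies on.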

Top Performers

Claude Opus 4.7 — 67.5 pts, decay −23%
GPT-o3 — 66.7 pts, decay −33%
Claude Sonnet 4.6 — 63.3 pts, decay −30%

Claude Opus 4.7 retained the leadership position by combining the highest absolute score with one of the lowest decay rates in the run. GPT-o3 finished a close second on points but lost more ground between rounds, suggesting stronger initial compliance but weaker resilience under distractor load.

Decay Patterns

The spread between best and worst decay resistance widened in this run. Doubao Pro (豆包 Pro) recorded the strongest resistance, with decay of only −18.2%, outperforming every Western frontier model on the decay-stability axis despite not topping the absolute score table. At the other end, Grok 4 recorded a decay of −74% between R1 and R3, the worst figure in Run #100, indicating that initial instruction acknowledgment did not translate into sustained constraint adherence once long professional documents were introduced in R2.
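The article does not spell out how the decay percentage is computed, so the following is only one plausible reading: decay as the relative change in score from R1 to R3. The function name and the sample scores are hypothetical and are not WDCD data.

```python
def decay_pct(r1_score: float, r3_score: float) -> float:
    """Relative change from R1 to R3 in percent; a negative value means the score dropped.

    Assumption: WDCD may define decay differently (for example in percentage points).
    """
    if r1_score <= 0:
        raise ValueError("R1 score must be positive to compute a relative change")
    return (r3_score - r1_score) / r1_score * 100.0


# Hypothetical per-round scores, for illustration only.
print(round(decay_pct(80.0, 60.0), 1))  # -25.0
```

Under this reading, a model that scores modestly in R1 but holds that score through R3 shows a decay near zero, which matches the "flatter decay curve" described for the more stable models.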

Across the field, instruction decay correlated more strongly with R2 distractor resistance than with R1 acknowledgment quality. Models that scored well in R1 but lacked stable context anchoring tended to drift on business_rule and data_boundary scenarios by R3, while models with lower headline scores but flatter decay curves (such as Doubao Pro) closed much of the gap by the final round.

Notable Movement

Compared with prior runs, the top three remained Claude- and GPT-family models, but the decay-resistance leaderboard continues to diverge from the absolute-score leaderboard — a signal that raw capability and multi-turn commitment are increasingly separable evaluation axes.
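The divergence between the two leaderboards is visible even within the top three. The short Python sketch below ranks the three reported models once by score and once by decay, using only the figures listed under Top Performers. The data structure and the `rank_by` helper are illustrative and are not part of any WDCD tooling. Doubao Pro and Grok 4 are left out because their absolute scores are not given here.

```python
# Scores and decay figures for the top three as reported in Run #100.
results = [
    {"model": "Claude Opus 4.7", "score": 67.5, "decay": -23.0},
    {"model": "GPT-o3", "score": 66.7, "decay": -33.0},
    {"model": "Claude Sonnet 4.6", "score": 63.3, "decay": -30.0},
]


def rank_by(rows, key):
    """Return model names ordered best to worst; a higher value is better."""
    return [r["model"] for r in sorted(rows, key=lambda r: r[key], reverse=True)]


by_score = rank_by(results, "score")
by_decay = rank_by(results, "decay")  # least negative decay ranks first

print(by_score)  # ['Claude Opus 4.7', 'GPT-o3', 'Claude Sonnet 4.6']
print(by_decay)  # ['Claude Opus 4.7', 'Claude Sonnet 4.6', 'GPT-o3']
```

GPT-o3 and Claude Sonnet 4.6 swap places between the two orderings, a small instance of the separation between raw score and multi-turn commitment.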

Full methodology: https://www.winzheng.com/yz-index/methodology
Structured data API: https://www.winzheng.com/yz-index/api/v1/dcd
