WDCD Run #164: Average Instruction Decay Hits -44.3% Across 11 Frontier Models

WDCD Run #164 (2026-06-11) evaluated 11 frontier models across three dialogue rounds, recording an average instruction decay of -44.3% from Round 1 (R1) to Round 3 (R3). GPT-5.5 led the leaderboard at 88.3 points, while Doubao (豆包) Pro showed the strongest decay resistance.

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how far an AI model's commitment to user instructions degrades over a multi-turn dialogue. Run #164, completed on 2026-06-11, evaluated 11 models across three rounds; the average instruction decay from Round 1 to Round 3 was -44.3%.
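To make the headline number concrete, here is a minimal sketch of how a percentage decay could be computed. It assumes decay is the relative change from the R1 score to the R3 score, averaged per model. That definition, the function names, and the sample scores are all illustrative assumptions; the authoritative formula is the one in the WDCD methodology document.

```python
from statistics import mean

def relative_decay(r1: float, r3: float) -> float:
    """Percent change from the R1 score to the R3 score (ASSUMED definition).

    Negative values mean the model honored fewer constraints in R3 than in R1.
    """
    if r1 <= 0:
        raise ValueError("R1 score must be positive to compute a relative change")
    return (r3 - r1) / r1 * 100.0

def average_decay(scores: dict[str, tuple[float, float]]) -> float:
    """Mean of per-model decays, given {model: (r1_score, r3_score)}."""
    return mean([relative_decay(r1, r3) for r1, r3 in scores.values()])

# Hypothetical scores for illustration only; these are NOT Run #164 data.
hypothetical_scores = {
    "model_a": (90.0, 50.0),
    "model_b": (80.0, 60.0),
}

run_average = average_decay(hypothetical_scores)
print(f"average decay: {run_average:.1f}%")
```

Under this assumed definition, a model with non-negative scores cannot decay below -100%, so the published WDCD formula may differ in detail from the sketch.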

WDCD's three-round protocol tests instruction acknowledgment (R1), distractor resistance after injecting 2,000–5,000-word professional documents (R2), and final constraint integrity (R3). All scoring is 100% rule-based, with zero AI judges in the loop. The 30-question set spans five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering.
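The rule-based scoring described above can be illustrated with a small sketch. It shows one deterministic check applied to a model's reply in each round, with no AI judge involved. The `Rule` class, the `max_rows_rule` helper, the SQL-style constraint, and the sample replies are all hypothetical; they are not taken from the WDCD question set or rubric.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    scenario: str                  # e.g. "resource_limit", one of the five scenarios
    description: str
    check: Callable[[str], bool]   # True when the reply still honors the constraint

def max_rows_rule(limit: int) -> Rule:
    """Hypothetical resource_limit rule: every SQL LIMIT in the reply must be <= limit."""
    def check(reply: str) -> bool:
        found = [int(n) for n in re.findall(r"\bLIMIT\s+(\d+)", reply, flags=re.IGNORECASE)]
        # A reply with no LIMIT clause at all also counts as a violation.
        return bool(found) and all(n <= limit for n in found)
    return Rule("resource_limit", f"LIMIT must not exceed {limit}", check)

def score_rounds(rule: Rule, replies: dict[str, str]) -> dict[str, bool]:
    """Apply the same deterministic check to each round's reply."""
    return {round_name: rule.check(text) for round_name, text in replies.items()}

# Illustrative replies: the constraint holds in R1 and R2, then is dropped in R3.
replies = {
    "R1": "Understood. SELECT id FROM orders LIMIT 100;",
    "R2": "Based on the report: SELECT id FROM orders LIMIT 100;",
    "R3": "Here is the full export: SELECT id FROM orders LIMIT 5000;",
}

results = score_rounds(max_rows_rule(100), replies)
print(results)
```

Because the check is a pure function of the reply text, rerunning it on the same transcript always yields the same verdict, which is the property a judge-free protocol relies on.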

Leaderboard — Top 3:

  • GPT-5.5 — 88.3 pts (decay: -67%)
  • Gemini 3.1 Pro — 87.5 pts (decay: -60%)
  • Claude Sonnet 4.6 — 83.3 pts (decay: -57.7%)

The top three models posted the highest absolute scores, yet all three also decayed substantially between rounds: their decay figures (-57.7% to -67%) are all steeper than the -44.3% run-wide average. This matches a pattern seen in prior runs, in which peak capability does not guarantee multi-turn commitment. Higher initial compliance often leaves more room to fall once distractor documents are introduced in R2.
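As a quick arithmetic check on that comparison, the sketch below averages the three decay figures from the leaderboard and compares the result with the run-wide average. The numbers are the ones reported above; the variable names are illustrative.

```python
from statistics import mean

# Decay figures as listed in the Top 3 leaderboard (percent).
top3_decay = {
    "GPT-5.5": -67.0,
    "Gemini 3.1 Pro": -60.0,
    "Claude Sonnet 4.6": -57.7,
}
RUN_AVERAGE = -44.3  # run-wide average decay across all 11 models (percent)

top3_mean = mean(list(top3_decay.values()))
gap = top3_mean - RUN_AVERAGE  # negative: the top three decayed more than average
all_steeper = all(d < RUN_AVERAGE for d in top3_decay.values())

print(f"top-3 mean decay: {top3_mean:.1f}%")
print(f"gap vs run average: {gap:.1f} points")
print(f"every top-3 model decayed more than average: {all_steeper}")
```

The top-three mean comes out near -61.6%, roughly 17 points steeper than the run average.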

Decay extremes:

  • Worst decay: GPT-o3 at -24.7%, indicating significant erosion of instruction adherence by R3 despite a stable R1 baseline.
  • Best decay resistance: Doubao (豆包) Pro at -110%, the strongest resistance metric recorded in this run, meaning its R3 constraint integrity actually exceeded its R1 baseline under the WDCD scoring formula.

The -44.3% run-wide average reinforces a structural finding from previous WDCD runs: instruction decay is not a tail-risk phenomenon but a baseline characteristic of current frontier models. When long professional documents are introduced mid-conversation, the majority of tested models partially or fully release earlier constraints — even when those constraints were explicitly acknowledged in R1.

Compared to prior runs, the relative ordering at the top remains consistent, but the gap between raw capability scores and decay-resistance scores continues to widen. This suggests that multi-turn commitment is emerging as a distinct evaluation axis, not a derivative of single-turn capability. Models optimized for benchmark accuracy do not automatically inherit constraint persistence.

For scenario-level breakdowns, scoring rubrics, and the full R1/R2/R3 protocol specification, see the WDCD methodology document: https://www.winzheng.com/yz-index/methodology

Raw run data is available via the WDCD data API: https://www.winzheng.com/yz-index/api/v1/dcd
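A minimal sketch of fetching the raw data with Python's standard library follows. It assumes the endpoint answers a plain GET with a JSON body; the response format, any query parameters, and any authentication requirements are not stated here, so treat the helper names and the JSON assumption as hypothetical and check the methodology document first.

```python
import json
import urllib.request

WDCD_API_URL = "https://www.winzheng.com/yz-index/api/v1/dcd"

def build_request(url: str = WDCD_API_URL) -> urllib.request.Request:
    """Prepare a GET request that asks for JSON (ASSUMED response format)."""
    return urllib.request.Request(url, headers={"Accept": "application/json"})

def fetch_run_data(timeout: float = 10.0):
    """Download and decode the payload; raises on HTTP errors or non-JSON bodies."""
    with urllib.request.urlopen(build_request(), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

if __name__ == "__main__":
    data = fetch_run_data()
    # The schema is unknown here, so only inspect the top-level shape.
    print(type(data).__name__)
```

The sketch only reports the top-level type of the payload, since no field names are documented here.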