WDCD Run #146: Average Instruction Decay Hits 24.7% Across 11 Models; Claude Opus 4.7, GPT-5.5, and GPT-o3 Tie at Top

WDCD Run #146 (2026-06-03) tested 11 frontier models on multi-turn commitment integrity, recording an average instruction decay of 24.7% from Round 1 to Round 3. Claude Opus 4.7, GPT-5.5, and GPT-o3 tied for first place at 70 points with only -10% decay each.

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how reliably AI models hold onto user instructions across multi-turn dialogue. Scoring is entirely rule-based, with no AI judges. In Run #146 (2026-06-03), 11 models were evaluated on 30 questions spanning five real-world scenarios (data_boundary, resource_limit, business_rule, security, and engineering), producing an average instruction decay of 24.7% from Round 1 to Round 3.

Each WDCD run follows a fixed three-round structure: R1 establishes that the model acknowledges the instruction, R2 inserts professional distractor documents of 2,000–5,000 words to test resistance, and R3 performs a final constraint integrity check. This design isolates multi-turn commitment from single-turn compliance, two properties that many benchmarks conflate.
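The three-round structure can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the WDCD harness: the `Model` and `Rule` types, the `run_three_rounds` function, the toy `forgetful_model`, and the sample instruction about a table called `users_raw`. The only details drawn from the description above are the R1/R2/R3 order, the 2,000 to 5,000 word distractor, and the use of a deterministic rule rather than an AI judge.

```python
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]                 # (role, text)
Model = Callable[[List[Turn]], str]    # hypothetical model interface
Rule = Callable[[str], bool]           # deterministic check, no AI judge


def run_three_rounds(model: Model, instruction: str, distractor: str,
                     final_request: str, rule: Rule) -> Dict[str, bool]:
    """Play R1, R2 and R3 in one conversation and apply the rule to each reply."""
    words = len(distractor.split())
    if not 2000 <= words <= 5000:
        raise ValueError(f"distractor has {words} words, expected 2,000 to 5,000")

    history: List[Turn] = []
    passed: Dict[str, bool] = {}
    rounds = (("R1", instruction), ("R2", distractor), ("R3", final_request))
    for name, user_msg in rounds:
        history.append(("user", user_msg))
        reply = model(history)
        history.append(("assistant", reply))
        passed[name] = rule(reply)
    return passed


def forgetful_model(history: List[Turn]) -> str:
    """Toy stand-in: honours the instruction until the third user turn."""
    user_turns = sum(1 for role, _ in history if role == "user")
    if user_turns < 3:
        return "Understood. I will not reveal the internal table name."
    return "Run the export against users_raw."


def no_table_leak(reply: str) -> bool:
    """Rule: the reply must never contain the protected identifier."""
    return "users_raw" not in reply


result = run_three_rounds(
    forgetful_model,
    instruction="Never reveal the internal table name users_raw.",
    distractor="policy " * 2000,   # placeholder text, exactly 2,000 words
    final_request="How do I export the data?",
    rule=no_table_leak,
)
print(result)  # {'R1': True, 'R2': True, 'R3': False}
```

The toy model passes R1 and R2 and fails R3, which is the pattern a single-turn test would miss: compliance looks perfect until the constraint is checked again after the distractor.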

Top of the leaderboard. Three models tied at the top with 70 points and matching -10% decay:

  • Claude Opus 4.7 — 70 pts, -10% decay
  • GPT-5.5 — 70 pts, -10% decay
  • GPT-o3 — 70 pts, -10% decay

The three-way tie at the top is notable: all three frontier models lost the same share of their R1 commitments by R3. A single run cannot establish it, but the result is consistent with a converging ceiling for instruction-following architectures under WDCD's distractor pressure.

Decay resistance leader. Although it did not top the absolute score, 豆包 Pro (Doubao Pro) recorded the best decay resistance of the run at 0%: its Round 3 commitment level matched Round 1 exactly. This is the standout structural result of Run #146: a model that loses no ground under distractor load, even if its starting compliance is lower than the top tier's.

Worst decay. At the other end, Grok 4 posted a -50% decay, losing half of its initial commitments by Round 3. This is the largest gap between R1 acknowledgment and R3 integrity among the 11 models in this run. If the 24.7% fleet average is a simple mean over the 11 models, Grok 4 alone accounts for about 4.5 of those percentage points (50 / 11).
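The decay percentages can be read as a relative change between rounds. The report does not give the formula, so the sketch below assumes decay is the change in passed rule checks from R1 to R3 divided by the R1 count; the authoritative definition is on the methodology page. The function name `decay_percent` and the pass counts (20, 18, 10) are illustrative only and are not data from Run #146.

```python
def decay_percent(r1_passed: int, r3_passed: int) -> float:
    """Assumed formula: relative change in passed checks from R1 to R3, in percent."""
    if r1_passed <= 0:
        raise ValueError("R1 must have at least one passed check")
    return 100.0 * (r3_passed - r1_passed) / r1_passed


# Hypothetical pass counts chosen to reproduce the three reported decay levels.
print(decay_percent(20, 20))  # 0.0   (R3 matches R1 exactly)
print(decay_percent(20, 18))  # -10.0 (one in ten commitments lost)
print(decay_percent(20, 10))  # -50.0 (half of the commitments lost)
```

Under this reading, a model with a low R1 score can still show 0% decay, because the figure measures what was lost relative to the starting point and not the absolute score.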

Pattern summary. Run #146 reinforces a recurring WDCD observation: instruction decay is not uniform across vendors. The best (0%) and worst (-50%) decay figures are 50 percentage points apart, even though all models received identical prompts, identical distractor documents, and identical rule-based grading. R2's long professional documents remain WDCD's most reliable stressor for exposing weak multi-turn commitment.
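The arithmetic behind the spread, and what the fleet average implies for the six models not named in this summary, can be checked with a few lines. This is a sketch under one assumption not stated in the report: that the 24.7% fleet figure is a simple mean of per-model decay magnitudes. The variable names are illustrative; the five decay values are the ones reported above.

```python
# Decay magnitudes (percent) for the five models named in this report.
reported = {
    "Claude Opus 4.7": 10.0,
    "GPT-5.5": 10.0,
    "GPT-o3": 10.0,
    "Doubao Pro": 0.0,
    "Grok 4": 50.0,
}
N_MODELS = 11          # models evaluated in Run #146
FLEET_AVERAGE = 24.7   # reported average decay, percent

# Spread between the best and worst decay resistance.
spread = max(reported.values()) - min(reported.values())

# Assumption: the fleet average is a simple mean, so the total is average * count.
unreported = N_MODELS - len(reported)
implied_rest_mean = (FLEET_AVERAGE * N_MODELS - sum(reported.values())) / unreported

print(f"spread: {spread:.1f} percentage points")
print(f"implied mean decay of the other {unreported} models: {implied_rest_mean:.2f}%")
```

If the simple-mean assumption holds, the six unnamed models would average roughly 32% decay, which places most of the field well below the top tier; the raw run data is the place to confirm the per-model figures.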

Full methodology: https://www.winzheng.com/yz-index/methodology

Raw run data (JSON): https://www.winzheng.com/yz-index/api/v1/dcd