WDCD Run #185: Average Instruction Decay Hits -57.5% Across 11 Models, Qwen3 Max Leads at 92.5 Points

WDCD Run #185 (2026-06-17) measured multi-turn commitment across 11 models, recording an average instruction decay of -57.5% from Round 1 to Round 3. Qwen3 Max topped the run at 92.5 points, while ERNIE Bot 4.5 (文心一言 4.5) was credited with the strongest decay resistance.

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions degrades across multi-turn dialogue. In Run #185, executed on 2026-06-17 against 11 models, the average commitment decay from Round 1 to Round 3 reached -57.5%, confirming that instruction decay remains the dominant failure mode in extended professional conversations.

WDCD evaluates models across three rounds: R1 establishes instruction acknowledgment, R2 inserts professional documents of 2,000 to 5,000 words as distractors, and R3 performs a final constraint integrity check. Scoring is entirely rule-based, with no AI judges, and is applied across 30 questions spanning five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering.
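To make the idea of rule-based round scoring concrete, here is a minimal sketch of a constraint that is checked with a plain rule rather than a model judge. The `Constraint` class, the row-limit rule, and the sample replies are hypothetical illustrations; the actual WDCD rules are described in the published methodology and are not reproduced here.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict

# The five scenario names come from the article; everything else is assumed.
SCENARIOS = ("data_boundary", "resource_limit", "business_rule",
             "security", "engineering")


@dataclass
class Constraint:
    scenario: str
    description: str
    check: Callable[[str], bool]  # pure rule: True if the reply honors it


def row_limit_rule(limit: int) -> Callable[[str], bool]:
    """Hypothetical rule: the reply must contain at least one LIMIT clause,
    and every LIMIT value must be at or below `limit`."""
    def check(reply: str) -> bool:
        found = re.findall(r"\bLIMIT\s+(\d+)", reply, flags=re.IGNORECASE)
        values = [int(n) for n in found]
        return bool(values) and all(n <= limit for n in values)
    return check


def score_rounds(constraint: Constraint, replies: Dict[str, str]) -> Dict[str, bool]:
    """Apply the same rule to each scored round and report pass/fail."""
    return {name: constraint.check(text) for name, text in replies.items()}


constraint = Constraint(
    scenario="data_boundary",
    description="Never return more than 100 rows",
    check=row_limit_rule(100),
)

# R2 is the distractor round (a long document), so only R1 and R3 are scored here.
replies = {
    "R1": "Understood. SELECT id FROM orders LIMIT 100",
    "R3": "Here is the full export: SELECT id FROM orders LIMIT 5000",
}

result = score_rounds(constraint, replies)
print(result)
```

Because the check is a deterministic function of the reply text, the same reply always receives the same verdict, which is the property a rule-based rubric is meant to provide.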

Run #185 Top 3:

  • Qwen3 Max — 92.5 points, -90% decay
  • Claude Sonnet 4.6 — 90 points, -80% decay
  • DeepSeek V4 Pro — 87.5 points, -70% decay

The top-ranked models cluster tightly on absolute score but still show substantial commitment loss between R1 and R3. Qwen3 Max's -90% decay indicates that even the run's highest scorer surrenders most of its initial constraint adherence once long-document distractors are introduced in R2. This pattern is consistent with prior WDCD runs, in which absolute accuracy and decay magnitude do not move in lockstep.

At the opposite end, GPT-o3's -10%, which the run reports as the worst decay result, is numerically the smallest drop in the cohort, and it reflects a low ceiling rather than robustness: its R1 baseline was already constrained, leaving less room to decay. The clearest counter-trend came from ERNIE Bot 4.5 (文心一言 4.5), which the run credits with the strongest decay resistance at -111.1%. A magnitude above 100% indicates that the model's measured commitment inverted relative to its R1 baseline under the WDCD scoring rubric, making it the most notable behavioral outlier of this run.
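The arithmetic behind these percentages can be sketched as follows, assuming decay is the relative change from the R1 score to the R3 score. That formula is an assumption for illustration, not the published WDCD definition, and the input scores are invented; they are not the models' actual round scores.

```python
def decay_percent(r1: float, r3: float) -> float:
    """Relative change from R1 to R3, in percent (assumed formula).

    Negative values mean the R3 score is lower than the R1 score.
    """
    if r1 <= 0:
        raise ValueError("R1 score must be positive for a relative change")
    return (r3 - r1) / r1 * 100.0


# Hypothetical (R1, R3) pairs chosen to reproduce the three patterns discussed.
examples = {
    "high ceiling, large drop": (10.0, 1.0),
    "low ceiling, small drop": (2.0, 1.8),
    "drop past zero (inversion)": (9.0, -1.0),
}

for label, (r1, r3) in examples.items():
    print(f"{label}: {decay_percent(r1, r3):.1f}%")
```

Under this assumed formula, a value below -100% is only possible when the R3 score falls below zero, which in turn assumes a rubric that can assign penalty points. A small percentage can also coexist with a weak result when the R1 baseline is already low.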

Across the 11-model cohort, the -57.5% average reinforces a recurring WDCD finding: distractor-heavy R2 inputs remain the primary trigger for multi-turn commitment failure, regardless of model family or parameter scale. Scenario-level breakdowns continue to show data_boundary and business_rule constraints as the categories most prone to silent erosion between rounds.
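A scenario-level breakdown of this kind could be computed along the following lines. The records are invented placeholder values, the relative-change formula is an assumption rather than the published WDCD definition, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-question records: (scenario, R1 score, R3 score).
records = [
    ("data_boundary", 1.0, 0.2),
    ("data_boundary", 1.0, 0.4),
    ("security", 1.0, 0.9),
    ("business_rule", 1.0, 0.5),
]


def mean_decay_by_scenario(rows):
    """Group records by scenario and average the assumed R1-to-R3 relative change."""
    grouped = defaultdict(list)
    for scenario, r1, r3 in rows:
        grouped[scenario].append((r3 - r1) / r1 * 100.0)
    return {scenario: round(mean(values), 1) for scenario, values in grouped.items()}


breakdown = mean_decay_by_scenario(records)

# The most negative average marks the scenario with the largest commitment loss.
most_eroded = min(breakdown, key=breakdown.get)
print(breakdown, most_eroded)
```

Averaging per scenario before comparing keeps a category with many questions from dominating the ranking simply because it has more rows.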

Full scoring methodology: https://www.winzheng.com/yz-index/methodology

Run #185 raw data API: https://www.winzheng.com/yz-index/api/v1/dcd