WDCD Run #185: Average Instruction Decay Hits -57.5% Across 11 Models, Qwen3 Max Leads at 92.5 Points

Jun 17, 2026 | approx. 6 min | Winzheng Research Lab

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions degrades across multi-turn dialogue. In Run #185, executed on 2026-06-17 against 11 models, the average commitment decay from Round 1 to Round 3 reached -57.5%, confirming that instruction decay remains the dominant failure mode in extended professional conversations.
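The article does not spell out how the decay percentage is derived. A minimal sketch, assuming it is the relative change in commitment score from R1 to R3; the function name `decay_pct` and the sample scores are hypothetical and not taken from Run #185:

```python
def decay_pct(r1_score: float, r3_score: float) -> float:
    """Relative change from the R1 score to the R3 score, in percent.

    Assumption: decay = (R3 - R1) / R1 * 100. The published methodology
    may define it differently.
    """
    if r1_score == 0:
        raise ValueError("R1 score must be non-zero to compute a relative change")
    return (r3_score - r1_score) * 100 / r1_score


# Hypothetical scores: a model that retains 8 of its 80 R1 points by R3.
print(decay_pct(80, 8))  # -90.0
```

Under this assumed definition, 0% means no loss between rounds and -100% means the R3 score fell to zero.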

WDCD evaluates each model across three rounds: R1 establishes instruction acknowledgment, R2 inserts professional documents of 2,000 to 5,000 words as distractors, and R3 performs a final constraint integrity check. Scoring is 100% rule-based, with no AI judges, and is applied across 30 questions spanning five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering.
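To make the idea of rule-based scoring concrete, here is a rough sketch of how a single question could be checked in R1 and R3 without an AI judge. Everything in it is an assumption for illustration: the `Rule` class, the regex-based check, the `max_workers` constraint, and the sample replies are hypothetical, and the real WDCD rules may look quite different. The R2 distractor document is sent between the two replies and is not scored here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One hard constraint, checked with a regex instead of an AI judge."""
    name: str
    scenario: str   # one of the five scenario labels, e.g. "resource_limit"
    violation: str  # regex that matches a broken commitment


def passes(rule: Rule, response: str) -> bool:
    """A response passes when the violation pattern is absent."""
    return re.search(rule.violation, response, flags=re.IGNORECASE) is None


def run_case(rule: Rule, r1_reply: str, r3_reply: str) -> dict:
    """Check the R1 acknowledgment and the R3 integrity reply for one question."""
    return {
        "rule": rule.name,
        "r1_pass": passes(rule, r1_reply),
        "r3_pass": passes(rule, r3_reply),
    }


# Hypothetical constraint: the user asked for at most 4 workers.
rule = Rule("max_workers", "resource_limit", r"workers\s*=\s*([5-9]|\d{2,})")
result = run_case(
    rule,
    r1_reply="Understood, I will keep workers=4 or lower.",
    r3_reply="Given the document, set workers=16 for throughput.",
)
print(result)  # {'rule': 'max_workers', 'r1_pass': True, 'r3_pass': False}
```

A deterministic check like this gives the same verdict on every run, which is the main appeal of avoiding AI judges; the trade-off is that each constraint needs a hand-written pattern.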

Run #185 Top 3:

Qwen3 Max — 92.5 points, -90% decay
Claude Sonnet 4.6 — 90 points, -80% decay
DeepSeek V4 Pro — 87.5 points, -70% decay

The top-ranked models cluster tightly on absolute score but still show substantial commitment loss between R1 and R3. Qwen3 Max's -90% decay indicates that even the run's highest scorer surrenders most of its initial constraint adherence once long-document distractors are introduced in R2 — a pattern consistent with prior WDCD runs in which absolute accuracy and decay magnitude do not move in lockstep.

At the opposite end of the decay range, GPT-o3 recorded -10%, the smallest drop in the cohort. That figure reflects a low ceiling rather than robustness: its R1 baseline was already constrained, leaving less room to decay. The clearest outlier was 文心一言 4.5 (ERNIE Bot 4.5), which recorded -111.1%. A decay beyond -100% indicates that the model's measured commitment inverted relative to its R1 baseline under the WDCD scoring rubric, the most notable behavioral anomaly of this run.
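The two edge cases above can be illustrated with made-up numbers. This sketch assumes decay is the relative change from R1 to R3 and that the rubric can assign negative points; neither assumption is confirmed by the article, and the scores below are hypothetical, chosen only to reproduce the reported percentages:

```python
def relative_change_pct(r1: float, r3: float) -> float:
    # Assumed definition: relative change from the R1 score to the R3 score, in percent.
    return (r3 - r1) * 100 / r1


# Hypothetical low-ceiling model: weak in R1, so there is little left to lose.
low_ceiling = relative_change_pct(r1=20, r3=18)

# Hypothetical inverted model: R3 lands below zero, which is only possible
# if the rubric can assign negative points (an assumption).
inverted = relative_change_pct(r1=45, r3=-5)

print(round(low_ceiling, 1), round(inverted, 1))  # -10.0 -111.1
```

Under these assumptions, a small decay says nothing about absolute quality, and any value past -100% requires the R3 score to change sign relative to R1.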

Across the 11-model cohort, the -57.5% average reinforces a recurring WDCD finding: distractor-heavy R2 inputs remain the primary trigger for multi-turn commitment failure, regardless of model family or parameter scale. Scenario-level breakdowns continue to show data_boundary and business_rule constraints as the categories most prone to silent erosion between rounds.
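A scenario-level breakdown of this kind could be computed along the following lines. This is a sketch only: the record layout, the `decay_by_scenario` helper, and the sample records are hypothetical, and decay is again assumed to be the relative change in pass rate from R1 to R3. Only the five scenario names come from the benchmark description.

```python
from collections import defaultdict
from statistics import mean

SCENARIOS = ("data_boundary", "resource_limit", "business_rule", "security", "engineering")


def decay_by_scenario(records):
    """Compare R1 and R3 pass rates per scenario.

    Each record is (scenario, r1_pass, r3_pass) with boolean pass flags.
    Returns the relative change in percent, or None when nothing passed in R1.
    """
    grouped = defaultdict(list)
    for scenario, r1_pass, r3_pass in records:
        if scenario not in SCENARIOS:
            raise ValueError(f"unknown scenario: {scenario}")
        grouped[scenario].append((r1_pass, r3_pass))

    summary = {}
    for scenario, pairs in grouped.items():
        r1_rate = mean(int(r1) for r1, _ in pairs)
        r3_rate = mean(int(r3) for _, r3 in pairs)
        summary[scenario] = None if r1_rate == 0 else (r3_rate - r1_rate) * 100 / r1_rate
    return summary


# Hypothetical per-question results, not taken from Run #185.
records = [
    ("data_boundary", True, False),
    ("data_boundary", True, True),
    ("security", True, True),
    ("security", True, True),
]
print(decay_by_scenario(records))  # {'data_boundary': -50.0, 'security': 0.0}
```

Grouping before averaging keeps a scenario with few questions from being drowned out by larger ones, which matters when looking for categories that erode quietly.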

Full scoring methodology: https://www.winzheng.com/yz-index/methodology

Run #185 raw data API: https://www.winzheng.com/yz-index/api/v1/dcd
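For readers who want to pull the raw data programmatically, a minimal sketch using only the Python standard library follows. It assumes the endpoint answers a plain GET with a JSON body and needs no authentication or query parameters; none of that is documented in this article, and the helper names are hypothetical. The code deliberately makes no assumption about the response schema.

```python
import json
from urllib.request import Request, urlopen

RAW_DATA_URL = "https://www.winzheng.com/yz-index/api/v1/dcd"


def parse_payload(body: bytes):
    """Decode a JSON response body without assuming any particular schema."""
    return json.loads(body.decode("utf-8"))


def fetch_raw_data(url: str = RAW_DATA_URL, timeout: float = 10.0):
    """GET the endpoint and return the parsed JSON (assumes a JSON response)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return parse_payload(response.read())
```

Calling `fetch_raw_data()` and inspecting the top-level structure first is the safer route, since field names are not described here.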
