WDCD Run #135: Qwen3 Max Leads with Only 10% Instruction Decay as Field Average Hits 43.3%

WDCD Run #135: Qwen3 Max Leads with Only 10% Instruction Decay as Field Average Hits 43.3%
WDCD Run #135 (2026-05-27) evaluated 11 large language models across three dialogue rounds, finding an average commitment decay of 43.3%. Qwen3 Max topped the leaderboard with 72.5 points and just 10% decay, while Grok 4 recorded the steepest drop at 70%.

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how reliably large language models hold onto user instructions across multi-turn dialogue. In Run #135, executed on 2026-05-27 across 11 models, the field averaged 43.3% commitment decay between Round 1 and Round 3 — confirming that instruction decay remains a structural weakness across the current generation of frontier models.

WDCD evaluates multi-turn commitment through three rounds: Round 1 verifies initial instruction acknowledgment, Round 2 introduces distractor content via 2,000–5,000 word professional documents, and Round 3 performs a final constraint integrity check. Scoring is 100% rule-based, with zero AI judges in the loop. The 30-question set spans five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering.

Run #135 Top 3:

  • Qwen3 Max — 72.5 points, –10% decay (best decay resistance in the run)
  • Claude Sonnet 4.6 — 65 points, –30% decay
  • DeepSeek V4 Pro — 62.5 points, –40% decay

Qwen3 Max's 10% decay figure is notable because it sits well below the 43.3% field average, indicating that its instruction-following held nearly intact even after the Round 2 distractor documents were injected. Claude Sonnet 4.6 and DeepSeek V4 Pro followed a more typical pattern: strong initial acknowledgment in R1, partial slippage under distractor load in R2, and measurable constraint erosion by R3.

At the other end of the distribution, Grok 4 recorded the worst result of the run with –70% decay, meaning the model retained less than a third of its initial commitment strength by the final round. This represents the widest gap between the leading and trailing models observed in the current cohort, and underscores how unevenly instruction decay is distributed across model families.

Two broader patterns are visible in Run #135. First, raw Round 1 scores are a poor predictor of final standing — several models that acknowledged instructions cleanly in R1 lost more than half their commitment by R3. Second, decay resistance is becoming a stronger differentiator than initial capability: the spread between Qwen3 Max (–10%) and Grok 4 (–70%) is significantly larger than the spread in their R1 acknowledgment behavior.

Full methodology and scoring rubric: https://www.winzheng.com/yz-index/methodology

Machine-readable run data: https://www.winzheng.com/yz-index/api/v1/dcd