WDCD Run #115: Average Instruction Decay Hits 49.2% as Gemini 3.1 Pro and Qwen3 Max Tie for First

WDCD Run #115 evaluated 11 frontier models on multi-turn commitment integrity, recording a 49.2% average instruction decay from Round 1 to Round 3. Gemini 3.1 Pro and Qwen3 Max tied at 65 points with the lowest decay rates of the cohort.

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions degrades over multi-turn dialogue. In Run #115, completed on 2026-05-13, eleven models were tested and the cohort averaged a 49.2% commitment decay between Round 1 and Round 3 — the headline finding of this evaluation cycle.

WDCD scoring is fully rule-based with zero AI judges, spanning 30 questions across five real-world scenario categories: data_boundary, resource_limit, business_rule, security, and engineering. Each model proceeds through three rounds: R1 verifies initial instruction acknowledgment; R2 introduces 2,000–5,000-word professional distractor documents to test resistance; and R3 performs a final constraint integrity check.
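The decay percentages reported below can be read as the relative change in constraint adherence between R1 and R3. The exact WDCD formula is not published in this summary, so the following is a minimal sketch assuming a simple relative-change definition (the function name and score scale are illustrative):

```python
def decay_pct(r1_score: float, r3_score: float) -> float:
    """Relative change in adherence score from Round 1 to Round 3.

    Negative values mean degradation: -30.0 means the model kept
    70% of its R1-level constraint adherence by R3. This assumes
    WDCD decay is a simple relative change, which is not confirmed
    by the summary above.
    """
    return round((r3_score - r1_score) / r1_score * 100, 1)

# A model scoring 10/10 in R1 and 7/10 in R3 shows -30% decay,
# matching this run's top performers.
print(decay_pct(10, 7))  # -30.0

# Full collapse: no R1 commitments survive to R3.
print(decay_pct(10, 0))  # -100.0
```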

Run #115 Top Results

  • Gemini 3.1 Pro — 65 points, -30% decay (best decay resistance in cohort)
  • Qwen3 Max — 65 points, -30% decay (tied for first)
  • DeepSeek V4 Pro — 62.5 points, -40% decay

Gemini 3.1 Pro and Qwen3 Max share the top position with identical 65-point totals and matching -30% decay curves, indicating both models retained roughly 70% of their original constraint adherence after exposure to long distractor documents and a final adversarial check. DeepSeek V4 Pro followed in third with a moderately steeper -40% decay, suggesting somewhat weaker resistance to R2's document-induced drift.

Decay Patterns

The dominant pattern in Run #115 is that instruction decay is not uniform across the cohort — it bifurcates sharply between models that hold context and those that collapse entirely. The most extreme case was Grok 4, which recorded a -100% decay, meaning its Round 1 commitments were effectively absent by Round 3. This represents complete loss of multi-turn commitment integrity under WDCD's distractor and integrity-check protocol.

With the cohort averaging 49.2% decay, the gap between the leading models (-30%) and the worst performer (-100%) is 70 percentage points — the widest tail observed in recent runs. The clustering of Gemini 3.1 Pro and Qwen3 Max at exactly -30% suggests both vendors may be approaching a similar ceiling for current-generation context-retention techniques under WDCD's evaluation conditions.
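Using the four decay figures reported in this writeup (the other seven models' numbers are not listed here), the per-model retention and the best-to-worst spread can be tabulated with a short sketch:

```python
# Decay values reported in this run summary (percent; negative = degradation).
# Only 4 of the 11 evaluated models have published figures here.
reported = {
    "Gemini 3.1 Pro": -30.0,
    "Qwen3 Max": -30.0,
    "DeepSeek V4 Pro": -40.0,
    "Grok 4": -100.0,
}

# Retention after R3: a -30% decay leaves 70% of R1 adherence.
retention = {model: 100.0 + decay for model, decay in reported.items()}

# Best-to-worst gap in percentage points.
spread = max(reported.values()) - min(reported.values())

print(retention["Gemini 3.1 Pro"])  # 70.0
print(retention["Grok 4"])          # 0.0
print(spread)                       # 70.0
```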

Interpretation

R2 remains the decisive round. Inserting 2,000–5,000-word professional documents between instruction and verification continues to be the strongest stressor for commitment integrity, and Run #115's outcomes are largely shaped by how each model handled that segment rather than R1 acknowledgment, where most models performed adequately.

Full methodology: https://www.winzheng.com/yz-index/methodology
Structured data API: https://www.winzheng.com/yz-index/api/v1/dcd
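The structured data endpoint above presumably returns run results as JSON, but its response schema is not documented in this summary, so every field name in the sketch below is an assumption. An embedded sample payload, shaped like this run's headline numbers, stands in for a live HTTP response:

```python
import json

# Hypothetical payload mimicking Run #115's published figures -- the real
# /yz-index/api/v1/dcd schema is undocumented here, so "run", "avg_decay_pct",
# "results", "model", "points", and "decay_pct" are all assumed names.
sample = json.loads("""
{
  "run": 115,
  "avg_decay_pct": -49.2,
  "results": [
    {"model": "Gemini 3.1 Pro", "points": 65, "decay_pct": -30},
    {"model": "Qwen3 Max", "points": 65, "decay_pct": -30},
    {"model": "DeepSeek V4 Pro", "points": 62.5, "decay_pct": -40}
  ]
}
""")

# Pick out the tied leaders at 65 points.
leaders = [r["model"] for r in sample["results"] if r["points"] == 65]
print(leaders)  # ['Gemini 3.1 Pro', 'Qwen3 Max']
```

In practice the payload would come from an HTTP GET against the API URL above; the embedded sample only illustrates how a client might consume it under the assumed schema.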