WDCD Run #196: Average Instruction Decay Hits -39.9%, Qwen3 Max Leads Despite -90% Drop

Jun 24, 2026 | approx. 6 min read | Winzheng Research Lab

WDCD AI benchmark instruction decay multi-turn commitment test

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions decays over multi-turn dialogue. In Run #196, conducted on 2026-06-24, the average instruction decay across 11 evaluated models reached -39.9% between Round 1 and Round 3.

WDCD uses a three-round structure. R1 verifies that the model acknowledges the initial instruction, R2 tests distractor resistance by injecting professional documents of 2,000–5,000 words, and R3 performs a final constraint integrity check. Scoring is 100% rule-based, with no AI judges, and covers 30 questions across five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering.
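To make the three-round, rule-based idea concrete, here is a minimal sketch of how one test case might be checked without an AI judge. Everything in it is an assumption for illustration: the `Case` class, the `complies` and `score_rounds` helpers, the regex rule, and the sample replies are hypothetical and are not taken from the WDCD methodology.

```python
import re
from dataclasses import dataclass


@dataclass
class Case:
    """One hypothetical test case: an instruction plus a rule that detects violations."""
    scenario: str      # e.g. one of the five scenario names listed above
    instruction: str   # the constraint given to the model in R1
    forbidden: str     # regex that must NOT match a compliant reply


def complies(case: Case, reply: str) -> bool:
    """Rule-based check: the reply complies if the forbidden pattern is absent."""
    return re.search(case.forbidden, reply, flags=re.IGNORECASE) is None


def score_rounds(case: Case, replies: dict) -> dict:
    """Apply the same rule to the replies from R1, R2 and R3."""
    return {rnd: complies(case, replies[rnd]) for rnd in ("R1", "R2", "R3")}


# Illustrative data only (not from Run #196).
case = Case(
    scenario="data_boundary",
    instruction="Never include customer email addresses in your answers.",
    forbidden=r"[\w.+-]+@[\w-]+\.[\w.]+",
)
replies = {
    "R1": "Understood. I will not include any email addresses.",
    "R2": "The report covers Q2 churn by region; no contact details are needed.",
    "R3": "You can reach the account owner at jane.doe@example.com.",
}

result = score_rounds(case, replies)
print(result)
```

In this made-up case the model holds the constraint through R1 and R2 and breaks it in R3, which is the kind of late failure the benchmark is designed to surface.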

Run #196 Top 3:

Qwen3 Max — 92.5 pts (decay: -90%)
Gemini 3.1 Pro — 87.5 pts (decay: -70%)
Grok 4 — 82.5 pts (decay: -30%)

The leaderboard reveals a notable tension between absolute score and decay resistance. Qwen3 Max retained the top position on cumulative points, but its -90% decay curve indicates that most of its scoring strength was concentrated in Round 1, with sharp degradation under distractor pressure and final constraint checks. Gemini 3.1 Pro followed a similar but less severe trajectory at -70%, while Grok 4 demonstrated a flatter multi-turn commitment profile at -30%.
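The article does not spell out the decay formula. As a rough sketch only, the snippet below assumes decay is the signed relative change in compliance from R1 to R3, so that a negative value means compliance dropped, which is how the leaderboard figures above are read. The function name `decay_pct` and the per-round scores are hypothetical; the published methodology may define decay differently.

```python
def decay_pct(r1_score: float, r3_score: float) -> float:
    """Signed relative change from R1 to R3, in percent (assumed definition).

    Negative: compliance fell between R1 and R3.
    Positive: compliance rose between R1 and R3.
    """
    if r1_score == 0:
        raise ValueError("R1 score must be non-zero to compute a relative change")
    return 100.0 * (r3_score - r1_score) / r1_score


# Hypothetical per-round scores, chosen only to illustrate the arithmetic.
steep = decay_pct(r1_score=10.0, r3_score=1.0)   # loses 9 of 10 points -> -90.0
flat = decay_pct(r1_score=10.0, r3_score=7.0)    # loses 3 of 10 points -> -30.0

print(f"steep profile: {steep:.1f}%")
print(f"flat profile:  {flat:.1f}%")
```

Under this assumed definition, a model can still rank first on total points while showing a steep decay figure, provided most of its points were earned in R1.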

Decay extremes: Among the 11 models tested, GPT-o3 registered the worst decay within the lower-scoring cohort, at -30% relative to its own R1 baseline, indicating it lost a substantial portion of instruction compliance by R3. In contrast, Doubao Pro (豆包 Pro) recorded the best decay resistance in this run at -166.7%. Under WDCD's formula, this negative-decay reading indicates that the model actually improved its constraint adherence between R1 and R3, a rare pattern typically associated with models that under-commit in early rounds but stabilize once context is fully loaded.

Top-ranked Qwen3 Max leads third-placed Grok 4 by only 10 points, yet their decay figures differ by 60 percentage points. This reinforces a consistent WDCD finding: high R1 scores do not predict R3 integrity. Models optimized for single-turn instruction following continue to show vulnerability once professional-length distractor documents are introduced in R2.
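The comparison above can be reproduced directly from the published Top 3 figures. The snippet below is a small illustrative check, not part of the WDCD tooling; the `top3` dictionary layout and variable names are assumptions, and the ordering by decay treats a less negative value as better, following the leaderboard's reading.

```python
# Top 3 figures as reported for Run #196.
top3 = {
    "Qwen3 Max": {"score": 92.5, "decay_pct": -90.0},
    "Gemini 3.1 Pro": {"score": 87.5, "decay_pct": -70.0},
    "Grok 4": {"score": 82.5, "decay_pct": -30.0},
}

# Gap between first and third place, in points and in decay percentage points.
score_gap = top3["Qwen3 Max"]["score"] - top3["Grok 4"]["score"]
decay_gap = abs(top3["Qwen3 Max"]["decay_pct"] - top3["Grok 4"]["decay_pct"])

# Rank by total score (highest first) and by decay (least negative first).
by_score = sorted(top3, key=lambda m: top3[m]["score"], reverse=True)
by_decay = sorted(top3, key=lambda m: top3[m]["decay_pct"], reverse=True)

print(f"score gap: {score_gap} pts, decay gap: {decay_gap} pp")
print("by score:", by_score)
print("by decay:", by_decay)
```

Among these three models the two orderings are exact reverses of each other, which is the tension between absolute score and decay resistance that the leaderboard shows.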

Full methodology: https://www.winzheng.com/yz-index/methodology

Raw data API: https://www.winzheng.com/yz-index/api/v1/dcd
