WDCD Run #157: Average Instruction Decay Hits 47.7% Across 11 Models, Three-Way Tie at the Top

Jun 10, 2026 821 approx.6min Winzheng Research Lab

WDCD AI benchmark instruction decay multi-turn commitment test

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions degrades over multi-turn dialogue, using 100% rule-based scoring with zero AI judges. In Run #157, completed on 2026-06-10 across 11 models, the average commitment decay from Round 1 to Round 3 reached 47.7%, with three models tying at the top.

Top of the leaderboard. Three models finished at 67.5 points each:

Claude Sonnet 4.6 — 67.5 pts, -30% decay
Gemini 2.5 Pro — 67.5 pts, -20% decay
Qwen3 Max — 67.5 pts, -30% decay

Among the three, Gemini 2.5 Pro shows the strongest multi-turn commitment, losing only 20% of its initial constraint adherence by Round 3. Claude Sonnet 4.6 and Qwen3 Max reached the same score via slightly higher Round 1 ceilings offset by steeper -30% drops.

Decay extremes. The widest spread in this run was between the worst and best decay performers:

Worst decay: Grok 4 at -90%, meaning nearly all initial instruction commitments collapsed after the Round 2 distractor documents.
Best decay resistance: 豆包 Pro at -16.7%, the most stable model across the three rounds despite not reaching the top-3 score bracket.

This divergence highlights that raw Round 1 acknowledgment quality does not predict downstream stability. A model can acknowledge constraints cleanly and still lose them once 2000–5000 word professional documents are injected in Round 2.

How WDCD works. Each run consists of three rounds: R1 verifies instruction acknowledgment, R2 tests distractor resistance after long professional documents are inserted into context, and R3 performs a final constraint integrity check. The 30-question suite spans five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering. Scoring is fully deterministic and rule-based.

Pattern notes. A 47.7% average decay across 11 models indicates that instruction decay remains a structural weakness rather than a model-specific defect. The clustering of three frontier models at an identical 67.5-point ceiling also suggests current top-tier systems hit a similar ceiling on multi-turn commitment when evaluated under deterministic rules — differences emerge primarily in the slope of decay, not the peak.

Full methodology: https://www.winzheng.com/yz-index/methodology
Raw data API: https://www.winzheng.com/yz-index/api/v1/dcd

Related Articles