WDCD Run #157: Average Instruction Decay Hits 47.7% Across 11 Models, Three-Way Tie at the Top

WDCD Run #157 (2026-06-10) recorded a 47.7% average commitment decay across 11 models, with Claude Sonnet 4.6, Gemini 2.5 Pro, and Qwen3 Max tying for first at 67.5 points.

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how an AI model's commitment to user instructions degrades over a multi-turn dialogue. Scoring is 100% rule-based, with no AI judges. In Run #157, completed on 2026-06-10, the average commitment decay from Round 1 to Round 3 reached 47.7% across the 11 models tested, and three models tied for the top score.

Top of the leaderboard. Three models finished at 67.5 points each:

  • Claude Sonnet 4.6 — 67.5 pts, -30% decay
  • Gemini 2.5 Pro — 67.5 pts, -20% decay
  • Qwen3 Max — 67.5 pts, -30% decay

Among the three, Gemini 2.5 Pro shows the strongest multi-turn commitment, losing only 20% of its initial constraint adherence by Round 3. Claude Sonnet 4.6 and Qwen3 Max reached the same score via slightly higher Round 1 ceilings offset by steeper -30% drops.
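The decay figures can be read as the relative change in constraint adherence between Round 1 and Round 3. The sketch below shows that reading; the formula and the function name relative_decay are assumptions made for illustration, since the report does not spell out the exact calculation, and the input counts are made-up examples rather than Run #157 data.

```python
def relative_decay(r1_adherence: float, r3_adherence: float) -> float:
    """Percentage change in adherence from Round 1 to Round 3.

    Assumed formula (not confirmed by the WDCD methodology):
        decay = (R3 - R1) / R1 * 100
    A negative result means adherence was lost.
    """
    if r1_adherence <= 0:
        raise ValueError("Round 1 adherence must be positive")
    return round((r3_adherence - r1_adherence) / r1_adherence * 100, 1)


# Illustrative inputs only: 10 constraints honored in Round 1,
# then 8 or 7 of them still honored in Round 3.
decay_mild = relative_decay(10, 8)
decay_steep = relative_decay(10, 7)
print(decay_mild, decay_steep)
```

Under this reading, a model that still honors 8 of 10 initial constraints in Round 3 shows -20% decay, and one that honors 7 of 10 shows -30%.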

Decay extremes. Decay ranged from -16.7% to -90%, a spread of 73.3 percentage points between the most and least stable models:

  • Worst decay: Grok 4 at -90%, meaning nearly all initial instruction commitments collapsed after the Round 2 distractor documents.
  • Best decay resistance: Doubao Pro (豆包 Pro) at -16.7%, the most stable model across the three rounds even though it did not reach the top-3 score bracket.
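The extremes can be checked directly from the decay figures quoted in this report. The snippet below is a small sketch using only the five models whose decay is given here; the other six models in the run are not listed, so it makes no attempt to reproduce the 47.7% average. The variable names are illustrative.

```python
# Decay figures as quoted in this report (five of the 11 models).
reported_decay = {
    "Claude Sonnet 4.6": -30.0,
    "Gemini 2.5 Pro": -20.0,
    "Qwen3 Max": -30.0,
    "Grok 4": -90.0,
    "Doubao Pro": -16.7,  # listed in the source as 豆包 Pro
}

# Decay values are negative, so the largest value marks the most stable model.
most_stable = max(reported_decay, key=reported_decay.get)
least_stable = min(reported_decay, key=reported_decay.get)

# Spread in percentage points between the two extremes.
spread = round(reported_decay[most_stable] - reported_decay[least_stable], 1)

print(most_stable, least_stable, spread)
```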

This divergence shows that the quality of a model's Round 1 acknowledgment does not predict its downstream stability. A model can acknowledge constraints cleanly and still lose them once professional documents of 2,000 to 5,000 words are injected in Round 2.

How WDCD works. Each run consists of three rounds: R1 verifies instruction acknowledgment, R2 tests distractor resistance after long professional documents are inserted into context, and R3 performs a final constraint integrity check. The 30-question suite spans five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering. Scoring is fully deterministic and rule-based.
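To make the idea of deterministic, rule-based scoring concrete, here is a minimal sketch of how a three-round check could work. Everything in it is hypothetical: the Rule class, the two example constraints, the adherence function, and the sample responses are invented for illustration and are not taken from the WDCD suite or its scoring code.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[str], bool]  # True when the response still honors the constraint


# Hypothetical constraints a user might set in Round 1.
RULES = [
    Rule("no_email_addresses",
         lambda text: re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text) is None),
    Rule("max_50_words", lambda text: len(text.split()) <= 50),
]


def adherence(response: str, rules: list = RULES) -> float:
    """Fraction of rules the response satisfies; no model-based judging involved."""
    passed = sum(rule.check(response) for rule in rules)
    return passed / len(rules)


# Invented responses standing in for R1 (acknowledgment), R2 (after distractor
# documents), and R3 (final integrity check).
responses = {
    "R1": "Understood. I will keep replies short and omit contact details.",
    "R2": "Summary of the document: the vendor can be reached at sales@example.com.",
    "R3": "Final answer: contact sales@example.com for the full pricing table.",
}

scores = {round_name: adherence(text) for round_name, text in responses.items()}
print(scores)
```

Because each rule is a plain predicate over the response text, the same transcript always yields the same score, which is the property the report attributes to WDCD's scoring.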

Pattern notes. A 47.7% average decay across 11 models suggests that instruction decay is a structural weakness shared across models rather than a defect of any one of them. The clustering of three frontier models at an identical 67.5 points also suggests that current top-tier systems hit a similar ceiling on multi-turn commitment when evaluated under deterministic rules; they differ mainly in the slope of decay, not the peak.

Full methodology: https://www.winzheng.com/yz-index/methodology
Raw data API: https://www.winzheng.com/yz-index/api/v1/dcd