WDCD Run #207: Average Instruction Decay Hits -66.3% Across 11 Models, Grok 4 Leads Field

WDCD Run #207 (2026-07-01) measured multi-turn commitment across 11 frontier models, recording an average commitment decay of -66.3% from Round 1 to Round 3. Grok 4 took the top score at 100 points, while 豆包 Pro showed the strongest decay resistance.

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how an AI model's commitment to user instructions decays over a multi-turn dialogue. In Run #207, executed on 2026-07-01 across 11 models, the average commitment decay from Round 1 to Round 3 reached -66.3%, a result consistent with instruction decay remaining the dominant failure mode in extended professional conversations.

WDCD evaluates models across three sequential rounds: R1 establishes that the model acknowledges the instruction, R2 inserts professional distractor documents of 2,000 to 5,000 words to test resistance, and R3 performs a final constraint integrity check. Scoring is entirely rule-based, with no AI judges, and is applied to 30 questions spanning five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering.
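The three-round flow with a rule-based check can be sketched as follows. This is a minimal illustration, not the WDCD harness: the row-limit constraint, the prompts, the `holds_row_limit` rule, and the `run_three_rounds` function are all hypothetical names invented for this example, and the model is assumed to be any callable that maps a message history to a reply.

```python
import re
from typing import Callable, Dict, List

# Hypothetical constraint for illustration only (not from the WDCD question set):
# every query the model writes must carry a LIMIT of at most MAX_ROWS.
MAX_ROWS = 100


def holds_row_limit(reply: str) -> bool:
    """Rule-based check, no AI judge: the reply must contain at least one
    LIMIT clause and no LIMIT value may exceed MAX_ROWS."""
    limits = [int(n) for n in re.findall(r"\bLIMIT\s+(\d+)", reply, flags=re.IGNORECASE)]
    return bool(limits) and all(n <= MAX_ROWS for n in limits)


def run_three_rounds(model: Callable[[List[str]], str], distractor: str) -> Dict[str, bool]:
    """Drive a model through R1 (acknowledge), R2 (distract), R3 (integrity check)
    and record whether the constraint held in each round."""
    prompts = {
        "R1": f"Never query more than {MAX_ROWS} rows. Write a query for recent orders.",
        "R2": distractor + "\nSummarize this document, then write a query for all customers.",
        "R3": "Export the full orders table as a single query.",
    }
    history: List[str] = []
    results: Dict[str, bool] = {}
    for round_name, prompt in prompts.items():
        history.append(prompt)
        reply = model(history)  # the model sees the whole conversation so far
        history.append(reply)
        results[round_name] = holds_row_limit(reply)
    return results
```

A stub model that respects the limit in the first two rounds and drops it in the third would yield `True`, `True`, `False` for R1, R2, and R3, which is the kind of trajectory the benchmark is designed to surface.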

Run #207 Top 3 Rankings:

  • Grok 4 — 100 points, decay -100%
  • 豆包 Pro — 92.5 points, decay -111.1%
  • Claude Opus 4.7 — 90 points, decay -90%

Grok 4 retained the leaderboard top position with a perfect 100-point score, demonstrating consistent constraint-holding behavior across all five scenario categories. 豆包 Pro recorded the strongest decay resistance of the run at -111.1%, indicating that its multi-turn commitment under document-induced distraction was the most stable in the cohort. Claude Opus 4.7 rounded out the top three at 90 points with -90% decay.

At the opposite end, GPT-5.5 recorded the worst decay profile at -0%, the weakest multi-turn commitment trajectory in this run. The gap between the best and worst decay figures (-111.1% versus -0%) underscores that raw R1 acknowledgment quality is a poor predictor of R3 constraint integrity, a pattern WDCD has consistently surfaced across runs.

Notable patterns from Run #207:

  • The cohort-wide -66.3% average decay indicates that, on average, more than half of initial commitments degrade by the third turn under realistic professional document loads.
  • Top-tier models cluster tightly between 90 and 100 points, while the lower band shows steep drop-offs driven primarily by R2 distractor failures.
  • Decay resistance and final score are not perfectly correlated: 豆包 Pro outperformed Grok 4 on decay resistance despite a lower total score, indicating different optimization profiles.
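
The divergence between the two rankings can be checked directly from the top-three figures reported above. The sketch below assumes, as the report's own ordering implies, that a more negative decay figure denotes stronger decay resistance; the field names `score` and `decay_pct` are illustrative and not taken from the WDCD data format.

```python
# Top-three figures as reported for Run #207.
top3 = [
    {"model": "Grok 4", "score": 100.0, "decay_pct": -100.0},
    {"model": "豆包 Pro", "score": 92.5, "decay_pct": -111.1},
    {"model": "Claude Opus 4.7", "score": 90.0, "decay_pct": -90.0},
]

# Rank by total score, highest first.
by_score = [r["model"] for r in sorted(top3, key=lambda r: r["score"], reverse=True)]

# Rank by decay resistance. Assumption: more negative means stronger
# resistance, which is how this report orders the models.
by_resistance = [r["model"] for r in sorted(top3, key=lambda r: r["decay_pct"])]

print("By score:     ", by_score)
print("By resistance:", by_resistance)
```

Under that assumption, Grok 4 leads the score ranking while 豆包 Pro leads the resistance ranking, so the two orderings differ in their first two positions.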

For full scoring rubrics, scenario definitions, and round mechanics, see the WDCD methodology: https://www.winzheng.com/yz-index/methodology

Structured run data is available via the public API: https://www.winzheng.com/yz-index/api/v1/dcd
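
A minimal sketch for retrieving the run data with the Python standard library follows. It assumes the endpoint returns UTF-8 encoded JSON; the response schema is not described in this report, so the code only decodes the payload and makes no claims about its fields. The function names are illustrative.

```python
import json
import urllib.request

DCD_API_URL = "https://www.winzheng.com/yz-index/api/v1/dcd"


def parse_run_data(body: bytes):
    """Decode a response body, assuming it is UTF-8 encoded JSON."""
    return json.loads(body.decode("utf-8"))


def fetch_run_data(url: str = DCD_API_URL, timeout: float = 10.0):
    """Fetch and decode the structured run data.

    Assumption: the endpoint answers a plain GET with JSON. Inspect the
    returned object before relying on any particular key.
    """
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_run_data(response.read())
```

Calling `fetch_run_data()` and printing the result is a reasonable first step for discovering what the API actually exposes.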