The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions decays across multi-turn dialogue. In Run #207, executed on 2026-07-01 across 11 models, the average commitment decay from Round 1 to Round 3 reached -66.3%, confirming that instruction decay remains the dominant failure mode in extended professional conversations.
WDCD evaluates models across three sequential rounds: R1 establishes instruction acknowledgment, R2 inserts professional distractor documents of 2,000 to 5,000 words to test resistance, and R3 performs a final constraint integrity check. Scoring is entirely rule-based, with no AI judges, and is applied to 30 questions spanning five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering.
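The three-round structure and the rule-based check can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the `Rule` type, the `max_four_workers` constraint, the sample replies, and in particular the `relative_change_pct` formula, which assumes decay is a simple percent change from R1 to R3. The actual rules and the actual decay definition are those of the WDCD methodology, not this sketch.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical rule type: returns True when a reply still honors the constraint.
Rule = Callable[[str], bool]


@dataclass
class RoundResult:
    name: str     # "R1", "R2", or "R3"
    passed: bool  # outcome of the deterministic rule check for this round


def run_three_rounds(replies: List[str], rule: Rule) -> List[RoundResult]:
    """Apply one deterministic rule to the R1, R2, and R3 replies, in order."""
    if len(replies) != 3:
        raise ValueError("expected exactly three replies (R1, R2, R3)")
    return [RoundResult(n, rule(r)) for n, r in zip(("R1", "R2", "R3"), replies)]


def relative_change_pct(r1_score: float, r3_score: float) -> float:
    """Assumed decay formula: percent change from R1 to R3 (negative = loss)."""
    if r1_score == 0:
        raise ValueError("R1 score must be non-zero")
    return (r3_score - r1_score) / r1_score * 100.0


def max_four_workers(reply: str) -> bool:
    """Illustrative resource_limit-style rule: never mention more than 4 workers."""
    counts = [int(n) for n in re.findall(r"(\d+)\s+workers", reply)]
    return all(c <= 4 for c in counts)


# Invented replies: the constraint holds in R1 and R2, then breaks in R3.
replies = [
    "Understood, I will keep the pool at 4 workers.",
    "Summary of the document: the pool stays at 4 workers.",
    "To speed this up, scale to 16 workers.",
]
results = run_three_rounds(replies, max_four_workers)
```

Because the check is a plain function over the reply text, the same input always produces the same verdict, which is the property a judge-free scoring scheme relies on.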
Run #207 Top 3 Rankings:
- Grok 4 — 100 points, decay -100%
- 豆包 Pro — 92.5 points, decay -111.1%
- Claude Opus 4.7 — 90 points, decay -90%
Grok 4 retained the leaderboard top position with a perfect 100-point score, demonstrating consistent constraint-holding behavior across all five scenario categories. 豆包 Pro recorded the strongest decay resistance of the run at -111.1%, indicating that its multi-turn commitment under document-induced distraction was the most stable in the cohort. Claude Opus 4.7 rounded out the top three at 90 points with -90% decay.
At the opposite end, GPT-5.5 recorded the worst decay profile at -0%, the weakest multi-turn commitment trajectory in this run. The gap between the best and worst decay resistance underscores that raw R1 acknowledgment quality is a poor predictor of R3 constraint integrity, a pattern WDCD has consistently surfaced across runs.
Notable patterns from Run #207:
- The cohort-wide average decay of -66.3% indicates that, averaged across the 11 models, commitment falls by roughly two-thirds between R1 and R3 under realistic professional document loads.
- Top-tier models cluster tightly between 90 and 100 points, while the lower band shows steep drop-offs driven primarily by R2 distractor failures.
- Decay resistance and final score are not perfectly correlated: 豆包 Pro outperformed Grok 4 on decay resistance despite a lower total score, indicating different optimization profiles.
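The divergence between score order and decay order can be checked directly from the three published rows. The sketch below uses only the Top 3 figures listed above. It assumes, following this report's own description of -111.1% as the strongest result and -0% as the weakest, that a more negative decay figure ranks higher on decay resistance. The tuple layout is an illustrative choice, not the API's schema.

```python
# (model, total points, decay %) as listed in the Run #207 Top 3.
top3 = [
    ("Grok 4", 100.0, -100.0),
    ("豆包 Pro", 92.5, -111.1),
    ("Claude Opus 4.7", 90.0, -90.0),
]

# Rank by total points, highest first.
by_score = [name for name, _, _ in sorted(top3, key=lambda r: r[1], reverse=True)]

# Rank by decay figure, most negative first (assumed ordering, see lead-in).
by_decay = [name for name, _, _ in sorted(top3, key=lambda r: r[2])]

# True when the two orderings disagree.
rankings_differ = by_score != by_decay
```

With these rows the two orderings swap Grok 4 and 豆包 Pro, while Claude Opus 4.7 stays third in both.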
For full scoring rubrics, scenario definitions, and round mechanics, see the WDCD methodology: https://www.winzheng.com/yz-index/methodology
Structured run data is available via the public API: https://www.winzheng.com/yz-index/api/v1/dcd
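A minimal way to pull the run data might look like the following sketch. It assumes the endpoint answers unauthenticated GET requests with a JSON body. Neither that assumption nor any response field is confirmed here, so the code decodes the payload without naming keys. The helper names `build_request` and `fetch_dcd` are hypothetical.

```python
import json
import urllib.request

# Endpoint URL as given in this report.
DCD_API_URL = "https://www.winzheng.com/yz-index/api/v1/dcd"


def build_request(url: str = DCD_API_URL) -> urllib.request.Request:
    """Build a GET request; assumes the endpoint serves JSON without auth."""
    return urllib.request.Request(url, headers={"Accept": "application/json"})


def fetch_dcd(url: str = DCD_API_URL, timeout: float = 10.0):
    """Fetch and decode the payload; no response schema is assumed."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_dcd()` and inspecting the type and top-level keys of the result is a sensible first step before relying on any particular field.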
© 2026 Winzheng.com 赢政天下 | When republishing, please credit the source and include a link to the original article.