The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how reliably AI models maintain user-specified constraints across multi-turn dialogue. In Run #140, conducted on 2026-05-31 across 11 models, the average commitment decay from Round 1 to Round 3 reached 36.5% — confirming that instruction decay remains a structural weakness in current frontier systems.
WDCD scores are produced through a three-round protocol: R1 verifies initial instruction acknowledgment, R2 introduces professional distractor documents of 2,000–5,000 words to test resistance, and R3 performs a final constraint integrity check. All 30 questions, spread across five scenarios (data_boundary, resource_limit, business_rule, security, and engineering), are scored entirely by rules, with no AI judges.
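To make the protocol concrete, here is a minimal Python sketch of rule-based scoring across rounds. It is an illustration under stated assumptions, not the WDCD implementation: the `Rule` class, the regex patterns, the sample replies, and the relative-change formula in `decay_percent` are all hypothetical, and the published methodology may define decay differently.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical rule: a reply violates the constraint if `forbidden` matches."""
    name: str
    forbidden: re.Pattern


def passes(rule: Rule, reply: str) -> bool:
    # Purely rule-based check: no model is asked to judge the reply.
    return rule.forbidden.search(reply) is None


def round_pass_rate(rule_reply_pairs) -> float:
    """Percentage of (rule, reply) pairs in one round that keep the constraint."""
    results = [passes(rule, reply) for rule, reply in rule_reply_pairs]
    return 100.0 * sum(results) / len(results)


def decay_percent(r1: float, r3: float) -> float:
    """Assumed formula: relative change from R1 to R3; negative means erosion."""
    if r1 == 0:
        raise ValueError("R1 score must be non-zero")
    return 100.0 * (r3 - r1) / r1


# Illustrative constraints and replies (invented for this sketch).
no_star = Rule("no_select_star", re.compile(r"SELECT\s+\*", re.IGNORECASE))
no_rm = Rule("no_rm_rf", re.compile(r"rm\s+-rf"))

r1 = round_pass_rate([
    (no_star, "SELECT id FROM users"),
    (no_rm, "Use trash-cli to remove the files."),
])
r3 = round_pass_rate([
    (no_star, "SELECT * FROM users"),  # constraint dropped after the distractor
    (no_rm, "Move the files to the trash instead."),
])
decay = decay_percent(r1, r3)
```

In this toy case both constraints hold in R1 and one is dropped in R3, so the pass rate halves. The R2 distractor step is omitted because it only adds context between the two checks.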
Top 3 results in Run #140:
- Qwen3 Max — 70.8 pts, decay -17%
- Claude Sonnet 4.6 — 66.7 pts, decay -30%
- Gemini 3.1 Pro — 66.7 pts, decay -23%
Qwen3 Max delivered both the highest absolute score and the strongest decay resistance in the cohort, holding 83% of its initial commitment posture through R3. Claude Sonnet 4.6 and Gemini 3.1 Pro tied on points but diverged on stability, with Gemini 3.1 Pro losing 7 percentage points less under distractor pressure.
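The retention and gap figures above follow directly from the published decay values. A short Python sketch of that arithmetic, assuming (as the 83% figure implies) that retention is simply 100 plus the signed decay percentage; the helper name `retention` is ours:

```python
# Published Run #140 figures for the top three: (points, decay in percent).
top3 = {
    "Qwen3 Max": (70.8, -17),
    "Claude Sonnet 4.6": (66.7, -30),
    "Gemini 3.1 Pro": (66.7, -23),
}


def retention(decay_pct: int) -> int:
    """Share of the R1 commitment still held at R3, in percent."""
    return 100 + decay_pct


retained = {model: retention(decay) for model, (_, decay) in top3.items()}

# Stability gap between the two models tied on points, in percentage points.
gap = retained["Gemini 3.1 Pro"] - retained["Claude Sonnet 4.6"]
```

Here `retained` maps Qwen3 Max to 83, Gemini 3.1 Pro to 77, and Claude Sonnet 4.6 to 70, and `gap` is the 7-point difference cited above.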
At the bottom of the decay distribution, Grok 4 recorded a -83% decay, the worst figure of the run. This indicates near-total erosion of multi-turn commitment after exposure to the R2 professional document payload — a pattern consistent with models that acknowledge constraints fluently in R1 but fail to re-anchor them once long intervening context is introduced.
Decay pattern observations:
- Scores in R1 across the cohort were tightly clustered; differentiation emerged almost entirely in R2 and R3.
- The tie at 66.7 points (Claude Sonnet 4.6, Gemini 3.1 Pro) may point to a ceiling for current general-purpose models that lack dedicated long-horizon constraint mechanisms.
- The gap between best (-17%) and worst (-83%) decay rates widened compared with prior runs, indicating that multi-turn commitment is becoming a primary axis of model differentiation rather than a uniform weakness.
The 36.5% average decay reinforces that benchmarks measuring only single-turn instruction following overstate real-world reliability. For deployments involving policy enforcement, security boundaries, or resource limits, R3 integrity is the metric that matters.
Full methodology: https://www.winzheng.com/yz-index/methodology
Structured data API: https://www.winzheng.com/yz-index/api/v1/dcd
© 2026 Winzheng.com 赢政天下 | When republishing, please credit the source and include a link to the original article.