WDCD Run #120: Average Instruction Decay Hits 35.2% Across 11 Models, GPT-5.5 Leads at -13%

WDCD Run #120 (2026-05-17) measured multi-turn commitment across 11 frontier models, recording an average instruction decay of 35.2% from Round 1 to Round 3. GPT-5.5 led the ranking at 71.7 points with only 13% decay.

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how reliably AI models hold onto user instructions across multi-turn dialogue. Run #120 evaluated eleven frontier models and produced an average commitment decay of 35.2% between Round 1 and Round 3, confirming that instruction decay remains a structural weakness in current frontier systems.

WDCD runs three sequential rounds: R1 establishes instruction acknowledgment, R2 inserts 2000–5000 word professional documents as distractors to test resistance, and R3 performs a final constraint integrity check. Scoring is fully rule-based, with no AI judges, over 30 questions spanning five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering.
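The decay figures reported below can be read as the relative score change from R1 to R3. A minimal sketch under that assumption (the exact WDCD formula is not stated here, so this is illustrative, not the benchmark's definition):

```python
def decay_pct(r1_score: float, r3_score: float) -> float:
    """Relative change from Round 1 to Round 3, as a signed percentage.

    Negative values mean the model lost commitment by R3.
    Assumed formula; WDCD's published definition may differ.
    """
    if r1_score == 0:
        raise ValueError("R1 score must be nonzero")
    return (r3_score - r1_score) / r1_score * 100.0

# Example: a hypothetical model scoring 80 in R1 and 69.6 in R3
# shows -13% decay under this assumed formula.
print(round(decay_pct(80.0, 69.6), 1))  # → -13.0
```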

Top 3 results, Run #120:

  • GPT-5.5 — 71.7 pts, -13% decay
  • Qwen3 Max — 67.5 pts, -14.4% decay
  • Claude Opus 4.7 — 66.7 pts, -20% decay

GPT-5.5 led the top tier on both absolute score and decay resistance. Qwen3 Max placed within 4.2 points of the leader with a comparable decay curve, while Claude Opus 4.7 held third despite a steeper -20% drop, indicating stronger R1 performance partially offset by weaker R3 retention.

Decay distribution highlights:

  • Best decay resistance: 豆包 Pro at -10%, the only model in this run to hold decay to 10% or less.
  • Worst decay: Grok 4 at -87%, indicating near-total collapse of multi-turn commitment by Round 3 under distractor pressure.

The spread between the most and least resistant models — from -10% (豆包 Pro) to -87% (Grok 4) — illustrates that headline R1 capability does not predict R3 integrity. Several models in the middle of the ranking demonstrated competent instruction acknowledgment in R1 but failed to maintain constraints once long professional documents were introduced in R2, consistent with the broader pattern observed across prior WDCD runs.

The 35.2% average decay in Run #120 reinforces that multi-turn commitment, rather than single-turn instruction following, remains the discriminating axis for evaluating model reliability in production-style workflows. Models that score similarly on R1 can diverge by more than 70 percentage points by R3.

Full methodology: https://www.winzheng.com/yz-index/methodology

Structured data API: https://www.winzheng.com/yz-index/api/v1/dcd
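A hedged sketch of working with run data from the API above. The `run` query parameter and the `name`/`score`/`decay_pct` response fields are assumptions for illustration; only the base URL comes from this report:

```python
from urllib.parse import urlencode

API_URL = "https://www.winzheng.com/yz-index/api/v1/dcd"

def run_url(run_id: int) -> str:
    """Build a request URL for one run.

    The `run` query parameter is an assumption, not documented here.
    """
    return f"{API_URL}?{urlencode({'run': run_id})}"

def rank_by_decay(models: list[dict]) -> list[str]:
    """Order model names from most to least decay-resistant,
    assuming each entry carries `name` and `decay_pct` fields."""
    return [m["name"] for m in
            sorted(models, key=lambda m: m["decay_pct"], reverse=True)]

# Sample payload mirroring the Run #120 top three (field names assumed):
sample = [
    {"name": "GPT-5.5", "score": 71.7, "decay_pct": -13.0},
    {"name": "Qwen3 Max", "score": 67.5, "decay_pct": -14.4},
    {"name": "Claude Opus 4.7", "score": 66.7, "decay_pct": -20.0},
]
print(run_url(120))
print(rank_by_decay(sample))
```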