WDCD Run #207: Average Instruction Decay Hits -66.3% Across 11 Models, Grok 4 Leads Field

2026-07-01 · approx. 7 min read · Winzheng Research Lab


The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how AI models' commitment to user instructions decays across multi-turn dialogue. In Run #207, executed on 2026-07-01 across 11 models, the average commitment decay from Round 1 to Round 3 reached -66.3%, confirming that instruction decay remains the dominant failure mode in extended professional conversations.
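
As a rough illustration of how a cohort average like this can be computed, the sketch below assumes that decay is the relative change in a model's commitment score between R1 and R3, averaged across models. That formula, the function names, and the sample scores are assumptions made for illustration only; the authoritative definition is the one in the WDCD methodology.

```python
from statistics import mean


def relative_decay(r1_score: float, r3_score: float) -> float:
    """Percent change from R1 to R3 (an assumed definition, not WDCD's official one)."""
    if r1_score == 0:
        raise ValueError("R1 score must be non-zero to compute a relative change")
    return (r3_score - r1_score) / r1_score * 100.0


def average_decay(score_pairs):
    """Mean of the per-model relative decay over (R1, R3) score pairs."""
    return mean(relative_decay(r1, r3) for r1, r3 in score_pairs)


# Hypothetical (R1, R3) scores for three models; these are NOT Run #207 data.
sample = [(90.0, 30.0), (80.0, 40.0), (100.0, 25.0)]
print(f"{average_decay(sample):.1f}%")
```

Under this assumed definition, a negative value means the R3 score is lower than the R1 score, and a model that holds its score across rounds would show a decay of 0%.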

WDCD evaluates models across three sequential rounds: R1 establishes instruction acknowledgment, R2 inserts professional distractor documents of 2,000 to 5,000 words to test resistance, and R3 performs a final constraint integrity check. Scoring is entirely rule-based, with no AI judges, and is applied to 30 questions spanning five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering.
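
To make the idea of rule-based scoring across the three rounds concrete, here is a minimal sketch of a deterministic check for a hypothetical resource_limit question, where the user instruction is "never return more than 100 rows". The rule, the regular expression, the `RoundResult` type, and the sample responses are all assumptions for illustration; they are not taken from the actual WDCD rubric.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RoundResult:
    round_name: str  # "R1", "R2", or "R3"
    passed: bool


def max_limit_rule(limit: int) -> Callable[[str], bool]:
    """Build a rule: the response must contain a LIMIT clause, and none may exceed `limit`."""
    pattern = re.compile(r"\bLIMIT\s+(\d+)", re.IGNORECASE)

    def check(response: str) -> bool:
        values = [int(v) for v in pattern.findall(response)]
        return bool(values) and all(v <= limit for v in values)

    return check


def score_dialogue(responses: Dict[str, str], rule: Callable[[str], bool]) -> List[RoundResult]:
    """Apply the same deterministic rule to each round's response, in order."""
    return [RoundResult(name, rule(responses[name])) for name in ("R1", "R2", "R3")]


# Hypothetical model outputs: the constraint holds in R1 and R2 but is dropped in R3.
responses = {
    "R1": "SELECT id FROM orders LIMIT 100",
    "R2": "SELECT id FROM orders WHERE region = 'EU' LIMIT 50",
    "R3": "SELECT id FROM orders LIMIT 5000",
}
results = score_dialogue(responses, max_limit_rule(100))
print([(r.round_name, r.passed) for r in results])
```

Because the check is a plain function of the response text, the same input always produces the same verdict, which is the property a scoring scheme without AI judges relies on.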

Run #207 Top 3 Rankings:

Grok 4 — 100 points, decay -100%
豆包 Pro — 92.5 points, decay -111.1%
Claude Opus 4.7 — 90 points, decay -90%

Grok 4 retained the top position on the leaderboard with a perfect 100-point score, holding constraints consistently across all five scenario categories. 豆包 Pro recorded the strongest decay resistance of the run at -111.1%, indicating that its multi-turn commitment under document-induced distraction was the most stable in the cohort. Claude Opus 4.7 rounded out the top three with 90 points and -90% decay.

At the opposite end, GPT-5.5 recorded the worst decay profile at -0%, the weakest multi-turn commitment trajectory in this run. The gap between the best and worst decay figures suggests that R1 acknowledgment quality alone is a poor predictor of R3 constraint integrity, a pattern WDCD has surfaced consistently across runs.

Notable patterns from Run #207:

The cohort-wide average decay of -66.3% means that, on average, commitment fell by roughly two-thirds between R1 and R3 under realistic professional document loads.
Top-tier models cluster tightly between 90 and 100 points, while the lower band shows steep drop-offs driven primarily by R2 distractor failures.
Decay resistance and final score are not perfectly correlated: 豆包 Pro outperformed Grok 4 on decay resistance despite a lower total score, indicating different optimization profiles.

For full scoring rubrics, scenario definitions, and round mechanics, see the WDCD methodology: https://www.winzheng.com/yz-index/methodology

Structured run data is available via the public API: https://www.winzheng.com/yz-index/api/v1/dcd
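
A minimal sketch of retrieving the run data with the Python standard library follows. It assumes the endpoint answers a plain GET request with a JSON body; the response schema is not documented here, so the function simply returns whatever JSON the server sends. The function name and the `Accept` header are illustrative choices.

```python
import json
from urllib.request import Request, urlopen

WDCD_API_URL = "https://www.winzheng.com/yz-index/api/v1/dcd"


def fetch_wdcd_data(url: str = WDCD_API_URL, timeout: float = 10.0):
    """GET the endpoint and parse the body as JSON (assumes a JSON response)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (requires network access):
# data = fetch_wdcd_data()
# print(type(data))
```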
