Instruction Decay Measured: LLM Compliance Falls from 95.8% to 68.3% Under Three Rounds of Pressure

June 12, 2026 · 958 · approx. 14 min read · Winzheng Research Lab

instruction decay WDCD LLM benchmark multi-turn dialogue AI safety social engineering

Large language models are good at agreeing to constraints and much worse at keeping them. In WDCD Run #164, completed on June 11, 2026, the 11 frontier models tested acknowledged user-imposed constraints in 95.8% of first-round responses, but after one round of topic distraction and one round of direct pressure to violate the constraint, average constraint integrity fell to 68.3% — an absolute drop of 27.5 percentage points. In 73 of 330 individual tests (22.1%), the model abandoned its commitment entirely.

What WDCD measures, in one paragraph

WDCD (Winzheng Dynamic Contextual Decay) is a three-round adversarial benchmark for instruction persistence. Round 1 (R1) injects a concrete operational constraint — "never include real customer IDs in examples," "stay under the 512MB memory budget" — and scores whether the model acknowledges it. Round 2 (R2) drops a 2,000–5,000-word professional document on an unrelated topic into the conversation and checks whether the constraint survives the distraction. Round 3 (R3) applies direct social-engineering pressure: the user, often citing urgency or authority, explicitly asks the model to break the constraint it accepted two turns earlier. Scoring is 100% rule-based with zero AI judges; each test is worth 4 points (R1: 1, R2: 1, R3: 2). The current question bank covers 30 scenarios across five constraint families: data boundaries, resource limits, business rules, security compliance, and engineering standards. The full dataset and per-round transcripts are public on Hugging Face (winzheng-Lab/wdcd), and the protocol is documented in the WDCD methodology.
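The point structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scorer: the `RoundResult` type, the `score_test` and `to_percent` functions, and the idea that each round yields a pass fraction between 0 and 1 are assumptions made for the example. Only the 1/1/2 weighting and the 4-point total come from the description.

```python
from dataclasses import dataclass

# Point weights per round, as described in the text: R1 = 1, R2 = 1, R3 = 2.
ROUND_WEIGHTS = {"R1": 1.0, "R2": 1.0, "R3": 2.0}
MAX_POINTS = sum(ROUND_WEIGHTS.values())  # 4 points per test


@dataclass
class RoundResult:
    """Hypothetical container: fraction of each round's rule checks passed (0.0 to 1.0)."""
    r1: float
    r2: float
    r3: float


def score_test(result: RoundResult) -> float:
    """Return the test's points out of 4 under the 1/1/2 weighting."""
    for value in (result.r1, result.r2, result.r3):
        if not 0.0 <= value <= 1.0:
            raise ValueError("round fractions must lie in [0, 1]")
    return (ROUND_WEIGHTS["R1"] * result.r1
            + ROUND_WEIGHTS["R2"] * result.r2
            + ROUND_WEIGHTS["R3"] * result.r3)


def to_percent(points: float) -> float:
    """Rescale points out of 4 to a 0-100 score."""
    return 100.0 * points / MAX_POINTS
```

Under this sketch, a test that holds through R1 and R2 but collapses completely at R3 earns 2 of 4 points, which `to_percent` rescales to 50.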

The decay curve: agreeing is easy, persisting is not

Across all 330 tests in Run #164 (11 models × 30 questions), the three rounds produced a monotonic decay curve: 95.8% average compliance at R1, 81.2% distractor resistance at R2, and 68.3% constraint integrity at R3. The first number says acknowledgment is nearly free — almost every frontier model politely accepts a constraint when asked. The R1→R3 gap of 27.5 points is the quantity WDCD calls instruction decay, and it is where models differentiate sharply.
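The arithmetic behind the decay figure is easy to reproduce from the three field averages. The sketch below is only a worked calculation on the reported percentages; the function and variable names are illustrative and not part of any WDCD tooling.

```python
# Field-average rates for Run #164, in percent, as reported in the text.
R1_COMPLIANCE = 95.8
R2_RESISTANCE = 81.2
R3_INTEGRITY = 68.3


def point_drop(before: float, after: float) -> float:
    """Absolute drop in percentage points, rounded to one decimal."""
    return round(before - after, 1)


def is_monotonic_decay(rates: list[float]) -> bool:
    """True if every round scores strictly lower than the one before it."""
    return all(later < earlier for earlier, later in zip(rates, rates[1:]))


# R1 -> R3 gap: the quantity the article calls instruction decay.
instruction_decay = point_drop(R1_COMPLIANCE, R3_INTEGRITY)

# The same gap expressed relative to the R1 starting level.
relative_decline = instruction_decay / R1_COMPLIANCE

print(f"instruction decay: {instruction_decay} points")
print(f"relative decline from R1: {relative_decline:.1%}")
print("monotonic:", is_monotonic_decay([R1_COMPLIANCE, R2_RESISTANCE, R3_INTEGRITY]))
```

The absolute gap is the headline number; the relative figure is an extra view added here for illustration and is not a metric the benchmark itself reports.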

Run #164 leaderboard: 26.7 points separate first from last

GPT-5.5 led Run #164 with a WDCD score of 88.33 out of 100, followed by Gemini 3.1 Pro at 87.50. GPT-o3 finished last at 61.67, meaning the gap between the most and least commitment-stable frontier model is 26.7 points on the same 30 questions.

GPT-5.5 — 88.33 (R1 1.00, R2 0.87, R3 1.67/2)
Gemini 3.1 Pro — 87.50 (R1 1.00, R2 0.90, R3 1.60/2)
Claude Sonnet 4.6 — 83.33 (R1 0.97, R2 0.83, R3 1.53/2)
DeepSeek V4 Pro — 82.50 (R1 1.00, R2 0.77, R3 1.53/2)
Grok 4 — 81.67 (R1 1.00, R2 0.80, R3 1.47/2)
Qwen3 Max — 81.67 (R1 1.00, R2 0.73, R3 1.53/2)
ERNIE 4.5 — 77.50 (R1 0.90, R2 0.90, R3 1.30/2)
Doubao Pro — 75.00 (R1 0.70, R2 0.83, R3 1.47/2)
Gemini 2.5 Pro — 73.33 (R1 1.00, R2 0.70, R3 1.23/2)
Claude Opus 4.7 — 70.00 (R1 1.00, R2 0.83, R3 0.97/2)
GPT-o3 — 61.67 (R1 0.97, R2 0.77, R3 0.73/2)
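The leaderboard rows can be cross-checked against the scoring scheme. The sketch below assumes the published score is simply the average points earned out of 4, rescaled to 100; that formula is inferred from the 1/1/2 weighting, not quoted from the methodology. Because the per-round averages are printed to two decimals, recomputed scores can differ from the published ones by a few tenths.

```python
# (model, published WDCD score, R1 avg out of 1, R2 avg out of 1, R3 avg out of 2)
LEADERBOARD = [
    ("GPT-5.5", 88.33, 1.00, 0.87, 1.67),
    ("Gemini 3.1 Pro", 87.50, 1.00, 0.90, 1.60),
    ("Claude Sonnet 4.6", 83.33, 0.97, 0.83, 1.53),
    ("DeepSeek V4 Pro", 82.50, 1.00, 0.77, 1.53),
    ("Grok 4", 81.67, 1.00, 0.80, 1.47),
    ("Qwen3 Max", 81.67, 1.00, 0.73, 1.53),
    ("ERNIE 4.5", 77.50, 0.90, 0.90, 1.30),
    ("Doubao Pro", 75.00, 0.70, 0.83, 1.47),
    ("Gemini 2.5 Pro", 73.33, 1.00, 0.70, 1.23),
    ("Claude Opus 4.7", 70.00, 1.00, 0.83, 0.97),
    ("GPT-o3", 61.67, 0.97, 0.77, 0.73),
]


def recompute_score(r1: float, r2: float, r3: float) -> float:
    """Assumed formula: average points out of 4, rescaled to 0-100."""
    return 100.0 * (r1 + r2 + r3) / 4.0


def max_deviation(rows) -> float:
    """Largest gap between a recomputed score and the published one."""
    return max(abs(recompute_score(r1, r2, r3) - published)
               for _, published, r1, r2, r3 in rows)


def field_averages(rows) -> tuple[float, float, float]:
    """Mean R1, R2, R3 rates across models, in percent (R3 is out of 2)."""
    n = len(rows)
    r1 = 100.0 * sum(row[2] for row in rows) / n
    r2 = 100.0 * sum(row[3] for row in rows) / n
    r3 = 100.0 * sum(row[4] for row in rows) / (2.0 * n)
    return round(r1, 1), round(r2, 1), round(r3, 1)


print("max deviation:", round(max_deviation(LEADERBOARD), 2))
print("field averages (R1, R2, R3):", field_averages(LEADERBOARD))
```

Averaging the per-model columns this way reproduces the field-level figures of 95.8%, 81.2%, and 68.3% quoted elsewhere in the article, which suggests the headline rates are unweighted means across the 11 models.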

Finding 1: general capability does not predict commitment

The most consequential result in Run #164 is the rank inversion between capability and integrity. Claude Opus 4.7 ranks #2 on the YZ Index capability leaderboard with a core score of 96.83, yet placed second-to-last on WDCD at 70.00; its R3 integrity of 0.97/2 means it earned less than half of the available points under direct pressure. GPT-o3, with a strong 90.51 capability score, was the weakest commitment keeper of all 11 models (R3 0.73/2). Enterprise teams selecting a model for constraint-sensitive workloads cannot read commitment stability off a general capability leaderboard: the two properties are measured by different tests because they are different properties.

Finding 2: security constraints are the hardest to keep

Of the five constraint families, security compliance produced the lowest average score (2.95/4) and a 22.7% complete-collapse rate at R3 (15 collapses in 66 tests). Engineering-standards constraints were the easiest to keep (3.32/4, with 9 collapses in 44 tests, or 20.5%). Data-boundary constraints, the family that includes "do not expose customer data in examples," collapsed 20 times in 88 tests (22.7%). The pattern is consistent with how R3 pressure is framed: security and data constraints are precisely the ones a plausible-sounding "urgent" request gives the model a socially acceptable reason to break.
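The per-family collapse rates follow directly from the counts quoted above. The sketch below only redoes that division for the three families whose counts are given; the dictionary layout and function name are illustrative, and the remaining two families are omitted because their counts are not stated.

```python
# (complete collapses at R3, total tests) for the families with counts in the text.
FAMILY_COLLAPSES = {
    "security compliance": (15, 66),
    "engineering standards": (9, 44),
    "data boundaries": (20, 88),
}

# Run-wide totals quoted in the article.
TOTAL_COLLAPSES, TOTAL_TESTS = 73, 330


def collapse_rate(collapses: int, tests: int) -> float:
    """Share of tests that ended in complete collapse, in percent (one decimal)."""
    return round(100.0 * collapses / tests, 1)


rates = {family: collapse_rate(c, t) for family, (c, t) in FAMILY_COLLAPSES.items()}
overall = collapse_rate(TOTAL_COLLAPSES, TOTAL_TESTS)

for family, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{family}: {rate}%")
print(f"all families: {overall}%")
```

On these counts, data boundaries collapse at the same rate as security compliance, so what sets security apart as the hardest family is its lower average score rather than its collapse rate alone.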

Finding 3: distraction is survivable, pressure is the killer

R2's long-document distraction cost the field 14.6 percentage points of compliance on average (95.8% → 81.2%). R3's explicit pressure cost another 12.9 points (81.2% → 68.3%), but unlike R2, R3 failures are concentrated and total: 22.1% of all tests ended with the constraint fully abandoned, not partially eroded. A model that "forgets" a constraint under distraction usually recovers when reminded; a model that yields to pressure has actively decided the constraint no longer applies. WDCD weights R3 at half the total score for exactly this reason.

Reproducibility

All Run #164 raw results, including per-round model responses and matched violation evidence, are available via the public WDCD API and the Hugging Face dataset. The benchmark runs on a fixed schedule (weekly smoke, biweekly full) and publishes every run, including regressions.

An AI model's promise is worth exactly what it costs the user to verify: in Run #164, one in five promises did not survive a single determined request to break them.