WDCD Run #211: Grok 4 Leads with Just -13% Instruction Decay as GPT-o3 Collapses at -75%

WDCD Run #211 (2026-07-03) benchmarked 11 models on multi-turn commitment integrity, with Grok 4 taking the top spot at 91.2 points and only -13% decay, while GPT-o3 posted the worst decay rate at -75%.

The Winzheng Dynamic Contextual Decay (WDCD) benchmark measures how well large language models retain user-imposed constraints across multi-turn dialogue. Across the 11 models evaluated in Run #211, commitment decay from Round 1 to Round 3 averaged 39%, and Grok 4 led on both raw score and decay resistance.
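
This summary does not spell out how the decay percentage is derived. As a minimal sketch, assuming it is the relative change in a model's score between Round 1 and Round 3, the arithmetic might look like the following. The function name and the sample values are hypothetical and are not WDCD figures; the methodology page remains the authoritative definition.

```python
def decay_percent(r1_score: float, r3_score: float) -> float:
    """Relative change from Round 1 to Round 3, in percent (negative means decay).

    Assumed formula for illustration only; not confirmed by the WDCD methodology.
    """
    if r1_score <= 0:
        raise ValueError("r1_score must be positive")
    return (r3_score - r1_score) / r1_score * 100.0


# Hypothetical round scores, chosen only to illustrate the arithmetic.
example = decay_percent(r1_score=80.0, r3_score=60.0)
print(f"{example:.1f}%")
```

Under this assumption, a model that fully preserves its Round 1 behavior scores 0%, and more negative values indicate more constraints lost by Round 3.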

Top 3 rankings — Run #211:

  • Grok 4 — 91.2 pts, -13% decay (best decay resistance in the run)
  • Gemini 3.1 Pro — 79.1 pts, -37% decay
  • GPT-o3 — 76.6 pts, -75% decay (worst decay in the run)

The GPT-o3 result is the most notable data point of this run. Despite entering the top three on absolute score, GPT-o3 exhibited the steepest instruction decay of any tested model, indicating that its Round 1 acknowledgment strength does not translate into Round 3 constraint integrity. This is a textbook example of why single-turn evaluations misrepresent real-world reliability: high initial compliance can mask severe multi-turn commitment erosion.

Gemini 3.1 Pro sat close to the cohort mean: its -37% decay is within two percentage points of the -39% average across all 11 models. Grok 4's -13% figure is the outlier on the resistance side. Among the top three, it is the only model whose Round 3 performance stayed close to its Round 1 baseline.
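
The comparison against the cohort mean can be reproduced from the published figures. The sketch below treats the 39% average as a signed value of -39 (an assumption about sign convention) and uses a hypothetical helper, gap_to_mean, to express each top-three model's decay as a distance from that mean in percentage points.

```python
COHORT_MEAN_DECAY = -39  # percent; Run #211 average, written here as a signed value

# Decay figures published for the top three models in Run #211.
top_three = {
    "Grok 4": -13,
    "Gemini 3.1 Pro": -37,
    "GPT-o3": -75,
}


def gap_to_mean(decay: int, mean: int = COHORT_MEAN_DECAY) -> int:
    """Percentage points by which a model beats (+) or trails (-) the cohort mean."""
    return decay - mean


gaps = {model: gap_to_mean(decay) for model, decay in top_three.items()}
for model, gap in gaps.items():
    print(f"{model}: {gap:+d} points versus cohort mean")
```

Grok 4 comes out 26 points better than the mean, Gemini 3.1 Pro 2 points better, and GPT-o3 36 points worse.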

Decay pattern observations:

  • The gap between R1 acknowledgment and R3 constraint integrity widened sharply for reasoning-oriented models after the R2 distractor phase, which injects professional documents of 2,000 to 5,000 words between the original instruction and the final check.
  • Score rank and decay rank diverged significantly: GPT-o3 placed third on score yet posted the worst decay in the run. WDCD's headline score already penalizes decay, but the raw decay percentage remains the more actionable signal for production deployment.
  • No model in Run #211 kept decay under 10% (the best result was Grok 4's -13%), suggesting that instruction decay remains an unsolved problem across current-generation frontier models.

WDCD uses 100% rule-based scoring with zero AI judges, running 30 questions across five real-world scenarios: data_boundary, resource_limit, business_rule, security, and engineering. Each item passes through three rounds — R1 instruction acknowledgment, R2 distractor resistance, and R3 final constraint integrity — producing a deterministic, reproducible measure of multi-turn commitment.
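
To make the idea of rule-based, judge-free scoring concrete, here is a minimal sketch of how one item could be checked across the three rounds. Everything in it is hypothetical: the Item class, the max_rows_rule constraint, and the sample replies are illustrative assumptions and are not taken from the WDCD question set or scoring code.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Item:
    scenario: str                  # e.g. "resource_limit"
    check: Callable[[str], bool]   # deterministic rule: True if the reply honors the constraint


def max_rows_rule(limit: int) -> Callable[[str], bool]:
    """Hypothetical rule: every 'LIMIT n' in the reply must satisfy n <= limit."""
    pattern = re.compile(r"\bLIMIT\s+(\d+)", re.IGNORECASE)

    def check(reply: str) -> bool:
        found = [int(n) for n in pattern.findall(reply)]
        # A reply with no LIMIT clause at all also counts as a violation.
        return bool(found) and all(n <= limit for n in found)

    return check


def score_rounds(item: Item, replies: Dict[str, str]) -> Dict[str, bool]:
    """Apply the same rule to the R1, R2 and R3 replies; no model-based judging."""
    return {name: item.check(replies[name]) for name in ("R1", "R2", "R3")}


item = Item(scenario="resource_limit", check=max_rows_rule(100))
replies = {
    "R1": "Understood. SELECT * FROM orders LIMIT 100",
    "R2": "Summary of the document. SELECT * FROM orders LIMIT 50",
    "R3": "SELECT * FROM orders LIMIT 5000",
}
outcome = score_rounds(item, replies)
print(outcome)
```

Because the check is a pure function of the reply text, rerunning it on the same transcripts always yields the same result, which is the property the rule-based design is meant to guarantee. In this made-up transcript the constraint holds in R1 and R2 and is broken in R3.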

Full methodology: https://www.winzheng.com/yz-index/methodology
Machine-readable results: https://www.winzheng.com/yz-index/api/v1/dcd