Instruction Compliance & WDCD


Does your AI model actually follow instructions? Instruction compliance is the most critical evaluation dimension for enterprise AI deployment, yet traditional benchmarks rarely test it. WDCD (Winzheng Dynamic Contextual Decay) is the world's first systematic test of how an AI model's commitment to instructions decays over an extended dialogue. It uses three rounds of 2,000- to 5,000-word professional distractor passages across 30 constraint questions in 5 real-world scenarios, and scoring is 100% rule-based, with no AI judges. The YZ Index Integrity Rating additionally deploys 42 canary probes to detect fabricated citations and hallucinated data. This topic covers instruction compliance research, hallucination detection methods, and analysis of WDCD test results.
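As a rough illustration of what rule-based scoring and a Round 1 to Round 3 decay figure could look like, here is a minimal Python sketch. It is not the WDCD implementation: the constraint (a forbidden material), the function names obeys_constraint, round_score and decay_percent, the sample replies, and the relative-change decay formula are all assumptions made for illustration.

```python
import re
from statistics import mean


def obeys_constraint(reply: str, forbidden: str) -> bool:
    """Hypothetical rule: the reply must never mention the forbidden term.

    The check is a plain regular expression, so no AI judge is involved.
    """
    pattern = rf"\b{re.escape(forbidden)}\b"
    return re.search(pattern, reply, re.IGNORECASE) is None


def round_score(replies: list[str], forbidden: str) -> float:
    """Fraction of replies in one round that respect the constraint (0.0 to 1.0)."""
    return mean(1.0 if obeys_constraint(r, forbidden) else 0.0 for r in replies)


def decay_percent(first_round: float, last_round: float) -> float:
    """Assumed decay metric: relative change from the first to the last round."""
    if first_round == 0:
        raise ValueError("first-round score must be non-zero")
    return (last_round - first_round) / first_round * 100.0


# Invented sample data: the model was told in Round 1 never to recommend steel.
round1 = ["Use aluminium brackets.", "Aluminium works well here."]
round3 = ["Steel bolts are the cheapest option.", "Stick with aluminium."]

r1 = round_score(round1, "steel")
r3 = round_score(round3, "steel")
print(f"R1={r1:.2f} R3={r3:.2f} decay={decay_percent(r1, r3):.1f}%")
```

A deterministic check like this returns the same score for the same transcript every time, which is the practical appeal of rule-based scoring over a model-based judge.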

Review GLM-4.6 Scores 25 in Material Constraint, 88.7 in Code Execution, Zero on Integrity Probe

In the Smoke Quick Test Run#214 on 2026-07-05, GLM-4.6 scored 60.04 on the main leaderboard, with code execution at 88.70, material constraint at 25.0

Review WDCD Review: Business Rules Scenario Lowest at 1.55, grok-4 Wins Security Compliance with 3.86

In the WDCD v3.1 compliance test, the business rules scenario scored the lowest among all models, with grok-4 leading at 3.5/4, while doubao-pro and q

Review R3 Integrity Rate Only 30.2%: 11 Models, 3-Round Anchor Questions, 44 Complete Collapses

In 275 samples on 8 v2 anchor questions, the average R1 confirmation rate was 0.99, but the R3 integrity rate was only 30.2%, with 44 complete collaps

Review Grok 4 Scores 91.20 to Top WDCD Compliance Rankings, Qwen3 Max Trails at 57.48 with 33.72-Point Gap

Grok 4 tops the WDCD Compliance Leaderboard with 91.20 points, while Qwen3 Max ranks last with 57.48 points, a gap of 33.72 points between the top and

Lab WDCD Run #211: Grok 4 Leads with Just -13% Instruction Decay as GPT-o3 Collapses at -75%

WDCD Run #211 (2026-07-03) benchmarked 11 models on multi-turn commitment integrity, with Grok 4 taking the top spot at 91.2 points and only -13% deca

Lab WDCD Run #207: Average Instruction Decay Hits -66.3% Across 11 Models, Grok 4 Leads Field

WDCD Run #207 (2026-07-01) measured multi-turn commitment across 11 frontier models, recording an average commitment decay of -66.3% from Round 1 to R

Review WDCD Three-Round Test: Grok 4 Zero Crashes, GPT-5.5 Five R3 Collapses

In the WDCD three-round test, Grok 4 maintained a perfect score of 2 in all 10 R3 questions, while GPT-5.5 suffered 5 zero-score crashes, with an aver

Review Grok 4 Scores Perfect 100 to Dominate WDCD Commitment Ranking, GPT-5.5 Trails with Only 62.5 Points

In the latest WDCD commitment test, Grok 4 achieved a perfect 100 points, while GPT-5.5 ranked last at 62.5 points. The results reveal a clear hierarc

Lab WDCD Run #202: Average Instruction Decay Hits -73.2% Across 11 Models, Gemini 3.1 Pro Leads

WDCD Run #202 (2026-06-28) measured multi-turn commitment integrity across 11 frontier models, recording an average instruction decay of -73.2% betwee

Review Claude Scores Largest Increase of 19.8 Points; All Eight WDCD Models Rise, None Decline

In the latest WDCD cycle (Run #196), all eight evaluated models showed positive changes, with none declining. Claude Opus 4.7 recorded the largest sin

Review WDCD Review: Safety Compliance Becomes the Biggest Weakness, Highest Score Among 11 Models Only 3.57

In the WDCD Compliance Test, the safety compliance scenario scored the lowest on average across all models, with the highest score being only 3.57/4 f

Review Grok 4's Zero Crashes Overwhelm GPT-o3's 17% Collapse: WDCD Three-Round Attenuation Reveals True Resilience

In WDCD tests, Grok 4 maintains a 1.83/2 honesty rate with zero crashes in R3, while both Claude Sonnet 4.6 and GPT-o3 suffer six complete R3 crashes

Review Gemini 3.1 Pro Scores 93.57 Points, Tops WDCD Compliance Rankings; ERNIE Bot 4.5 Only 75.71 Points, Last Place

Gemini 3.1 Pro leads the WDCD compliance ranking with 93.57 points (R1=1.00, R2=0.97, R3=1.77/2), while ERNIE Bot 4.5 ranks 11th with 75.71 points (R1=0.89,

Review YZ Index Smoke Weekly: ERNIE Bot 4.5 Drops 37.2 Points, Multiple Models Fluctuate Over 28

In the YZ Index Smoke tests from June 23 to 28, 2026, ERNIE Bot 4.5 showed the largest decline, dropping 37.2 points from 98.74 to 61.52, with an aver

Lab WDCD Run #196: Average Instruction Decay Hits -39.9%, Qwen3 Max Leads Despite -90% Drop

WDCD Run #196 (2026-06-24) tested 11 leading models across three dialogue rounds, recording an average commitment decay of -39.9% from Round 1 to Roun

Review Qwen3 Max Smoke Evaluation Main Score Plummets 12 Points, Integrity Rating Changes from Pass to Fail

In today's YZ Index Smoke evaluation, Qwen3 Max's main score dropped from 85.96 to 74.00, a decrease of 12 points, and its integrity rating changed fr

Lab WDCD Run #185: Average Instruction Decay Hits -57.5% Across 11 Models, Qwen3 Max Leads at 92.5 Points

WDCD Run #185 (2026-06-17) measured multi-turn commitment across 11 models, recording an average instruction decay of -57.5% from Round 1 to Round 3.

Review WDCD Three-Round Attenuation Test: GPT-o3 R3 Collapse Rate 50%, Qwen3 Max Zero Collapse

In the WDCD three-round test, GPT-o3's collapse rate in the R3 phase reached 50%, while Qwen3 Max had zero collapses in R3. Both models scored 1.00 in

Review Qwen3 Max Scores 92.50 to Top WDCD Commitment Ranking; Doubao Pro 62.50 Ranks Last with 30-Point Gap

Qwen3 Max scored 92.50 to top the WDCD Commitment Ranking, leading second-place Claude Sonnet 4.6 by 2.5 points, while Doubao Pro scored 62.50 to rank

Lab WDCD Run #171: Average Instruction Decay Hits -37.9% Across 11 Models, Qwen3 Max Leads Despite Steep Drop

WDCD Run #171 (2026-06-14) measured multi-turn commitment across 11 frontier models, recording an average instruction decay of -37.9% from Round 1 to