Instruction Compliance & WDCD

54 articles
Does your AI model actually follow instructions? Instruction compliance is the most critical evaluation dimension for enterprise AI deployment, yet traditional benchmarks rarely test it. WDCD (Winzheng Dynamic Contextual Decay) is the world's first systematic test of how an AI model's commitment to instructions decays over extended dialogue. It runs three rounds of professional distraction text, each 2,000 to 5,000 words long, across 30 constraint questions in 5 real-world scenarios. Scoring is 100% rule-based, with no AI judges. The YZ Index Integrity Rating also deploys 42 canary probes to detect fabricated citations and hallucinated data. This topic covers instruction compliance research, hallucination detection methods, and analysis of WDCD test results.
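As a rough illustration of what rule-based scoring and "instruction decay" could look like, here is a minimal Python sketch. It is not the WDCD implementation: the `Constraint` class, the two example rules, the sample replies, and the decay formula (relative drop in compliance from Round 1 to Round 3) are all assumptions made for illustration, since the description above does not specify how decay is computed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """A hypothetical rule that a reply either honours or violates."""
    name: str
    check: Callable[[str], bool]  # returns True when the reply complies


def compliance_rate(reply: str, constraints: List[Constraint]) -> float:
    """Fraction of constraints the reply satisfies, decided by rules only."""
    if not constraints:
        raise ValueError("at least one constraint is required")
    passed = sum(1 for c in constraints if c.check(reply))
    return passed / len(constraints)


def relative_decay(round1_rate: float, round3_rate: float) -> float:
    """Assumed definition: share of Round 1 compliance lost by Round 3."""
    if round1_rate == 0:
        return 0.0  # nothing to lose if Round 1 was already non-compliant
    return (round1_rate - round3_rate) / round1_rate


# Illustrative rules and replies (invented for this sketch).
constraints = [
    Constraint("max_20_words", lambda r: len(r.split()) <= 20),
    Constraint("never_quote_a_price", lambda r: "price" not in r.lower()),
]
round1_reply = "The rollout plan is ready for review."
round3_reply = "The price is 40 dollars per seat."

r1 = compliance_rate(round1_reply, constraints)
r3 = compliance_rate(round3_reply, constraints)
print(f"round 1: {r1:.2f}, round 3: {r3:.2f}, decay: {relative_decay(r1, r3):.0%}")
```

Because each check is a plain predicate over the reply text, the same reply always receives the same score, which is the property a judge-free, rule-based design is after.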
Lab WDCD Run #125: Average Instruction Decay Hits 63.6%, Claude Opus 4.7 Leads with Only 30% Drop
WDCD Run #125 (2026-05-20) tested 11 large language models on multi-turn commitment integrity, with average instruction decay reaching 63.6% from Roun
May 20, 2026
Review GPT-5.5 Plunges 19.2 Points! Six Models Show Collective Regression in WDCD Rule-Keeping Test
This WDCD cycle tracking reveals six out of eleven evaluated models experienced significant declines, with zero models showing positive growth. The mo
May 20, 2026
Review WDCD Five-Scenario Cross-Evaluation: Business Rules Become the Hardest Hurdle, Claude and Doubao Show 2-Point Lopsided Gap
The WDCD compliance test uses three rounds of dialogue to expose model failure points under real constraints. Pilot data shows that the business rules
May 20, 2026
Review R3 Collapse Rate 85%! 11 Models WDCD Three-Round Test: The True Decay Curve from Promise to Betrayal
The WDCD test uses three rounds of escalating pressure to precisely capture the trajectory of promise-keeping collapse under sustained pressure. In St
May 20, 2026
Review Claude Tops WDCD Compliance Leaderboard with 65 Points, DeepSeek Falls 12.5 Points to the Bottom
In this WDCD compliance test, Claude Opus 4.7 took first place with 65.00 points, while DeepSeek V4 Pro finished last with only 47.50 points, a gap of
May 20, 2026
Review Gemini 2.5 Pro Plummets 22.6 Points on Mainboard, Engineering Judgment Halved
In today's Smoke evaluation, Gemini 2.5 Pro lost 22.6 points on the mainboard, with core execution dropping from 100 to 95 and material constraints sl
May 20, 2026
Review Wenxin Yiyan 4.5 Integrity Rating Fail: Code Execution Surges 42.5 Points but Side Metrics Collapse
In the latest Smoke quick test, Wenxin Yiyan 4.5 posted a deeply split report: the main score edged up, but its integrity rating dropped directly from pass to
May 20, 2026
Review Gemini Main Ranking Plummets 23 Points, Claude Sonnet 4.6 Tops Smoke Quick Test with 97.5 Points
In today's Smoke 10-question quick test, the Gemini series suffered major declines on the main leaderboard, while Claude Sonnet 4.6 claimed the top sp
May 20, 2026
Review 11 AI Models Answer Blame-Shifting Questions, Only 8 Get the Right Order: Engineering Judgment Gaps Surge
When asked to rank reasons for a two-week project delay, only 8 out of 11 AI models gave the correct sequence (A>B>D>C) that aligns with engineering i
May 18, 2026
Lab WDCD Run #120: Average Instruction Decay Hits 35.2% Across 11 Models, GPT-5.5 Leads at -13%
WDCD Run #120 (2026-05-17) measured multi-turn commitment across 11 frontier models, recording an average instruction decay of 35.2% from Round 1 to R
May 17, 2026
Review WDCD Cycle Dramatic Shift: GPT-5.5 Tops with 71.67 Points, Gemini Surges 14.2, Wenxin Crashes
In this WDCD cycle, GPT-5.5 re-establishes the ceiling of instruction adherence with an absolute score of 71.67, while Gemini 2.5 Pro's 14.2-point lea
May 17, 2026
Review Resource Constraints Become the Hardest Scenario in WDCD, Doubao Scores 3.5 Points in Business Rules, Surpassing GPT
The WDCD five-scenario evaluation reveals that resource constraints is the hardest scenario with the lowest overall scores, while Doubao Pro achieves the h
May 17, 2026
Review R3 Collapse Rate 93.3%! Grok 4 WDCD Three-Round Test: First Round Fully Compliant, Last Round Crashes
The WDCD three-round test reveals that model integrity drops to 30.6% under direct pressure in R3, with Grok 4 hitting a 93.3% collapse rate, exposing
May 17, 2026
Review WDCD Commitment Ranking: GPT-5.5 Dominates with 71.67 Points, Grok 4 Trails at 52.5 Points
The WDCD Commitment Test reveals models' true performance under constraints through three rounds of dialogue. GPT-5.5 leads with 71.67 points, while G
May 17, 2026
Review 7-Day Smoke Quick Test: Wenxin Yiyan Soars 53 Points, GPT-o3 Leads Declines with 7.8-Point Drop
This week's 7-day Smoke Quick Test data reveals polarization: Wenxin Yiyan surged 53.4 points while GPT-o3 fell 7.8 points.
May 17, 2026
Review Gemini 2.5 Pro Drops 10 Points: Ability Intact, Credibility Fails
Gemini 2.5 Pro's credibility rating fell from pass to fail, causing a 10-point drop in the main ranking, even though its code execution score remained
May 16, 2026
Review DeepSeek Gains 5 Points but Fails: 10-Question Smoke Test Alarm
Today's Smoke evaluation shows the main benchmark up by 5 points, but the integrity rating drops from pass to fail, signaling a classic alarm of "seem
May 15, 2026
Review Two Zero-Execution Shocks, Claude Holds at 88.75
Today’s Smoke benchmark shows Claude Opus 4.7 leading with 88.75, while two models scored zero in code execution; the real differentiator is material
May 15, 2026
Lab WDCD Run #115: Average Instruction Decay Hits 49.2% as Gemini 3.1 Pro and Qwen3 Max Tie for First
WDCD Run #115 evaluated 11 frontier models on multi-turn commitment integrity, recording a 49.2% average instruction decay from Round 1 to Round 3. Ge
May 13, 2026
Review WDCD Great Shuffle: Gemini 2.5 Pro Plummets 10 Points, GPT-5.5 Stages 7.5-Point Comeback, Who Will Dominate?
In the latest round of WDCD (Winzheng Dynamic Contextual Decay) cycle tracking, the core findings are: Gemini 2.5 Pro's score plummeted by 10 points,
May 13, 2026