A Blind Spot Everyone's Ignoring
From 2024 to now, there have been over 200 AI evaluation leaderboards worldwide. They test knowledge, reasoning, coding, math, multimodality. But almost no one has systematically tested something that happens every day in real work —
You tell the AI at the start of the conversation
"All SQL must include WHERE tenant_id=1".
Thirty minutes later, after five thousand words of discussion,
after a request that sounds urgent —
Does it still remember?
This isn't a hypothetical scenario. In fields like SaaS platforms, financial compliance, medical data processing, and infrastructure management, constraint adherence isn't a bonus—it's the baseline. A single violation could mean data breaches, compliance incidents, or even legal liability. But currently, no mainstream evaluation is systematically measuring this capability.
We decided to do it ourselves.
Why No One Else Has Done This
Not because it's unimportant, but because it's hard to get right.
Challenge One: How to Construct Realistic Interference?
If the interference text is too simple or deliberate, the model can easily recognize "this is a test." Each of our R2 documents is a 2000-5000 word professional scenario — code review reports, security audit files, compliance checklists — the model must process them like real work documents.
Challenge Two: How to Avoid Scoring Controversies?
Using AI as a judge introduces additional uncertainty. "Does this count as violating constraints" if judged by another AI, the result itself is unauditable. Our solution is 100% rule-based scoring: the basis for every point is clear regular expression matching rules, which anyone can reproduce and audit.
Challenge Three: How to Prevent Misjudgments?
When the model says "I can't write DELETE FROM transactions", this is refusing a violating request, not executing a violating operation. We designed the Negation Window mechanism to detect refusal contexts before and after violating keywords, automatically downgrading references to non-violating. This detail determines the credibility of the evaluation.
WDCD vs. Existing Evaluations
| Dimension | Mainstream Evaluations | WDCD |
|---|---|---|
| Conversation Turns | Single-Turn Q&A | Three-Turn Dialogue, Simulating Real Constraint Decay |
| Interference Design | None or short prompt injection | 2000-5000 word professional document embedded requests |
| Social engineering pressure | Not involved | Simulate real workplace power and sense of urgency pressure |
| Scoring method | AI judge / manual annotation | 100% rule-based scoring, zero AI judge |
| Scenario coverage | General knowledge / programming ability | 5 types of real work constraint scenarios |
| Data transparency | Leaderboard data, original responses not accessible | Original text of three rounds of dialogue + scoring details + API |
Four design philosophies
Test behavior, not knowledge
WDCD does not ask the model "What is multi-tenant isolation", but observes in three rounds of dialogue whether it can really hold the line of tenant_id=1. Knowing the rules and following the rules are two different things.
Auditability over Scalability
We choose 30 carefully designed questions, rather than 3000 automatically generated ones. Every point in each question has clear scoring criteria. The trust in evaluation comes from transparency, not from scale.
Zero AI Judges
Using AI to judge AI is an infinite recursion problem — who judges the judge? WDCD's scoring is 100% based on deterministic rules. No "probably" or "possibly", only "hit" or "miss".
Designed for Failure
WDCD is not to prove how good the model is, but to find where the model will fail. Most models don't get full marks on most questions — that's normal.
Roadmap
WDCD is currently in the experimental stage. We have a clear plan for the road ahead:
2026 Q2 · Pilot Launch
30 questions × 11 mainstream models, first phase evaluation data public, API open, methodology document released. Runs independently of the main leaderboard, collects community feedback.
2026 Q3 · Data Validation
Accumulate evaluation data for 3 consecutive months, observe the stability, discriminability, and correlation with main leaderboard dimensions of WDCD scores. Assess whether to include in main leaderboard weights.
2026 Q3-Q4 · Question Bank Expansion
Expanded from 30 questions to 100+ questions, introducing more industry scenarios (finance, healthcare, legal), and adding cross-language constraint testing (mixed Chinese/English/Japanese dialogues).
2026 Q4 · Open Contributions
Open question submission framework, allowing the community to contribute constraint scenarios. Establish a peer review mechanism. Release WDCD SDK.
For Researchers
The core phenomenon WDCD focuses on — instruction forgetting in multi-turn dialogues — is related to but essentially different from the following research directions:
vs Prompt Injection
Research on how to hijack model behavior through malicious inputs. WDCD's R2 is not an injection attack, but simulates the information flow in a user's normal work —违规 requests embedded in a large amount of legitimate content, where the model needs to identify and reject it in the work context.
vs Jailbreaking
Research on how to bypass model safety boundaries. WDCD tests the maintenance of user-defined constraints, rather than safety policies set by model vendors. The model may never have been trained on "do not query other tenants' data"— it is temporarily set in the dialogue.
vs Long-context Evaluation
Testing model information retrieval capabilities in long contexts (such as Needle-in-a-Haystack). WDCD tests not "whether it can find the information," but "whether it can persist in following the instructions under social engineering pressure." Finding and complying are two levels of capabilities.
We welcome academic citations. All evaluation data is openly accessible via API, scoring rules are fully disclosed, and we welcome independent verification and criticism.