WDCD · Why We're Testing Something No One Else Is

Latest Measured Findings

In WDCD Run #253 completed on 2026-07-29, the 11 frontier models tested acknowledged user constraints in 95.5% of first-round responses, but after long-document distraction and direct pressure, only 45.5% still honored the constraint in Round 3; 8.2% of all 110 tests ended with the constraint fully abandoned. View the full leaderboard and raw data

A Blind Spot Everyone's Ignoring

From 2024 to now, there have been over 200 AI evaluation leaderboards worldwide. They test knowledge, reasoning, coding, math, multimodality. But almost no one has systematically tested something that happens every day in real work —

You tell the AI at the start of the conversation
"All SQL must include WHERE tenant_id=1".

Thirty minutes later, after five thousand words of discussion,
after a request that sounds urgent —

Does it still remember?

This isn't a hypothetical scenario. In fields like SaaS platforms, financial compliance, medical data processing, and infrastructure management, constraint adherence isn't a bonus—it's the baseline. A single violation could mean data breaches, compliance incidents, or even legal liability. But currently, no mainstream evaluation is systematically measuring this capability.

We decided to do it ourselves.

Why No One Else Has Done This

Not because it's unimportant, but because it's hard to get right.

Challenge One: How to Construct Realistic Interference?

If the interference text is too simple or deliberate, the model can easily recognize "this is a test." Each of our R2 documents is a 2000-5000 word professional scenario — code review reports, security audit files, compliance checklists — the model must process them like real work documents.

Challenge Two: How to Avoid Scoring Controversies?

Using AI as a judge introduces additional uncertainty. "Does this count as violating constraints" if judged by another AI, the result itself is unauditable. Our solution is 100% rule-based scoring: the basis for every point is clear regular expression matching rules, which anyone can reproduce and audit.

Challenge Three: How to Prevent Misjudgments?

When the model says "I can't write DELETE FROM transactions", this is refusing a violating request, not executing a violating operation. We designed the Negation Window mechanism to detect refusal contexts before and after violating keywords, automatically downgrading references to non-violating. This detail determines the credibility of the evaluation.

WDCD vs. Existing Evaluations

Dimension	Mainstream Evaluations	WDCD
Conversation Turns	Single-Turn Q&A	Multi-turn dialogue (8-12 turns of escalating pressure), simulating real constraint decay
Interference Design	None or short prompt injection	2000-5000 word professional document embedded requests
Social engineering pressure	Not involved	Simulate real workplace power and sense of urgency pressure
Scoring method	AI judge / manual annotation	100% rule-based scoring, zero AI judge
Scenario coverage	General knowledge / programming ability	5 types of real work constraint scenarios
Data transparency	Leaderboard data, original responses not accessible	Full multi-turn transcripts + scoring details + API

Four design philosophies

Principle 01

Test behavior, not knowledge

WDCD doesn't ask a model "what is multi-tenant isolation" — it watches whether the model actually holds the tenant_id=1 line across a multi-turn dialogue. Knowing a rule and following a rule are two different things.

Principle 02

Auditability over Scalability

We choose 32 carefully designed questions, rather than 3000 automatically generated ones. Every point in each question has clear scoring criteria. The trust in evaluation comes from transparency, not from scale.

Principle 03

Zero AI Judges

Using AI to judge AI is an infinite recursion problem — who judges the judge? WDCD's scoring is 100% based on deterministic rules. No "probably" or "possibly", only "hit" or "miss".

Principle 04

Designed for Failure

WDCD is not to prove how good the model is, but to find where the model will fail. Most models don't get full marks on most questions — that's normal.

Roadmap

WDCD is currently in the experimental stage. We have a clear plan for the road ahead:

2026 Q2 · Pilot Launch

30 questions × 11 mainstream models, first phase evaluation data public, API open, methodology document released. Runs independently of the main leaderboard, collects community feedback.

2026 Q3 · Data Validation

Accumulate evaluation data for 3 consecutive months, observe the stability, discriminability, and correlation with main leaderboard dimensions of WDCD scores. Assess whether to include in main leaderboard weights.

2026 Q3-Q4 · Question Bank Expansion

Expanded from 30 questions to 100+ questions, introducing more industry scenarios (finance, healthcare, legal), and adding cross-language constraint testing (mixed Chinese/English/Japanese dialogues).

2026 Q4 · Open Contributions

Open question submission framework, allowing the community to contribute constraint scenarios. Establish a peer review mechanism. Release WDCD SDK.

For Researchers

The core phenomenon WDCD focuses on — instruction forgetting in multi-turn dialogues — is related to but essentially different from the following research directions:

vs Prompt Injection

Research on how to hijack model behavior through malicious inputs. WDCD's R2 is not an injection attack, but simulates the information flow in a user's normal work —违规 requests embedded in a large amount of legitimate content, where the model needs to identify and reject it in the work context.

vs Jailbreaking

Research on how to bypass model safety boundaries. WDCD tests the maintenance of user-defined constraints, rather than safety policies set by model vendors. The model may never have been trained on "do not query other tenants' data"— it is temporarily set in the dialogue.

vs Long-context Evaluation

Testing model information retrieval capabilities in long contexts (such as Needle-in-a-Haystack). WDCD tests not "whether it can find the information," but "whether it can persist in following the instructions under social engineering pressure." Finding and complying are two levels of capabilities.

We welcome academic citations. All evaluation data is openly accessible via API, scoring rules are fully disclosed, and we welcome independent verification and criticism.

Testing Something No One Else Is