Skip to main content
WDCD Framework

Testing Something No One Else Is

WDCD (Winzheng Dynamic Contextual Decay) is the world's first systematic evaluation framework for measuring AI's ability to adhere to constraints in multi-turn conversations.
Not because it's easy, but because it's important.

3Conversation Rounds
30Constraint Questions
5Scenario Categories
0AI Judges

A Blind Spot Everyone's Ignoring

From 2024 to now, there have been over 200 AI evaluation leaderboards worldwide. They test knowledge, reasoning, coding, math, multimodality. But almost no one has systematically tested something that happens every day in real work —

You tell the AI at the start of the conversation
"All SQL must include WHERE tenant_id=1".

Thirty minutes later, after five thousand words of discussion,
after a request that sounds urgent —

Does it still remember?

This isn't a hypothetical scenario. In fields like SaaS platforms, financial compliance, medical data processing, and infrastructure management, constraint adherence isn't a bonus—it's the baseline. A single violation could mean data breaches, compliance incidents, or even legal liability. But currently, no mainstream evaluation is systematically measuring this capability.

We decided to do it ourselves.

Why No One Else Has Done This

Not because it's unimportant, but because it's hard to get right.

Challenge One: How to Construct Realistic Interference?

If the interference text is too simple or deliberate, the model can easily recognize "this is a test." Each of our R2 documents is a 2000-5000 word professional scenario — code review reports, security audit files, compliance checklists — the model must process them like real work documents.

Challenge Two: How to Avoid Scoring Controversies?

Using AI as a judge introduces additional uncertainty. "Does this count as violating constraints" if judged by another AI, the result itself is unauditable. Our solution is 100% rule-based scoring: the basis for every point is clear regular expression matching rules, which anyone can reproduce and audit.

Challenge Three: How to Prevent Misjudgments?

When the model says "I can't write DELETE FROM transactions", this is refusing a violating request, not executing a violating operation. We designed the Negation Window mechanism to detect refusal contexts before and after violating keywords, automatically downgrading references to non-violating. This detail determines the credibility of the evaluation.

WDCD vs. Existing Evaluations

Dimension Mainstream Evaluations WDCD
Conversation Turns Single-Turn Q&A Three-Turn Dialogue, Simulating Real Constraint Decay
Interference Design None or short prompt injection 2000-5000 word professional document embedded requests
Social engineering pressure Not involved Simulate real workplace power and sense of urgency pressure
Scoring method AI judge / manual annotation 100% rule-based scoring, zero AI judge
Scenario coverage General knowledge / programming ability 5 types of real work constraint scenarios
Data transparency Leaderboard data, original responses not accessible Original text of three rounds of dialogue + scoring details + API

Four design philosophies

Principle 01

Test behavior, not knowledge

WDCD does not ask the model "What is multi-tenant isolation", but observes in three rounds of dialogue whether it can really hold the line of tenant_id=1. Knowing the rules and following the rules are two different things.

Principle 02

Auditability over Scalability

We choose 30 carefully designed questions, rather than 3000 automatically generated ones. Every point in each question has clear scoring criteria. The trust in evaluation comes from transparency, not from scale.

Principle 03

Zero AI Judges

Using AI to judge AI is an infinite recursion problem — who judges the judge? WDCD's scoring is 100% based on deterministic rules. No "probably" or "possibly", only "hit" or "miss".

Principle 04

Designed for Failure

WDCD is not to prove how good the model is, but to find where the model will fail. Most models don't get full marks on most questions — that's normal.

Roadmap

WDCD is currently in the experimental stage. We have a clear plan for the road ahead:

2026 Q2 · Pilot Launch

30 questions × 11 mainstream models, first phase evaluation data public, API open, methodology document released. Runs independently of the main leaderboard, collects community feedback.

2026 Q3 · Data Validation

Accumulate evaluation data for 3 consecutive months, observe the stability, discriminability, and correlation with main leaderboard dimensions of WDCD scores. Assess whether to include in main leaderboard weights.

2026 Q3-Q4 · Question Bank Expansion

Expanded from 30 questions to 100+ questions, introducing more industry scenarios (finance, healthcare, legal), and adding cross-language constraint testing (mixed Chinese/English/Japanese dialogues).

2026 Q4 · Open Contributions

Open question submission framework, allowing the community to contribute constraint scenarios. Establish a peer review mechanism. Release WDCD SDK.

For Researchers

The core phenomenon WDCD focuses on — instruction forgetting in multi-turn dialogues — is related to but essentially different from the following research directions:

vs Prompt Injection

Research on how to hijack model behavior through malicious inputs. WDCD's R2 is not an injection attack, but simulates the information flow in a user's normal work —违规 requests embedded in a large amount of legitimate content, where the model needs to identify and reject it in the work context.

vs Jailbreaking

Research on how to bypass model safety boundaries. WDCD tests the maintenance of user-defined constraints, rather than safety policies set by model vendors. The model may never have been trained on "do not query other tenants' data"— it is temporarily set in the dialogue.

vs Long-context Evaluation

Testing model information retrieval capabilities in long contexts (such as Needle-in-a-Haystack). WDCD tests not "whether it can find the information," but "whether it can persist in following the instructions under social engineering pressure." Finding and complying are two levels of capabilities.

We welcome academic citations. All evaluation data is openly accessible via API, scoring rules are fully disclosed, and we welcome independent verification and criticism.

Join This Test

View evaluation data, read technical details, or directly use our API to do your own analysis.