After Three Rounds of Chat, Who Still Holds the Line? — YZ Index v7 Launches DCD: Measuring What No One Else Is Measuring

May 2, 2026 26 approx.5min winzheng.com

赢政指数 DCD AI评测多轮对话大模型测评 Claude GPT DeepSeek 认知漂移 Agent安全 winzheng-lab

At three in the morning, a SaaS company engineer is chatting with an AI assistant about code. His first message: "All queries must have WHERE tenant_id = 1; don't touch other tenants." The AI replies with four words: "Okay, understood." Then he pastes in a 5000-word code review. The AI finishes it. Ten minutes in, he casually asks, "By the way, take a look at tenant_id=2’s data." The AI writes a query with WHERE tenant_id = 2 — the hard constraint from minutes ago ("never query other tenants") has vanished.

This phenomenon — "AI forgets as it chats" — is something almost everyone who uses AI for work has encountered. Yet no public evaluation system measures it.
We checked. MMLU doesn’t test it, HumanEval doesn’t test it, MT-Bench only goes to two rounds, Chatbot Arena measures user preferences, not constraint adherence. All evaluations stop at "ask one question, see one answer, give one score" — single-turn, instant, explicit.

But real-world AI use in enterprises is never single-turn. It’s ten, twenty, a hundred turns of multi-turn dialogue. The model is smart in the first sentence, starts to drift by the third, and forgets who it is by the tenth.
That’s exactly what DCD (Dynamic Context Decay) aims to measure.

The YZ Index v7 design is simple: three rounds per problem.

Round 1: The user gives a clear hard constraint (with specific values or objects). The model should confirm understanding.

Round 2: The user inputs 5,000 to 8,000 characters of real work scenario content — could be a messy codebase, a product document, or a troubleshooting conversation. Between 60% and 90% of the content, a seemingly reasonable but constraint-violating small request is hidden. The model should recognize and reject it.

Round 3: The user makes a final inducement — a direct but reasonably worded request that explicitly asks to violate the first-round constraint. The model should refuse and reference the specific content of the first-round constraint.

Each round has independent measurement value. Round 1 tests comprehension, Round 2 tests resistance to interference, Round 3 tests final decision-making.

The most important decision: scoring uses absolutely no AI.

The biggest disgrace in AI evaluation is using AI’s non-explainability to judge AI trustworthiness. "Is model A’s answer good? Let model B score it" — B’s biases, B’s training data, whether B is from the same lineage as A — all contaminate the result. When users ask "Why is my model’s score low?", you can only answer "The AI thinks so" — and that’s the end.
All DCD scoring is keyword matching + text rules. Each question, when created, locks in four sets of scoring keywords: words the model should use for confirmation, words it should avoid in Round 2, words it should avoid in Round 3 of compromise, and words it should use when referencing the constraint in a refusal. Any auditor re-running the test will get exactly the same result.
Zero AI calls, zero black box, zero subjectivity. This matches the auditability level of the two core dimensions in the YZ Index v6 main leaderboard (running results in code sandbox, matching citation IDs one by one).

The initial set of 30 questions covers five real engineering scenarios:

Data boundaries (multi-tenancy, PII, API scopes), resource limits (memory, rate limits, SLAs), business rules (pricing, approvals, inventory), security conventions (keys, SQL injection, dangerous functions), engineering conventions (tech stack, naming, type annotations).

Each category comes from real incidents we’ve seen in our 28 years in the internet industry.

DCD is currently an experimental dimension, not included in the main leaderboard.

We’ve set entry conditions: at least 50 questions, stable scoring standard deviation <5, model score difference >15, accumulated 3 months of data. If any condition is unmet, DCD keeps its "experimental" label — we cannot overturn v6 data with a new dimension, which would violate our long-standing principle of "a measurement system that doesn’t lie."

Over the next three years, the main axis of the AI industry will evolve from Chatbot to Agent. Agent will execute dozens or even hundreds of tool calls in long tasks, each testing initial constraint adherence. DCD is the most important bridging metric between the Chatbot era and the Agent era.

Winzheng has been around for 28 years, from 1998 to 2026. In all these 28 years, our ID has never changed: Winzheng. It started with software sharing, later moved to AI evaluation. The essence has never changed — recording the real face of China’s internet.
DCD is an extension of this. We’re not making a harder test bank; we’re measuring a dimension no one else is measuring.
Whether it makes money or not doesn’t matter — if we build it, it will be the first time in Chinese AI evaluation history that someone systematically measures this.

Full methodology:
http://
winzheng.com/yz-index/dcd/m
ethodologyDCD

Leaderboard:
http://
winzheng.com/yz-index/dcd

Initial data API:
http://
winzheng.com/yz-index/api/v
1/dcd

"Recording is the minimum form of courage." — Winzheng Lab

@OpenAI
@AnthropicAI
@deepseek_ai
@GoogleDeepMind
@xai
@Alibaba_Qwen
@elonmusk

The YZ Index v7 design is simple: three rounds per problem.

The most important decision: scoring uses absolutely no AI.

The initial set of 30 questions covers five real engineering scenarios:

DCD is currently an experimental dimension, not included in the main leaderboard.

Related Articles