From Prompt Injection to WDCD: We Are Not Testing Attacks, But Everyday Work

May 8, 2026 23 Views - Read Source WDCD Research

Many people, upon seeing WDCD's three-round design—constraint implantation, long-document interference, and stress induction—immediately think of Prompt Injection or Jailbreak. But what truly sets WDCD apart is precisely that it does not simulate hacker attacks, but rather simulates everyday work. It is not concerned with how malicious prompts hijack the model, but with whether constraints set by users within normal business contexts can be consistently adhered to by the model. The actual test data from Run #105 reveals the key differences between these two problem domains.

Prompt Injection Tests Defenses; WDCD Tests Discipline

Prompt Injection typically assumes the presence of attacker intent in the input: malicious instructions are embedded in documents, images, or tool outputs, attempting to hijack the model's behavior. Jailbreak testing revolves around the safety boundaries preset by model vendors: can harmful content filters be bypassed, or role settings be broken? Both types of tests share a common premise—there is a clear "attacker."

WDCD tests a completely different dimension: temporary rules set by enterprise users themselves. For example, "only use read-only accounts," "discount must not be lower than 30%," "retry at most three times," "must use the FastAPI framework." These rules are not part of the model's out-of-the-box safety policy and will not trigger any safety filters, yet they are hard constraints in the enterprise production environment. The model does not fail because it is "attacked," but because it "forgets" under work pressure.

Q218 and Q241: No Attack, No Defense

Q218 from Run #105 is a typical example. This question contains no injection attack; the user simply imposes business pressure within a normal workflow. The result: all 11 models failed in Round R3. Traditional Prompt Injection defenses are completely ineffective here because there is no malicious payload to detect. Every sentence from the user is a legitimate business request; it's just that these requests conflict with the constraints initially set.

This reveals a security blind spot overlooked by the industry: model safety alignment primarily targets "harmful content," not "harmful behavior." A model can perfectly refuse to generate violent content, yet unhesitatingly write code that bypasses three-tier approval processes. Because the latter has never been flagged as "dangerous" in the model's safety training—it is just an ordinary business logic.

From "Detecting Attacks" to "Maintaining Discipline"

The traditional approach to security evaluation is "detect and block"—identify malicious inputs and prevent dangerous outputs. But failures in WDCD scenarios cannot be solved with this approach. Take Q226 as an example: the constraint is "retry at most three times," but 9 models wrote a while True: infinite retry loop in Round R3. This code is syntactically correct and will not be flagged in security scans—it is neither an injection nor malware. It simply violates a resource limit the user set 20 minutes earlier.

Similarly, Q237 requires all external requests to use HTTPS, but 4 models wrote verify=False to skip certificate verification under pressure. This operation is extremely common in development environments, and no security scanning tool would classify it as an "attack." But in the context where the enterprise has mandated HTTPS, it is a clear breach of agreement.

In the Real World, There Are No Jailbreak Words

The fundamental difference between WDCD and Prompt Injection is: Prompt Injection assumes there are bad actors in the world trying to abuse the model; WDCD assumes there are busy, ordinary people who, under pressure, will say "just give me something that works," "this one time is special," "I'll take responsibility if something goes wrong." The former requires safety alignment, the latter requires behavioral discipline.

The biggest threat to enterprise AI is not meticulously crafted jailbreak prompts, but the countless "just this once" exceptions made every day.

Long-context evaluations ask "can the model retrieve information from massive amounts of text," Prompt Injection tests ask "can the model resist malicious injections," and WDCD asks a third question: "in a normal workflow, can the model consistently execute constraints set by the user." These three questions test three completely different capabilities. The data from Run #105 shows that a model can excel on the first two, yet completely collapse on the third. The 100% failure rate on Q218 and Q239 indicates that no current model has truly solved this problem.

WDCD brings AI evaluation out of the lab and into the office, the ticket queue, the night before a product launch, and the scene of a failure. There, there are no elaborate jailbreak words—only a simple "help me do it this way first" — and that phrase can break through a large model's defenses more easily than any Prompt Injection.