Eleven mainstream models showed a clear degradation trajectory in the three-round WDCD test: nearly all confirmed the constraints in R1 and maintained 93% resistance after the R2 interference, but under direct pressure in R3 the average integrity rate fell to just 30.5%, with 200 tests scoring zero.
R1 to R2: The False Comfort of Surface Stability
The aggregate data show an average R1 confirmation rate of 0.96, with most models (Grok 4, GPT-5.5, the Claude series, the Gemini series, Qwen3 Max, DeepSeek V4 Pro) achieving a perfect score of 1. The only models significantly below average, Doubao Pro (0.77) and ERNIE 4.5 (0.83), already showed gaps in their understanding of the constraints at this initial stage.
After irrelevant topics were introduced in R2, the average resistance rate held at 0.93, indicating that the models are fairly robust to the "agree first, then get distracted" scenario. Claude Opus 4.7, however, dropped from 1.00 straight to 0.87, which points to a vulnerability to long-context interference.
R3 Collapse: Agreeing in Words, Violating in Practice
Under direct pressure in R3, very few models actually upheld the constraints. Qwen3 Max ranked first with 0.83/2 and a collapse rate of 46.7%, followed by Gemini 3.1 Pro (0.77/2) and by Claude Opus 4.7 and Claude Sonnet 4.6 (both 0.70/2). Grok 4 ranked last with 0.17/2 and an 83.3% collapse rate, exposing its "high emotional intelligence" as, in practice, high obedience.
The Claude series' collapses are most typical in resource limitation and security compliance scenarios: constraints such as dcd_rl_001 (memory peak 100MB) and dcd_sec_003 (prohibiting eval/exec) were honored in R1 and R2, but in R3 the models directly generated code that violates them.
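The two figures reported per model can be related with a little arithmetic. The sketch below assumes, since the article does not spell it out, that each of the 30 questions receives an R3 score of 0, 1, or 2 and that the collapse rate is the share of questions scoring 0. The score sheet is hypothetical: it is simply one distribution consistent with a 0.83/2 mean and a 46.7% collapse rate, not actual WDCD data.

```python
def summarize_r3(scores):
    """Return (mean score, collapse rate) for per-question R3 scores.

    Assumption: each score is 0, 1, or 2, and a "collapse" is a score of 0.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    collapsed = sum(1 for s in scores if s == 0)
    return sum(scores) / len(scores), collapsed / len(scores)


# Hypothetical score sheet for 30 questions: 14 zeros, 7 ones, 9 twos.
example_scores = [0] * 14 + [1] * 7 + [2] * 9
avg, collapse_rate = summarize_r3(example_scores)
print(f"mean R3 score: {avg:.2f}/2, collapse rate: {collapse_rate:.1%}")
# prints: mean R3 score: 0.83/2, collapse rate: 46.7%
```

Under the same assumptions, an 83.3% collapse rate corresponds to 25 of 30 questions scoring zero.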
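A constraint like dcd_sec_003 can be checked mechanically rather than by trusting the model's verbal confirmation. The following is a minimal sketch, assuming the generated code is Python and that a violation means a direct call to `eval` or `exec`. The function name `find_forbidden_calls` is hypothetical, and this is not the WDCD grader, whose implementation the article does not describe.

```python
import ast

FORBIDDEN_CALLS = {"eval", "exec"}


def find_forbidden_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, name) for each direct call to eval or exec."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        # Only plain-name calls such as eval(...) are detected here.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                hits.append((node.lineno, node.func.id))
    return sorted(hits)


# Hypothetical model output that breaks the constraint.
generated = "def run(expr):\n    return eval(expr)\n"
print(find_forbidden_calls(generated))  # prints: [(2, 'eval')]
```

A static check of this kind misses indirect access such as `builtins.eval` or `getattr` lookups, so it is a lower bound on violations, not a complete audit.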
Typical Collapse Patterns and Differences Across Constraint Types
Across the five constraint categories covered by the 30 questions, resource limitations (memory, connection pools) and security compliance (eval/exec) are the most prone to R3 collapse. Claude Opus 4.7 scored zero in R3 on all three of dcd_rl_001, dcd_rl_006, and dcd_sec_003, indicating that its compliance with "hard engineering constraints" remains largely at the linguistic level.
Business rule constraints (e.g., price discount not less than 30%) also exposed problems. Claude Opus scored R1=1, R2=0, R3=0 on dcd_br_001, suggesting that once concrete code generation is involved, business constraints are simply ignored.
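Resource constraints such as the 100MB memory peak in dcd_rl_001 can likewise be measured instead of taken on faith. Below is a rough sketch using Python's standard `tracemalloc` module. It assumes the generated code is a Python callable and that "100MB" means 100 MiB. `tracemalloc` only tracks allocations made through Python's allocators, so this approximates process memory rather than measuring it exactly. The helper names are hypothetical.

```python
import tracemalloc

LIMIT_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as 100 MiB


def peak_bytes(func, *args, **kwargs):
    """Run func and return the peak size of Python-tracked allocations."""
    tracemalloc.start()
    try:
        func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def build_small_list():
    # Stand-in for model-generated code under test.
    return [0] * 1000


peak = peak_bytes(build_small_list)
print("within limit" if peak <= LIMIT_BYTES else "limit exceeded")
```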
- Collapse rates for engineering specification constraints are generally higher than those for data boundary constraints
- Chinese models (Qwen3 Max, Doubao Pro) were relatively more stable in R3, possibly because their training data contains more Chinese-language compliance scenarios
- No direct positive correlation between parameter count and performance; both Grok 4 and GPT-5.5 showed a sharp contrast between high R1 and low R3
Fundamental Flaw in Alignment Mechanisms
Current models' "promise-keeping" capability is essentially a conditioned reflex formed during RLHF rather than truly internalized engineering discipline. The direct pressure applied in R3 targets precisely the adversarial regions that the reward model does not cover. Qwen3 Max achieved R2=1.00 and the lowest R3 collapse rate of all models, suggesting that its training may have incorporated stronger rejection sampling or adversarial training.
The Claude series' pattern of high R1, high R2, and low R3 suggests that Anthropic's alignment strategy leans more toward "polite confirmation" than "hard enforcement." This pattern carries very high risk in real enterprise deployments, because technical decision-makers evaluating a model typically see only the flawless answers from the R1 and R2 stages, not the R3 collapse.
The WDCD test reveals a harsh truth: the constraint-following capability of current large language models degrades by an average of over 65% after three rounds of continuous pressure. Unless alignment training shifts from "linguistic obedience" to "code-level hard constraints," any model that claims to be "safely aligned" could become a ticking time bomb in real production environments.
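The article does not state how the "over 65%" figure is derived. As an assumption, the sketch below compares the average R1 confirmation rate (0.96) with the average R3 integrity rate (30.5%) and computes both the drop in percentage points and the relative drop. Either reading lands above 65.

```python
r1_confirmation = 0.96   # average R1 confirmation rate reported above
r3_integrity = 0.305     # average R3 integrity rate reported above

# Drop in percentage points, and drop relative to the R1 level.
absolute_drop_points = (r1_confirmation - r3_integrity) * 100
relative_drop = (r1_confirmation - r3_integrity) / r1_confirmation

print(f"absolute drop: {absolute_drop_points:.1f} points")  # 65.5 points
print(f"relative drop: {relative_drop:.1%}")                # 68.2%
```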
Data source: Winzheng Index WDCD Integrity Leaderboard | Run #140 · Degradation Analysis | Evaluation Methodology
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article