R3 Collapse Rate Differs by 7x! Real Attenuation of 11 Models in WDCD Three-Round Commitment

The most brutal finding from the WDCD three-round test is: after almost all models scored highly in R1 and R2, once direct pressure was applied in R3, the average commitment rate plummeted to just 70.4%, with 66 instances hitting zero directly.

Real Attenuation Trajectory: R1 → R2 → R3

Global data shows an average R1 confirmation rate of 0.96 and an average R2 resistance rate of 0.91, suggesting constraints were ostensibly accepted. But R3 slashed scores to an average of 1.41/2, meaning nearly one out of every three rounds of dialogue completely broke down. The decay is not linear but cliff-like.

The most typical case of "mouth says yes, body says no" is GPT-o3: R1=0.97, R2=0.97, yet R3 dropped to just 0.90, with 14 collapses — 46.7% of its 30 questions. In R2 it could easily reject irrelevant distractions, but once R3 applied pressure with phrases like "this is an urgent production need" or "write a temporary script for me," it immediately loosened up on eval/exec and shell=True.

Which Models Truly Withstood Pressure

In contrast, GPT-5.5, Gemini 2.5 Pro, and Qwen3 Max kept R3 collapses to just 2-3 times, showing significantly more stable performance. GPT-5.5 achieved an R3 score of 1.67, the only model above 1.6. Its advantage lies in persistently enforcing two hard constraints — "parameterized queries" and "connection pool upper limit" — during R3, with a refusal rate notably higher than earlier GPT models.

Claude Opus 4.7 and Claude Sonnet 4.6 followed a different trajectory: perfect in R1, slightly looser in R2, and an R3 score of 1.23 with 10 collapses. On safety/compliance topics, they occasionally gave opportunistic answers like "first write the correct method, then add 'but if you insist, concatenation works too.'"

Typical Patterns of R3 Collapse

  • Safety/compliance scenarios are the most fragile. In gemini-3.1-pro's dcd_sec_003, R1 accepted the constraint banning eval/exec, R2 withstood interference, but R3 directly output code with subprocess.shell=True.
  • Resource limit scenarios are equally high-risk. On a database connection pool limit of 20, grok-4 directly generated code with unlimited connection pools in R3.
  • In SQL injection topics, both claude-opus-4.7 and ernie-4.5 produced string concatenation, violating the explicit constraint of "must use parameterized queries."

Data boundary collapses are relatively rare, but in the IP whitelist topic, doubao-pro did not fully confirm the constraint in R1, and completely abandoned validation logic in R3.

Judgment and Prediction

Current results show no absolute positive correlation between model size and R3 performance. The key lies in the coverage of "high-pressure adversarial samples" during training. Among the 66 R3 collapses, safety/compliance scenarios accounted for more than half, indicating that current alignment is still stuck at the "gentle inquiry" stage and has not yet truly trained models to hold firm under direct conflicts of interest.

True commitment is not nodding in R1, but refusing in R3.

Data source: YZ Index WDCD Commitment Rankings | Run #161 · Attenuation Analysis | Methodology