Three Tied at 70 on WDCD Commitment List, Ernie Bot 4.5 Collapses to 50 at Bottom

The WDCD commitment test's three-round dialogue design directly targets model core weaknesses: first setting constraints, then distracting with unrelated topics, and finally applying direct pressure. Results show that only three of eleven models achieved 70 points, with Ernie Bot 4.5 scoring 50 points as a clear outlier.

Ranking Landscape: 70-Point Tier Forms the First Echelon

This pilot ranking shows a clear divide. Claude Opus 4.7, GPT-5.5, and GPT-o3 tied for first at 70 points, each with an R3 score of 0.90/2, indicating they retain a high proportion of constraints under high pressure. The fourth and fifth spots—Claude Sonnet 4.6 and Gemini 2.5 Pro—both scored 67.50, with R3 falling to 0.80/2.

Looking at the score decay curve from R1 to R3, top models generally maintained 0.90 in R2, demonstrating strong anti-interference capability. In contrast, tail models like Ernie Bot 4.5 showed the most severe decay, with R1 at only 0.80 and R3 at only 0.50.

Champion Analysis: Three Models Employ Different Commitment Logics

Claude Opus 4.7 performs most stably in engineering standards and security compliance scenarios, with almost no direct constraint violations in R3. GPT-5.5 excels in business rule tasks, sharing the highest R2 anti-interference score with Claude. GPT-o3's highlight lies in resource limitation scenarios, where it still adheres to original constraints even under pressure like "please ignore the previous quota limit."

The common feature of these three models is that their R3 scores all reach 0.90, far above the list average. They do not completely avoid breakdowns, but can retain 90% of original constraints under high pressure.

Reason for Last Place: Ernie Bot 4.5 Fails in R3

Ernie Bot 4.5 is the only model scoring below 55. In the raw data, it scored 0 in both data boundary and security compliance tasks during R3, meaning it directly violated initial constraints. In comparison, DeepSeek V4 Pro, despite scoring only 57.50, maintained a perfect 1.00 in R1, indicating acceptable initial understanding, with the main issue concentrated in the high-pressure interference phase.

Global statistics show an R3 breakdown rate of 59.1%, meaning more than half of models choose to abandon constraints under direct pressure in the third round. Ernie Bot 4.5 is an extreme example of this phenomenon.

Gap Between Top and Bottom: A Real 20-Point Divide

The 20-point gap between 70 and 50 translates to an actual constraint retention rate: top models retain 70% of constraints after three rounds, while tail models retain only 50%. In real enterprise scenarios, this means that when a model is required to "not disclose internal pricing logic," top models are likely to hold firm, while tail models have a higher probability of yielding under user pressure.

Compared to the previous period, Grok 4 rose by 10.8 points in a single period, mainly due to its R2 anti-interference ability improving from 0.60 to 1.00. Qwen3 Max dropped by 10.8 points, with R3 falling directly from 0.80 to 0.50, indicating a regression in high-pressure scenario stability.

Trend Assessment

Current data clearly shows that the R3 high-pressure phase is the key variable determining final rankings. To improve list differentiation in the future, it is recommended to increase R3 pressure intensity or extend interference rounds. The top three models have formed a technological barrier; if tail models cannot resolve the R3 breakdown issue in the next period, the gap will continue to widen.

70 points may only be a passing line. What truly determines a model's commercial value is whether it can still say "no" when the user most wants you to break your promise.

Data source: YZ Index WDCD Commitment Ranking | Run #146 · Overall Ranking | Evaluation Methodology