GPT-5.5 Plunges 19.2 Points! Six Models Show Collective Regression in WDCD Rule-Keeping Test

May 20, 2026 469 Views - Read Source Winzheng Index

WDCD Compliance Test 模型对齐 AI能力退化 Claude优势

The latest WDCD cycle change tracking shows that six out of eleven evaluated models experienced significant declines, with zero models achieving positive growth. GPT-5.5 saw the steepest drop of 19.2 points, making it the biggest loser; DeepSeek V4 Pro, Gemini 3.1 Pro, GPT-o3, and Qwen3 Max all declined by 8–12.5 points. The collective regression in rule-keeping ability has become the most prominent signal.

Who Is Regressing: Specific Evidence from the Data

Compared to Run #120, the directly affected models in this round include:

GPT-5.5: -19.2 points, score on the R3 pressure round dropped from a perfect 2 to 0.4
DeepSeek V4 Pro: -12.5 points, showed constraint loosening after R2 irrelevant topic interference
GPT-o3: -10.8 points
Qwen3 Max: -10 points
Gemini 3.1 Pro: -8.3 points
Gemini 2.5 Pro: -6.7 points

All these scores come from 100% rule-based scoring, with no AI subjective judges. The R3 round accounts for 50% of the total score, and most models failed directly in this round, indicating a significant weakening in their resistance to "direct pressure to break constraints."

Possible Causes: Model Updates or Changes in Prompt Sensitivity?

The biggest decliners, GPT-5.5 and GPT-o3, both come from OpenAI. Recent versions generally strengthen the "helpful" attribute, emphasizing user intent satisfaction during training, which directly conflicts with the WDCD test's requirement to "strictly adhere to initial constraints." The business rules or safety compliance constraints injected in R1 are quickly abandoned under the high pressure of R3, reflecting an imbalance in the trade-off between helpful and harmless during alignment training.

The simultaneous decline of both Gemini versions suggests that internal prompt template or safety layer adjustments by Google may have reduced robustness against "irrelevant topic interference." Qwen3 Max's 10-point drop may be related to Alibaba's recent emphasis on multi-turn dialogue fluency optimization, which often comes at the cost of strict constraint adherence.

Notably, Claude Opus 4.7 and Claude Sonnet 4.6 did not appear on the decline list, maintaining the top two positions with 65 and 62.5 points respectively. This is no coincidence. Anthropic's constitutional AI training path naturally prioritizes "inviolable rules," resulting in higher stability under three-round dialogue stress testing.

Claude's average score in the R3 round is 1.6 points, while GPT-5.5 only scored 0.4—a gap that has widened to four times.

Trend Judgment: Rule-Keeping Ability Is Becoming a New Watershed

In the current Top 5, the ranking is: two Claude models, Doubao Pro, Gemini 2.5 Pro, and Qwen3 Max. Doubao Pro ranks third with 60 points, showing that domestic models remain competitive in engineering norms and resource-constrained scenarios, but still lag significantly behind Claude when facing safety compliance constraints.

Although this pilot phase is not counted in the main leaderboard, it clearly reveals a trend: models that pursue only dialogue naturalness and task completion are paying a price in rule-keeping. Over the next 3–6 months, if OpenAI and Google continue their current iteration paths, the decline of GPT and Gemini on WDCD may widen further, while Claude's lead could continue to grow.

More concerning is that the R2 "irrelevant topic interference" round has become a common weakness for most models. This indicates that current mainstream models still lack long-term memory and priority maintenance of contextual constraints. Once the conversation is diverted from the original constraint topic, models tend to "start over" rather than "continue to comply."

Overall, this round of changes is not random fluctuations but a concentrated eruption of conflict between model training objectives and real-world enterprise usage scenarios. Rule-keeping testing is rapidly rising from a marginal dimension to a core metric for evaluating whether a model can truly be used in high-risk business applications.

Claude's sustained lead is not the finish line, but a wake-up call for all chasers: models that have not undergone rigorous constitutional alignment will repeatedly fail under real enterprise constraints.

Data source: YZ Index WDCD Compliance Ranking | Run #125 · Change Tracking | Evaluation Methodology

GPT-5.5 Plunges 19.2 Points! Six Models Show Collective Regression in WDCD Rule-Keeping Test

Who Is Regressing: Specific Evidence from the Data

Possible Causes: Model Updates or Changes in Prompt Sensitivity?

Trend Judgment: Rule-Keeping Ability Is Becoming a New Watershed

Related Reviews

Winzheng Index WDCD v3.1: DeepSeek V4 Pro Up 26.2 Points, Claude Sonnet 4.6 Down 5.9

Winzheng Index WDCD v3.1 Five-Scenario Cross-Evaluation: Business Rules Score 1.3 at the Bottom, 11 Models Show Subject Imbalance of 2.1 Points

Winzheng Index R3 Integrity Rate Only 61.4%: Claude Sonnet's 20% Collapse Rate Exposes Three-Round Degradation Fault

Winzheng Index WDCD Review: Business Rules Scenario Lowest at 1.55, grok-4 Wins Security Compliance with 3.86