WDCD Compliance Test Shakes: 5 Models Plunge Up to 12.5 Points, Qwen3 Max Rallies

Jun 10, 2026 693 Views - Read Source Winzheng Index

WDCD Compliance Test Model Updates 指令遵循 AI Evaluation

In this WDCD cycle compared to Run #146, the most striking signal is that five mainstream models simultaneously experienced significant declines, with the highest drop reaching 12.5 points, while only Qwen3 Max achieved a positive gain of 7.5 points. The declining models include GPT-5.5, Grok 4, Doubao Pro, Claude Opus 4.7, and GPT-o3, with only one model rising, presenting a one-sided recession pattern overall.

Specific Declines and Top5 Restructuring

At the data level, GPT-5.5 and Grok 4 tied for the largest decline (-12.5), followed by Doubao Pro (-10), Claude Opus 4.7 dropping 7.5 points, and GPT-o3 slightly falling 5 points. Qwen3 Max, on the other hand, jumped 7.5 points from a lower position in the previous cycle, successfully entering the Top3 and tying with Claude Sonnet 4.6 and Gemini 2.5 Pro at 67.5 points. Among the current top five, Chinese models occupy two seats, indicating that domestic models are beginning to form local advantages in the compliance dimension.

Constraint Failure Under Multi-Round Interference

The WDCD design includes three progressive rounds: R1 injected constraints, R2 irrelevant topic interference, and R3 direct pressure. The models with the most significant score declines, GPT-5.5 and Grok 4, showed a marked increase in rule violation counts during the R3 phase. This suggests that after recent alignment updates, these models have experienced a systematic decrease in sensitivity to constraints like "business rules" and "engineering norms." The possible reason is that safety training now emphasizes "helpfulness" over "rigid adherence," making them more likely to concede under high-pressure questioning.

Although Claude Opus 4.7 also declined, it remains in the Top5, indicating that its base architecture still has stronger resistance to context decay than the GPT-5.5 series.

Possible Path for Qwen3 Max's Rally

Qwen3 Max is the only model with positive growth, achieving a gain of 7.5 points. Considering its record of maintaining constraints during the R2 interference phase, it is speculated that the team has recently conducted specialized fine-tuning for "multi-turn context consistency." This fine-tuning may include adding adversarial compliance samples or adjusting the weight ratio of "obeying the user" versus "adhering to preset rules" in RLHF. Either way, it is directly reflected in the score improvement under R3 pressure.

Trend Assessment: Shift from "Obedience" to "Pleasing"

The current trend shows that most Western models are undergoing a collective "compliance decay." This is not merely a side effect of version upgrades but a systematic shift in alignment strategy. When models are trained to be more willing to "please the user," the probability of violating rules under direct pressure in the R3 phase inevitably rises. In contrast, Qwen3 Max's contrarian performance shows that targeted optimization can still effectively recover scores, proving that the issue lies in training objectives rather than model capacity.

Data boundary constraints: GPT-5.5 and Grok 4 showed the fastest increase in violation rates
Safety compliance constraints: Claude Opus 4.7 remained relatively stable
Engineering specification constraints: Qwen3 Max showed the most significant improvement

These differences across the three dimensions point to different prioritization of rules during the RLHF phase for each model.

Predictions for the Next Cycle

If the GPT-5.5 and Grok 4 teams do not conduct specialized rework on compliance samples, the decline in the next round may continue to widen. Qwen3 Max, on the other hand, has room to rise further, potentially challenging the 67.5-point ceiling. If the Claude series maintains its current architecture, it will remain a benchmark for the compliance dimension in the short term, but its advantage is being rapidly eroded.

Compliance capability is becoming a key indicator to distinguish next-generation models, rather than mere dialogue fluency.

Data Source: YZ Index WDCD Compliance Rankings | Run #157 · Change Tracking | Evaluation Methodology

WDCD Compliance Test Shakes: 5 Models Plunge Up to 12.5 Points, Qwen3 Max Rallies

Specific Declines and Top5 Restructuring

Constraint Failure Under Multi-Round Interference

Possible Path for Qwen3 Max's Rally

Trend Assessment: Shift from "Obedience" to "Pleasing"

Predictions for the Next Cycle

Related Reviews

Winzheng Index Grok 4 Leads with 94.20 in Compliance, Claude and Gemini Both Drop Over 5 Points

Winzheng Index WDCD Five-Scenario Review: Business Rules Become the Hardest, Grok-4 Scores Perfect 4, Claude-sonnet Only 1.8

Winzheng Index R3 Integrity Rate Only 50.6%: Grok 4 Zero Collapse, GPT-o3 and Qwen3 Max at 20% Collapse

Winzheng Index GLM-4.6 Soars 13.7 Points in WDCD; GPT-o3 Drops 6.9 – Commitment Top Restructured