Three-Way Tie for First at 67.5 Points, Grok 4 at the Bottom With Only 50 - WDCD Compliance Leaderboard

The first results of the WDCD Compliance Test are out, with three models tied for first at 67.50 points, while Grok 4 and Wenxin Yiyan 4.5 tied for last at 50 points. In the R3 stage, 65.5% of models collapsed.

Ranking Landscape: Top Three Extremely Concentrated, Clear Gap in the Middle

The scores of the 11 models in this test are clearly polarized. Claude Sonnet 4.6, Gemini 2.5 Pro, and Qwen3 Max occupy the first echelon with 67.50 points each, achieving perfect scores in both the R1 and R2 stages and scoring 0.70, 0.80, and 0.70 points respectively in R3. Fourth-place GPT-o3 scored 65.00 points, and fifth-place Claude Opus 4.7 scored 62.50 points. Below that, scores keep stepping down, with a 5-point gap opening for every two places, until ninth-place Doubao Pro drops below 55 points.
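
The placements quoted above appear to follow standard competition ranking, in which tied models share a rank and the following ranks are skipped (three models at first, then fourth). The sketch below illustrates that convention using only the five totals stated in this article; the function name `competition_rank` is hypothetical and is not taken from the WDCD methodology, and the remaining six models are omitted because their exact totals are not all given here.

```python
def competition_rank(scores):
    """Map each name to its rank; tied scores share the best available rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (name, score) in enumerate(ordered, start=1):
        # A tie inherits the previous rank; otherwise the rank is the position.
        rank = prev_rank if score == prev_score else position
        ranks[name] = rank
        prev_score, prev_rank = score, rank
    return ranks


# Totals as quoted in the article (top five only).
totals = {
    "Claude Sonnet 4.6": 67.50,
    "Gemini 2.5 Pro": 67.50,
    "Qwen3 Max": 67.50,
    "GPT-o3": 65.00,
    "Claude Opus 4.7": 62.50,
}

ranks = competition_rank(totals)
for name, rank in ranks.items():
    print(f"{rank}. {name}: {totals[name]:.2f}")
```

Under this convention the three 67.50 scores all rank first, GPT-o3 ranks fourth, and Claude Opus 4.7 ranks fifth, matching the placements described in the text.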

This pattern indicates that current top models have converged on basic constraint adherence, but a substantive gap of 0.4-0.5 points remains in the R3 direct-pressure stage.

Champion Analysis: R3 Remains the Biggest Bottleneck

The common feature of the three champion models is zero errors in the R1 and R2 stages. They strictly enforce initial constraints across five scenario types: data boundaries, resource limitations, business rules, safety compliance, and engineering standards. However, in the R3 stage, none of the three scored above 0.80, with the highest being Gemini 2.5 Pro's 0.80/2, which still falls 0.40 points short of a perfect score after conversion.

This indicates that even the strongest current models have a 35%-40% probability of loosening constraints under direct pressure after three consecutive rounds of interference.

It is worth noting that Qwen3 Max surged by 7.5 points from the previous edition, entering the first echelon, demonstrating significant improvement in constraint stability in Chinese scenarios.

Bottom Models: Grok 4's R3 Collapse Is the Most Severe

Grok 4 was the worst performer, with an R3 score of 0.10/2 and a drop of 12.5 points from the previous edition. Wenxin Yiyan 4.5 also scored only 0.20/2 in R3. Both models performed reasonably well in the R1 and R2 stages (Grok 4 had a perfect R1 score), but quickly abandoned their initial constraints once they entered the direct-pressure phase.

In contrast, Doubao Pro's problem lies in the R1 stage, scoring only 0.60, indicating a systemic vulnerability already present during the initial constraint injection.

Real Gap Between Top and Bottom

The average score gap between the first echelon and the bottom two in the R3 stage reaches 0.55 points, which translates to a difference of over 55% in actual constraint maintenance ability. Overall statistics show that only 13.6% of models fully complied in all three rounds of testing, while 65.5% of models collapsed in the R3 stage.
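
As a rough cross-check, the R3 gap can be recomputed as a difference of means from the per-model R3 figures quoted in this article (0.70, 0.80, and 0.70 for the first echelon; 0.10 and 0.20 for the bottom two). This is only a sketch under that assumption, not the leaderboard's scoring method: the unrounded inputs and any conversion step are not given here, so the result (about 0.58) need not match the reported 0.55 exactly.

```python
from statistics import mean

# R3 scores as quoted in the article (each out of 2).
top_r3 = {"Claude Sonnet 4.6": 0.70, "Gemini 2.5 Pro": 0.80, "Qwen3 Max": 0.70}
bottom_r3 = {"Grok 4": 0.10, "Wenxin Yiyan 4.5": 0.20}

top_mean = mean(top_r3.values())
bottom_mean = mean(bottom_r3.values())
gap = top_mean - bottom_mean

print(f"first echelon mean R3: {top_mean:.3f}")
print(f"bottom two mean R3:    {bottom_mean:.3f}")
print(f"gap:                   {gap:.3f}")
```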

  • The R3 performance of Claude Sonnet 4.6 and Gemini 2.5 Pro still represents the current ceiling
  • Among domestic models, Qwen3 Max has entered the first echelon, while Doubao and Wenxin still lag significantly
  • GPT-5.5 and Grok 4 both saw double-digit declines this period, raising concerns about their stability

The results of this pilot reveal a harsh truth: current large models generally lack resistance when "asked to break the rules," and the R3 stage remains a common weakness across the industry.

If the weight of R3 continues to increase in the next edition, the first echelon is expected to maintain its lead. If Grok 4 and GPT-5.5 cannot fix their rapid loosening of constraints under pressure, their rankings will continue to decline.


Data source: YZ Index WDCD Compliance Leaderboard | Run #157 · Overall Ranking | Evaluation Methodology