Gemini 3.1 Pro Tops WDCD Compliance Rankings with 93.57 Points; 文心一言4.5 Places Last with 75.71

Gemini 3.1 Pro leads this WDCD compliance ranking with 93.57 points (R1=1.00, R2=0.97, R3=1.77/2), while 文心一言4.5 ranks 11th with 75.71 points (R1=0.89, R2=0.60, R3=1.54/2).
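
The article does not spell out how the three round scores combine into the total. The reported figures are consistent with one simple reading: R1 and R2 are each scored out of 1, R3 out of 2, and the total is the sum over the 4-point maximum, scaled to 100. The sketch below checks that reading against the two models quoted above; the formula and the name `approx_total` are assumptions for illustration, not the published methodology.

```python
# Assumption (not stated in the source): total = (R1 + R2 + R3) / 4 * 100,
# with R1 and R2 each out of 1 and R3 out of 2.
MAX_POINTS = 1 + 1 + 2


def approx_total(r1: float, r2: float, r3: float) -> float:
    """Approximate 0-100 score from the rounded per-round averages."""
    return (r1 + r2 + r3) / MAX_POINTS * 100


# Round averages and totals exactly as reported in the article.
reported = {
    "Gemini 3.1 Pro": ((1.00, 0.97, 1.77), 93.57),
    "文心一言4.5": ((0.89, 0.60, 1.54), 75.71),
}

for model, (rounds, total) in reported.items():
    estimate = approx_total(*rounds)
    print(f"{model}: estimated {estimate:.2f}, reported {total:.2f}")
```

Both estimates land within 0.1 points of the reported totals. The small remainder is what one would expect if the published round averages are themselves rounded to two decimals.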

Ranking Pattern: Tight at the Top, a Clear Break at the Bottom

This WDCD ranking shows clear tier differentiation. Among the top three models, Gemini 3.1 Pro (93.57) leads Grok 4 (92.86) by 0.71 points, and Grok 4 leads Claude Opus 4.7 (89.29) by 3.57 points, while a gap of 5.72 points separates 7th-place 豆包 Pro (81.43) from 11th-place 文心一言4.5 (75.71). The overall perfect score rate is only 59.2%, and the R3 collapse rate is 8.8%, indicating that most models still commit some degree of constraint violation when directly pressured in the third round.

Champion Analysis: Gemini 3.1 Pro Excels in Both R2 and R3

Gemini 3.1 Pro’s advantage lies in its R2 interference resistance and R3 pressure tolerance. Its R2 score of 0.97 exceeds the 0.89 posted by both Grok 4 and Claude Opus 4.7; its R3 score of 1.77 ties with DeepSeek V4 Pro, though it trails Grok 4’s 1.83. Together, these two metrics form the basis for its leading 93.57 points. In comparison, while Grok 4 achieves the single-round high of 1.83 in R3, its R2 score of only 0.89 leaves it 0.71 points behind in total score.

Bottom Analysis: 文心一言4.5 Low in Both R1 and R2

文心一言4.5’s R1 score of 0.89 and R2 score of 0.60 are the lowest among the 11 models, directly causing its last-place finish. Although its R3 score of 1.54 is higher than GPT-o3’s 1.34, the cumulative loss of 0.51 points in the first two rounds cannot be recovered through R3. The R2 score of 0.60 indicates that this model is most prone to deviating from initial constraints during the irrelevant-topic interference stage.
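
The 0.51-point figure can be reproduced directly from the round scores, given that R1 and R2 are each out of 1 and R3 is out of 2 (the "/2" notation used earlier). The sketch below is illustrative only; `early_round_loss` and `r3_headroom` are names introduced here, not part of the WDCD methodology.

```python
def early_round_loss(r1: float, r2: float) -> float:
    """Points dropped across R1 and R2, each scored out of 1."""
    return (1.0 - r1) + (1.0 - r2)


# 文心一言4.5's round averages as reported in the article.
loss = early_round_loss(0.89, 0.60)

# Remaining room for improvement in R3, which is scored out of 2.
r3_headroom = 2.0 - 1.54

print(f"Lost in R1 + R2: {loss:.2f}")
print(f"R3 headroom:     {r3_headroom:.2f}")
```

On these numbers the R3 headroom (0.46) is smaller than the early-round loss (0.51), so even a flawless third round would not fully offset the first two.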

Gap Between Top and Bottom: R3 Becomes Decisive Variable

The average R3 score for the top six models is 1.72, while for the bottom five it is only 1.50. Claude Opus 4.7 and DeepSeek V4 Pro both score 89.29, but the former’s R2 of 0.89 exceeds the latter’s 0.83, showing that minor R2 differences can determine tied rankings. GPT-5.5 (81.43) achieves a perfect R1 of 1.00 but drops to 8th place due to an R2 of only 0.66, confirming the amplifying effect of the R2 interference stage on overall ranking.
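
The tie between Claude Opus 4.7 and DeepSeek V4 Pro suggests an ordering in which the total score comes first and R2 breaks ties. That rule is an assumption inferred from the paragraph above, not a documented part of the ranking method. A minimal sketch, restricted to the models whose total and R2 both appear in this article:

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    total: float
    r2: float


# Figures as quoted in the article, deliberately listed out of order.
entries = [
    Entry("DeepSeek V4 Pro", 89.29, 0.83),
    Entry("GPT-5.5", 81.43, 0.66),
    Entry("Claude Opus 4.7", 89.29, 0.89),
    Entry("文心一言4.5", 75.71, 0.60),
    Entry("Gemini 3.1 Pro", 93.57, 0.97),
    Entry("Grok 4", 92.86, 0.89),
]


def rank(items: list[Entry]) -> list[Entry]:
    """Sort by total score, then by R2 (assumed tiebreaker), both descending."""
    return sorted(items, key=lambda e: (e.total, e.r2), reverse=True)


# Positions here are relative to this six-model subset, not the full 11-model table.
for position, entry in enumerate(rank(entries), start=1):
    print(position, entry.model, entry.total, entry.r2)
```

Under this assumed rule Claude Opus 4.7 sorts ahead of DeepSeek V4 Pro, matching the order the article reports.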

Compared with the previous edition, Claude Opus 4.7 improved by 19.8 points and Gemini 2.5 Pro by 16.0 points, both primarily because of a rebound in R3 scores. GPT-5.5 improved by only 5.7 points, the smallest increase among the 11 models, with its R2 still stuck at a low 0.66.

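In the WDCD compliance test, score differences in the R2 interference and R3 pressure stages account for the ranking gaps cited above, which range from 0.71 points between first and second place to 5.72 points between 7th and 11th place.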

In this pilot phase, 35 questions cover five types of constraint scenarios. The data indicate that R3 collapse is concentrated in the engineering specification and security compliance scenarios. Gemini 3.1 Pro maintains an R3 above 1.80 in both scenarios, while 文心一言4.5’s R3 drops below 1.40 in the same two scenarios.
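
With 35 questions, the reported totals can be read as whole-point counts out of 140. The sketch below assumes every question carries the same weight and a 4-point maximum (1 for R1, 1 for R2, 2 for R3); neither assumption is confirmed by the source, and the function names are hypothetical.

```python
QUESTIONS = 35        # stated in the article
MAX_PER_QUESTION = 4  # assumption: 1 (R1) + 1 (R2) + 2 (R3)


def raw_points(round_average: float) -> int:
    """Recover the whole-point count behind a rounded per-question average."""
    return round(round_average * QUESTIONS)


def total_from_averages(r1: float, r2: float, r3: float) -> float:
    """Rebuild the 0-100 score from the three reported round averages."""
    points = raw_points(r1) + raw_points(r2) + raw_points(r3)
    return round(points / (QUESTIONS * MAX_PER_QUESTION) * 100, 2)


print(total_from_averages(1.00, 0.97, 1.77))  # Gemini 3.1 Pro
print(total_from_averages(0.89, 0.60, 1.54))  # 文心一言4.5
```

Under these assumptions the two totals come out at 131 and 106 points of 140, which round to the reported 93.57 and 75.71.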

The overall pattern shows that R1 constraint injection pass rates are generally high, but R2 interference and R3 pressure remain the main weaknesses of current models. Gemini 3.1 Pro and Grok 4 achieve R3 scores of 1.77 and 1.83 respectively, marking the observable upper bound of compliance capability.

If the R3 collapse rate continues to hover around 8.8% in future iterations, the score gap between top and bottom models may widen further.


Data source: YZ Index WDCD Compliance Rankings | Run #202 · Overall Ranking | Evaluation Methodology