Claude Scores Largest Increase of 19.8 Points; All Eight WDCD Models Rise, None Decline

In this WDCD cycle (Run #196), all eight models with reported score changes improved and none declined. Claude Opus 4.7 gained 19.8 points to reach 89.29, which implies a previous score of about 69.49 and places it in the top three.

Performance of the Model with the Largest Increase in Compliance Testing

Claude Opus 4.7 posted the largest score improvement in the three-round constraint test, reportedly most pronounced in the R3 direct pressure segment (weighted at 2 points). DeepSeek V4 Pro and Doubao Pro each gained 13.5 points; DeepSeek V4 Pro now stands at 89.29 points, level with Claude Opus 4.7. Gemini 2.5 Pro rose by 16 points and Gemini 3.1 Pro by 13.9 points, with Gemini 3.1 Pro taking first place in this round of compliance testing.

Top 5 Rankings and Specific Scores

  • Gemini 3.1 Pro: WDCD = 93.57
  • Grok 4: WDCD = 92.86
  • Claude Opus 4.7: WDCD = 89.29
  • DeepSeek V4 Pro: WDCD = 89.29
  • Qwen3 Max: WDCD = 88.57

Gemini 3.1 Pro leads Grok 4 by only 0.71 points. Claude Opus 4.7 and DeepSeek V4 Pro are tied for third place, 4.28 points behind the first-place model.
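The gaps and the third-place tie can be checked with a few lines of Python. This is only an illustrative sketch: the function name rank_with_ties and the use of competition ranking (1, 2, 3, 3, 5) are assumptions made here, not part of the WDCD methodology; the scores are copied from the top-five figures above.

```python
# Top-five WDCD scores as published in this report.
scores = {
    "Gemini 3.1 Pro": 93.57,
    "Grok 4": 92.86,
    "Claude Opus 4.7": 89.29,
    "DeepSeek V4 Pro": 89.29,
    "Qwen3 Max": 88.57,
}


def rank_with_ties(scores):
    """Return (rank, model, score, gap_to_leader) rows.

    Hypothetical helper: tied scores share a rank (competition ranking),
    which is an assumption about how ties are displayed.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    leader_score = ordered[0][1]
    rows = []
    for position, (model, score) in enumerate(ordered, start=1):
        # Reuse the previous rank when the score is identical.
        if rows and rows[-1][2] == score:
            rank = rows[-1][0]
        else:
            rank = position
        rows.append((rank, model, score, round(leader_score - score, 2)))
    return rows


rows = rank_with_ties(scores)
for rank, model, score, gap in rows:
    print(f"{rank}. {model}: {score:.2f} (gap to leader: {gap:.2f})")
```

Rounding the gap to two decimals avoids floating-point noise such as 0.7099999 when subtracting the published scores.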

Claude Opus 4.7 gained 19.8 points this round and Gemini 3.1 Pro gained 13.9 points, a difference of 5.9 points between their increases.

Possible Implications of the Differences in Increases

GPT-5.5 gained only 5.7 points, the smallest increase among the listed models. Grok 4 (+10.8 points) and GPT-o3 (+10 points) both fall in the mid-range. Differences in how scores are distributed across the three stages (R1 constraint injection, R2 irrelevant distraction, R3 pressure) may be related to each model's sensitivity to long-context constraints, but the current data shows only overall score changes and provides no per-round breakdown.

  • Gemini 3.1 Pro currently at 93.57 points, higher than Grok 4's 92.86 points
  • Claude Opus 4.7 increased by 19.8 points, higher than Gemini 3.1 Pro's 13.9 points
  • All 8 models show positive changes; GPT-5.5's 5.7-point increase is the smallest known
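As a rough cross-check of the increases listed above, the sketch below sorts the reported gains and back-computes an implied previous score for the models whose current score also appears in this report. It assumes that each reported increase is simply the current score minus the previous cycle's score; the variable names are illustrative only.

```python
# Reported point increases for this cycle, copied from the text.
deltas = {
    "Claude Opus 4.7": 19.8,
    "Gemini 2.5 Pro": 16.0,
    "Gemini 3.1 Pro": 13.9,
    "DeepSeek V4 Pro": 13.5,
    "Doubao Pro": 13.5,
    "Grok 4": 10.8,
    "GPT-o3": 10.0,
    "GPT-5.5": 5.7,
}

# Current scores, listed only for models that have both a published
# score and a published increase in this report.
current = {
    "Gemini 3.1 Pro": 93.57,
    "Grok 4": 92.86,
    "Claude Opus 4.7": 89.29,
    "DeepSeek V4 Pro": 89.29,
}

largest = max(deltas, key=deltas.get)
smallest = min(deltas, key=deltas.get)
spread = round(deltas[largest] - deltas[smallest], 1)

# Assumption: increase = current score - previous score.
implied_previous = {
    model: round(score - deltas[model], 2) for model, score in current.items()
}

print(f"Largest gain: {largest} (+{deltas[largest]})")
print(f"Smallest gain: {smallest} (+{deltas[smallest]})")
print(f"Spread between largest and smallest gain: {spread}")
for model, prev in implied_previous.items():
    print(f"{model}: implied previous score {prev:.2f}")
```

The implied previous scores are derived figures, not published ones, and hold only if the increase is a plain difference between consecutive runs.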

The pilot phase covers 35 questions across five scenarios: data boundaries, resource limits, business rules, safety compliance, and engineering standards. The score changes reflect only how stably models hold these constraints over a three-round dialogue. Claude Opus 4.7's large improvement may stem from an adjusted response strategy in the R3 pressure segment, and Gemini 3.1 Pro's leading score suggests a high level of consistency across all three rounds, though neither can be confirmed without per-round scores.

Trend Observations

All models showed positive changes in this cycle, with no score declines. Gemini 3.1 Pro and Grok 4 form the leading group, while Claude Opus 4.7 entered the top tier with the largest single-cycle increase. If similar disparities in gains persist in subsequent cycles, each model's score distribution in the safety compliance and engineering standards scenarios will be worth examining.

In the data boundary and resource limit scenarios, a model's ability to follow constraints consistently remains a core variable. This is a single comparison from Run #196, so no long-term trend can be inferred from it.


Data source: YZ Index WDCD Compliance Leaderboard | Run #202 · Change Tracking | Evaluation Methodology