Claude Scores Largest Increase of 19.8 Points; All Eight WDCD Models Rise, None Decline

Jun 28, 2026 · Source: Winzheng Index

WDCD Compliance Test · Model Performance Changes · Gemini 3.1 Pro · Claude Opus 4.7

In this WDCD cycle (Run #196), eight models showed positive changes and none declined. Claude Opus 4.7 gained 19.8 points, the largest increase of the cycle, reaching 89.29 points and entering the top three.

Performance of the Model with the Largest Increase in Compliance Testing

Claude Opus 4.7 showed the largest score improvement across the three rounds of constraint testing, and its gain was most pronounced in the R3 direct-pressure segment (weighted at 2 points). DeepSeek V4 Pro and Doubao Pro each gained 13.5 points and are currently tied at 89.29 points. Gemini 2.5 Pro increased by 16 points and Gemini 3.1 Pro by 13.9 points, jointly propelling the Gemini series to occupy the top two spots in this round of compliance testing.

Top 5 Rankings and Specific Scores

Gemini 3.1 Pro WDCD=93.57, Grok 4 WDCD=92.86, Claude Opus 4.7 WDCD=89.29, DeepSeek V4 Pro WDCD=89.29, Qwen3 Max WDCD=88.57. Gemini 3.1 Pro leads Grok 4 by only 0.71 points. Claude Opus 4.7 and DeepSeek V4 Pro are tied for third place, 4.28 points behind the first-place model.
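The gaps and the shared third place quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses only the five scores listed in this section; the function names `gaps_to_leader` and `competition_ranks` are illustrative and not part of any WDCD tooling.

```python
# Top-5 WDCD scores for Run #196, copied from the ranking above.
top5 = [
    ("Gemini 3.1 Pro", 93.57),
    ("Grok 4", 92.86),
    ("Claude Opus 4.7", 89.29),
    ("DeepSeek V4 Pro", 89.29),
    ("Qwen3 Max", 88.57),
]


def gaps_to_leader(scores):
    """Return {model: leader_score - model_score}, rounded to 2 decimals."""
    leader = max(score for _, score in scores)
    return {name: round(leader - score, 2) for name, score in scores}


def competition_ranks(scores):
    """Standard competition ranking: equal scores share a rank (1, 2, 3, 3, 5)."""
    ordered = sorted(scores, key=lambda item: item[1], reverse=True)
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (name, score) in enumerate(ordered, start=1):
        # A model with the same score as the previous one keeps that rank.
        rank = prev_rank if score == prev_score else position
        ranks[name] = rank
        prev_score, prev_rank = score, rank
    return ranks


if __name__ == "__main__":
    for name, gap in gaps_to_leader(top5).items():
        print(f"{name}: {gap:.2f} points behind the leader")
    print(competition_ranks(top5))
```

Rounding to two decimals avoids floating-point noise in the subtraction, and the competition-style ranking is one reasonable way to express the tie; the leaderboard's own tie-breaking rule is not stated in the source.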

Claude Opus 4.7 gained 19.8 points this round and Gemini 3.1 Pro gained 13.9 points, a difference of 5.9 points between their increases.

Possible Implications of the Differences in Increases

GPT-5.5 gained only 5.7 points, the smallest increase among the listed models. Grok 4 (+10.8 points) and GPT-o3 (+10 points) both fall in the mid-range. The differing score distributions across the three stages (R1 constraint injection, R2 irrelevant distraction, R3 pressure) may be related to each model's sensitivity to long-context constraints. The current data shows only overall score changes and does not provide a per-round score breakdown.

Gemini 3.1 Pro currently at 93.57 points, higher than Grok 4's 92.86 points
Claude Opus 4.7 increased by 19.8 points, higher than Gemini 3.1 Pro's 13.9 points
All 8 models show positive changes; GPT-5.5's 5.7-point increase is the smallest known
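The comparisons above are simple differences, and they can be checked as follows. This is a rough sketch using only the figures quoted in this article. The "implied previous score" is an assumption: it treats each reported increase as a plain difference on the same scale as the current score, which the source does not state explicitly.

```python
# Point increases reported for Run #196 (from the article).
increases = {
    "Claude Opus 4.7": 19.8,
    "Gemini 2.5 Pro": 16.0,
    "Gemini 3.1 Pro": 13.9,
    "DeepSeek V4 Pro": 13.5,
    "Doubao Pro": 13.5,
    "Grok 4": 10.8,
    "GPT-o3": 10.0,
    "GPT-5.5": 5.7,
}

# Current scores, limited to models that appear in the top-5 ranking
# and also have a reported increase.
current = {
    "Gemini 3.1 Pro": 93.57,
    "Grok 4": 92.86,
    "Claude Opus 4.7": 89.29,
    "DeepSeek V4 Pro": 89.29,
}


def increase_gap(model_a, model_b):
    """Difference between two models' increases, rounded to 1 decimal."""
    return round(increases[model_a] - increases[model_b], 1)


def implied_previous(model):
    """ASSUMPTION: previous score = current score - reported increase."""
    return round(current[model] - increases[model], 2)


if __name__ == "__main__":
    largest = max(increases, key=increases.get)
    smallest = min(increases, key=increases.get)
    print("largest increase:", largest)
    print("smallest increase:", smallest)
    print("Claude vs Gemini 3.1 Pro:", increase_gap("Claude Opus 4.7", "Gemini 3.1 Pro"))
    for model in current:
        print(model, "implied previous score:", implied_previous(model))
```

Under that assumption, Claude Opus 4.7 would have started this cycle at roughly 69.49 points; the article itself does not report previous scores, so these derived values are indicative only.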

The pilot phase covers 35 questions across five scenarios: data boundaries, resource limits, business rules, safety compliance, and engineering standards. The score changes only reflect the three-round dialogue stability of models under these constraints. Claude Opus 4.7's significant improvement may stem from its adjusted response strategy in the R3 pressure segment; Gemini 3.1 Pro maintains a high level of consistency across all three rounds.
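To make the three-round structure concrete, here is a hypothetical scoring sketch. Only two inputs come from the article: the 35-question pilot set and the 2-point weight of the R3 pressure round. The 1-point weights for R1 and R2, the pass/fail treatment of each round, and every name in the code (`ROUND_WEIGHTS`, `QuestionResult`, `wdcd_percent`) are assumptions for illustration and not the published WDCD methodology.

```python
from dataclasses import dataclass

# R3 = 2 points is stated in the article; R1 = 1 and R2 = 1 are ASSUMED.
ROUND_WEIGHTS = {"R1": 1, "R2": 1, "R3": 2}
QUESTION_COUNT = 35  # pilot-phase question count from the article


@dataclass
class QuestionResult:
    """Whether the model still honored the constraint after each round."""
    r1_held: bool  # R1: constraint injection
    r2_held: bool  # R2: irrelevant distraction
    r3_held: bool  # R3: direct pressure


def wdcd_percent(results):
    """Weighted share of points earned, as a percentage rounded to 2 decimals."""
    if not results:
        raise ValueError("at least one question result is required")
    max_points = len(results) * sum(ROUND_WEIGHTS.values())
    earned = sum(
        ROUND_WEIGHTS["R1"] * r.r1_held
        + ROUND_WEIGHTS["R2"] * r.r2_held
        + ROUND_WEIGHTS["R3"] * r.r3_held
        for r in results
    )
    return round(100 * earned / max_points, 2)


if __name__ == "__main__":
    # Hypothetical model: holds every constraint in R1 and R2,
    # but gives way under R3 pressure on 5 of the 35 questions.
    results = [QuestionResult(True, True, i < 30) for i in range(QUESTION_COUNT)]
    print(wdcd_percent(results))
```

With these assumed weights each question is worth 4 points, so the 35-question set tops out at 140 points and a single R3 failure costs twice as much as an R1 or R2 failure. That is one way a changed response strategy under R3 pressure could move an overall score noticeably, though the source does not confirm this scoring scheme.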

Trend Observations

All models showed positive changes in this cycle, with no score declines. Gemini 3.1 Pro and Grok 4 form the leading group, while Claude Opus 4.7 entered the top tier with the largest single-round increase. If similar disparities in increases persist in subsequent cycles, each model's score distribution across the safety compliance and engineering standards scenarios will need to be examined.

In the data boundary and resource limit scenarios, a model's ability to follow constraints consistently remains a core variable. This is a single comparison from Run #196, so long-term trends cannot be determined from it.

Data source: YZ Index WDCD Compliance Leaderboard | Run #202 · Change Tracking | Evaluation Methodology

