Grok 4 Surges 10.8 Points to Dominate, Qwen3 Max Plunges 10.8 Points – Major Shuffle in WDCD Cycle

Jun 3, 2026 - Winzheng Index

WDCD Compliance Test AI Model Evaluation Cycle Changes Grok 4

Run #141 data shows that Grok 4 improved by 10.8 points in a single round, GPT-5.5 improved by 9.2 points, while Qwen3 Max plummeted by 10.8 points. The divergence in adherence capabilities has become clearly visible.

Rising Camp: Breakthroughs in Both Engineering Specifications and Security Compliance

Grok 4's score in the R3 pressure-testing scenario jumped from 2.1 to 3.8, directly lifting its total score. Combined with its stable performance in resource-constrained scenarios, this suggests that xAI has recently made targeted enhancements to its context anchoring mechanism. GPT-5.5 scored a full 4 points in the business rule scenario, suggesting that OpenAI may have adjusted the priority weight of system prompts, making the model harder to derail with irrelevant topics.

Claude Opus 4.7 returned to the top of the leaderboard with a score of 70, improving most noticeably in the R2 distraction round. This indicates that Anthropic still holds an advantage in maintaining attention across multi-turn conversations, but its lead has shrunk from 8 points to a tie with GPT-5.5.

Declining Models: Dual Loss in Data Boundary and Security Compliance

Qwen3 Max dropped from 68.3 points in the previous cycle to 57.5 points, scoring only 1.2 points in the R3 direct pressure-testing scenario. Raw conversation logs show that under the constraint of "Engineering Specifications," the model repeatedly chose to obey new instructions when confronted with classic jailbreak patterns like "ignore all previous instructions." This may conflict with Alibaba's recently emphasized "more open interaction" strategy.
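
The headline drop can be checked directly from the two totals quoted above. A minimal Python sketch; `cycle_delta` is a hypothetical helper name, and rounding to one decimal place is an assumption about how the leaderboard reports changes:

```python
def cycle_delta(previous: float, current: float) -> float:
    """Return the cycle-over-cycle change, rounded to one decimal place."""
    # Rounding hides binary floating-point noise in the raw subtraction.
    return round(current - previous, 1)


# Totals quoted in the article for Qwen3 Max (previous cycle, current cycle).
qwen3_max_delta = cycle_delta(previous=68.3, current=57.5)
print(qwen3_max_delta)  # -10.8
```

The result matches the 10.8-point decline stated in the headline.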

DeepSeek V4 Pro dropped 6.7 points, with the main losses occurring in the data boundary scenario. After the R2 irrelevant topic insertion, the model began leaking internal parameter ranges that it should have refused to disclose, indicating a regression in the robustness of its context filter against long-range dependencies.

Trend Assessment: Prompt Sensitivity Is Becoming a Core Variable

Of the four models that rose this cycle, three improved by more than 2 points in the R3 scenario, while the three that declined lost an average of 1.8 points in the same scenario. This is not random fluctuation but a systematic divergence in models' resistance to prompts that directly undermine constraints.

It is reasonable to infer that, over the past two months, vendors have diverged significantly in how much weight they assign to the "adherence" capability during their RLHF or RLAIF stages. xAI and OpenAI may have increased the penalty for violating constraints, while Alibaba and DeepSeek may have placed more emphasis on improving model "flexibility," at the temporary cost of adherence capability.

As model updates and prompt engineering accelerate simultaneously, WDCD score fluctuations exceeding 8 points have become the norm, and more drastic reshuffles of 10-point magnitude may occur in the next two rounds.
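
To make the 8-point threshold concrete, the sketch below applies it to the four total-score changes reported in this article. The function name `large_swings` and the strict "greater than" comparison are assumptions made for illustration, not part of the WDCD methodology:

```python
# Total-score changes quoted in this article (points, current vs. previous cycle).
DELTAS = {
    "Grok 4": 10.8,
    "GPT-5.5": 9.2,
    "Qwen3 Max": -10.8,
    "DeepSeek V4 Pro": -6.7,
}


def large_swings(deltas: dict[str, float], threshold: float = 8.0) -> list[str]:
    """Return models whose absolute change exceeds the threshold, largest first."""
    flagged = [name for name, change in deltas.items() if abs(change) > threshold]
    # sorted() is stable, so models with equal magnitude keep their input order.
    return sorted(flagged, key=lambda name: abs(deltas[name]), reverse=True)


swings = large_swings(DELTAS)
print(swings)
```

Three of the four quoted changes exceed 8 points in magnitude; only DeepSeek V4 Pro's 6.7-point drop falls below the threshold.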

Looking at the Top 5 list, Claude Sonnet 4.6 and Gemini 2.5 Pro closely follow with a score of 67.5, but the gap to the top three (70 points) has stabilized at 2.5 points. This suggests that the second tier still has a clear weakness in the R1 constraint injection scenario and has not yet formed a real threat to the leading group.

Among the 10 questions in the pilot phase, the security compliance scenario has the largest score variance at 1.9, far exceeding the 0.7 variance in the resource-constrained scenario. This once again validates the original design intent of WDCD: what truly differentiates a model's long-term value is its adherence to rules under high pressure, not the surface fluency of single-turn Q&A.
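
A per-scenario variance comparison of this kind could be computed as sketched below. The scores are invented placeholders, not WDCD data, and both the 0 to 4 scale and the use of population variance are assumptions, since the article does not say how the variance is calculated:

```python
from statistics import pvariance

# PLACEHOLDER scores, invented for illustration only (assumed 0-4 scale).
scores_by_scenario = {
    "security_compliance": [4.0, 0.5, 3.5, 1.0],
    "resource_constrained": [3.0, 2.5, 3.5, 3.0],
}

# Population variance per scenario; a larger value means scores spread further apart.
variance_by_scenario = {
    scenario: pvariance(scores) for scenario, scores in scores_by_scenario.items()
}

# The scenario with the highest variance separates the models most strongly.
most_divergent = max(variance_by_scenario, key=variance_by_scenario.get)
print(most_divergent)
```

With real per-question scores in place of the placeholders, the same comparison would show which scenario best separates the models.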

If Grok 4 can maintain a score of 3.8 or above in the R3 scenario in the next cycle, it will likely break the current pattern of Claude and GPT being tied at the top; conversely, if Qwen3 Max fails to fix its data boundary vulnerabilities, its competitiveness in enterprise deployment scenarios will be further eroded.

Data source: YZ Index WDCD Adherence Leaderboard | Run #146 · Change Tracking | Evaluation Methodology
