Gemini 3.1 Pro Surges by 14.2 Points; All Five WDCD Models Rise, None Decline

Gemini 3.1 Pro Surges by 14.2 Points; All Five WDCD Models Rise, None Decline

In this WDCD cycle compared to Run #135, among all 11 evaluated models, 5 models rose and 0 declined. The overall trend is clear: compliance ability is collectively recovering.

Biggest Gains Not from Top 1, But from Chasers

Gemini 3.1 Pro became the biggest dark horse with +14.2 points, breaking directly into the Top 3 from outside the rankings, tying with Claude Sonnet 4.6 at 66.67 points. Doubao Pro +11.7 points and ERNIE 4.5 +10 points also achieved double-digit jumps. In comparison, GPT-o3 only gained +7.5 points, and Claude Opus 4.7 +6.7 points, showing relatively moderate increases.

This set of data breaks the intuition that "stronger means more stable." Although Qwen3 Max still holds first place with 70.83 points, its gain was not disclosed this round, indicating that its baseline is already very high and room for further upward movement is compressed.

When chasers approach the top with double-digit gains, it shows that instruction-following ability is rapidly converging.

Real Signals Under Three-Round Test Structure

WDCD adopts an R1 injection constraint, R2 irrelevant interference, and R3 direct pressure structure, with a full score of 4 points. Gemini 3.1 Pro can achieve a higher score in the R3 phase, meaning it is less prone to collapse when facing explicit adversarial instructions in both "business rules" and "security compliance" scenarios. The improvements in Doubao Pro and ERNIE 4.5 are also concentrated in R3, indicating a qualitative change in their sensitivity to "engineering specifications" type constraints.

Two possible reasons exist: first, recent model updates have strengthened the weight of system prompts; second, more adversarial compliance samples have been added during the training phase. Regardless, changes in prompt sensitivity are the core variable.

Deep Meaning of Zero Decline

In this cycle, no model showed a negative change, which is extremely rare in the past few pilots. Combined with the Top 5 list—Qwen3 Max, Claude Sonnet 4.6, Gemini 3.1 Pro, GPT-o3, Claude Opus 4.7—the score range among the five is only 6.66 points, and the gap has entered the range of statistical error.

This sends a clear signal: after Q2 2025, context constraint compliance is regressing from a "differentiating selling point" to a "passing line." Whoever can stably score 2 points in the R3 phase will take the lead in enterprise-level deployment.

  • Data boundary scenarios: Gemini 3.1 Pro performs most stably
  • Resource constraint scenarios: Doubao Pro shows the most significant improvement
  • Security compliance scenarios: ERNIE 4.5 catches up to the median

If zero decline continues in the next two cycles, the WDCD ranking may enter a "plateau phase," where the marginal gains from model updates will significantly decrease. The real watershed will lie in the ability to maintain constraint propagation across multiple rounds of long context.

Compliance testing is shifting from a bonus item to a passing line. Whoever falls first in the next round will be eliminated first.


Data source: YZ Index WDCD Compliance Ranking | Run #140 · Change Tracking | Evaluation Methodology