WDCD Cycle Dramatic Shift: GPT-5.5 Tops with 71.67 Points, Gemini Surges 14.2, Wenxin Crashes

The most direct conclusion of this WDCD cycle is: GPT-5.5 has re-established the ceiling of instruction adherence capability with an absolute advantage of 71.67 points, while Gemini 2.5 Pro's huge leap of 14.2 points has completely rewritten the existing impression that "Google models are relatively weak in instruction following."

GPT-5.5: Stability Under Three Rounds of Interference from a Score of 71.67

Compared to Run #115, GPT-5.5's R3 pressure score improvement was most significant this cycle, reaching 1.8 points out of a near-perfect 2.0. This indicates that under the compound constraint scenario of "resource limitations + safety compliance," its resistance to directly breaking instructions in the third round has notably strengthened. Qwen3 Max follows closely with 67.50 points, a gap of only 4.17 points, showing that domestic models are accelerating their catch-up in engineering specification tasks.

Gemini 2.5 Pro's 14.2-Point Leap: Model Update or Prompt Sensitivity Restructuring?

This cycle, Gemini 2.5 Pro saw the largest increase, with its R2 off-topic interference phase score rising from 1.2 to 2.8, almost eliminating its previous weakness of rule collapse after interference. Combined with its recent context window expansion and safety fine-tuning records, it is highly likely that a specialized optimization has been performed for the multi-turn conversation constraint maintenance mechanism. In contrast, Claude Opus 4.7 only rose by 6.7 points, a relatively moderate increase, indicating that its baseline was already high and marginal improvement space has narrowed.

Wenxin Yiyan 4.5's 7.5-Point Plunge: Isolated Case or Signal?

Wenxin Yiyan 4.5, the only model to decline, fell from approximately 55 points in the previous cycle to 47.5 points, with its R3 phase score halved. This is particularly prominent in the two scenario categories of "data boundaries" and "business rules." Given the frequency of its training data updates and tightening of safety policies, it is highly likely that a "over-alignment" induced rule rigidity problem has emerged — the model tends to directly refuse or deviate from the original constraints under high pressure in the third round, rather than finding compliant solutions within the boundaries.

Trend Judgment: Instruction Adherence Capability Enters the "Update-Driven" Era

  • The simultaneous rise of GPT series and Gemini confirms the focused investment by OpenAI and Google in multi-turn context consistency recently.
  • Grok 4 rose by 10 points, showing that xAI is catching up in engineering specification constraints.
  • Only one model declined, but the drop is concentrated, suggesting that some domestic models may have entered a bottleneck period in iteration.

Among the current Top 5, GPT-5.5, Qwen3 Max, and Claude Opus 4.7 form the first tier, with gaps within 5 points, indicating that the competition has entered a white-hot phase.

When adherence testing shifts from static single-turn to dynamic three-turn interference, the real difference between models is no longer "whether they can answer," but "whether they can still hold onto the first-turn commitment when cornered in the third turn."

If Gemini continues to maintain an increase of 14 points in the next cycle, GPT-5.5's lead will be compressed to within 3 points; if Wenxin Yiyan cannot stop its decline in the R3 phase, it may fall out of the top eight. Instruction adherence capability has evolved from "icing on the cake" to "life-or-death line," and the pace of model updates will directly determine the ranking.


Data source: Winzheng YZ Index WDCD Adherence Ranking | Run #120 · Change Tracking | Evaluation Methodology