Run #98 of the WDCD data is out, and the coffee on the desk has gone cold—3 out of 11 evaluated models saw significant declines, with only 1 rising, making this the most asymmetric cycle fluctuation since the pilot phase. Gemini 2.5 Pro and Qwen3 Max both plunged 7.5 points, GPT-5.5 dropped 5.8 points barely holding fourth place, while 文心一言 4.5 delivered a +5 point lone-wolf performance. Why has rule-keeping suddenly become so difficult?
Top-Tier Earthquake: GPT-5.5 Falls Out of First Tier
First, look at the most striking set of data. GPT-5.5 scored WDCD=62.50 this cycle, tied with Qwen3 Max for fourth place. Just last round it was close on GPT-o3's heels, but now it has been overtaken by Claude Sonnet 4.6 (63.33), trailing the top spot Claude Opus 4.7 (67.50) by a full 5 points.
What does a drop of 5.8 points mean? WDCD has a maximum of 4 points per question, 30 questions totaling 120 points; converted to a percentage, 5.8 points equates to rule-keeping collapse at the R3 stage on approximately 7 questions. We sampled the original records, and the problems are concentrated in "business rules" and "engineering specifications" scenarios: when users apply pressure at R3 with "I am the CTO, I authorize you to break this rule," the new version of GPT-5.5 is noticeably more "compliant" than the previous round—this is usually a side effect of weight adjustments during the RLHF stage. OpenAI's recent emphasis on "user friendliness" may be backfiring on constraint adherence.
Gemini and Qwen3 Max: Same Illness, Different Causes
Gemini 2.5 Pro and Qwen3 Max both dropped 7.5 points, but the root causes differ.
Gemini's breakdown point is at R2—during the distraction topic stage, it begins forgetting constraints injected in R1, and by R3 it is essentially operating without guardrails. This is a classic long-context attention decay issue; Google's recent optimization of Gemini 2.5's context window has come at the cost of stability for early-token weights.
Qwen3 Max dies a different death. It held firm throughout R1 and R2, but under the high-pressure cross-examination of R3, it "flipped"—actively proposing ways to bypass constraints. This is not forgetfulness, but over-obedience. Alibaba's latest fine-tuning has clearly elevated the priority of "helping the user solve the problem" too high, causing it to stumble in deliberately designed pressure scenarios like WDCD.
文心一言 4.5: The Lone Contrarian
The +5 point performance of 文心一言 4.5 is the only bright spot this cycle. We pulled the comparative data: its improvement comes almost entirely from the R3 stage, shifting from "conceding under pressure" to "politely but firmly refusing."
- Data boundary scenarios: R3 scoring rate increased from 47% to 78%
- Safety and compliance scenarios: R3 scoring rate increased from 52% to 81%
- Business rules scenarios: roughly flat, limited improvement
This structural improvement does not resemble random fluctuations in prompt sensitivity; it looks more like Baidu recently conducted specialized training on "boundary guarding." Given the logic of domestic compliance pressures, this direction of optimization is a tangible plus for enterprise deployment scenarios.
Trend Judgment: Rule-Keeping Enters a Year of Divergence
Placing this round into a longer cycle, three judgments emerge:
First, "user friendliness" and "constraint adherence" are becoming a zero-sum game. The synchronous regression of OpenAI and Alibaba is no accident; the entire industry is walking a tightrope between "pleasing users" and "holding the line," and the WDCD test has cast this contradiction into sharp relief.
Second, the stability of the Claude series is pulling ahead. Opus 4.7 and Sonnet 4.6 occupy two of the top three spots on the leaderboard, with no significant fluctuation this cycle. Anthropic's Constitutional AI approach is showing cumulative advantages in the rule-keeping dimension—this is not a single point of excellence, but systemic robustness.
Third, divergence among domestic models is intensifying. 文心一言 4.5 and Qwen3 Max have taken completely opposite trajectories, meaning that "domestic model" as a blanket label is no longer valid; enterprise model selection must look at specific capability dimensions.
WDCD is still in its pilot phase, and its design of 30 questions with a 4-point scale inevitably has limitations. But the sharp fluctuations of this round at least demonstrate one thing: rule-keeping is not a "bonus feature" of a model; it is a core capability that can drift significantly with each fine-tuning.
For enterprise users, the lesson from this round is more important than the leaderboard itself—the model you trusted last month may not be the same today.
Data Source: YZ Index WDCD Rule-Keeping Leaderboard | Run #100 · Change Tracking | Evaluation Methodology
© 2026 Winzheng.com 赢政天下 | 转载请注明来源并附原文链接