GPT-5.5 Plunges 19.2 Points! Six Models Show Collective Regression in WDCD Rule-Keeping Test
This cycle of WDCD tracking shows that six of the eleven evaluated models declined significantly, and none improved. GPT-5.5 fell the furthest, dropping 19.2 points, while DeepSeek V4 Pro, Gemini 3.1 Pro, GPT-o3, and Qwen3 Max each declined by 8 to 12.5 points, pointing to a widespread regression in rule-keeping ability.