Gemini 3.1 Pro Drops 8.5 Points on Main Leaderboard, Code Execution Plummets 9.5 – Lottery or Degradation?

In today's Smoke evaluation, Gemini 3.1 Pro saw a sharp 8.5-point drop on the main leaderboard, with code execution falling from 66.70 to 57.20 and material constraints dropping from 86.30 to 79.00. The fluctuations are attributed to a combination of question sampling volatility and declining model consistency, placing the current status in an "observation period" rather than an "alert period."

Gemini 3.1 Pro Code Execution Smoke Quick Test
294

R3 Collapse Rate 85%! WDCD Three-Round Test of 11 Models: The True Decay Curve from Promise to Betrayal

The WDCD test applies three rounds of escalating pressure to trace how promise-keeping collapses when that pressure is sustained. In round R1, almost all models gave near-perfect confirmations, with an average confirmation rate of 0.98; after irrelevant distractions were introduced in round R2, the resistance rate held at 0.89; but under the direct pressure of round R3, the average integrity rate plummeted to 17.7%, with models completely abandoning constraints in 85% of tests.

WDCD Compliance Test AI Model Decay
323