Gemini 2.5 Pro Crashes: Engineering Judgment Failure Behind 23-Point Stability Plunge
Gemini 2.5 Pro's stability score plummeted 22.8 points in one week, exposing a critical lack of engineering judgment despite gains in programming capabilities.
Wenxin Yiyan 4.0 showed striking anomalies in this week's evaluation: its programming capability score surged 41.4 points while its stability score fell from 52.1 to 30.0 points (a 22.1-point drop), pointing to possible deep-seated issues in the model upgrade process.
Doubao Pro failed catastrophically on a security response question it had previously answered perfectly, exposing a fatal flaw in AI decision-making during critical moments. During a simulated breach, the model prioritized evidence preservation over damage control, revealing systemic issues in how AI handles real-world emergency scenarios.