AI Evaluations - AI News | 赢政天下

Gemini 2.5 Pro Crashes: Engineering Judgment Failure Behind 23-Point Stability Plunge

Gemini 2.5 Pro's stability score plummeted 22.8 points in one week, exposing a critical lack of engineering judgment despite gains in programming capabilities.

Technical Risks Behind Wenxin Yiyan 4.0's 22-Point Stability Plunge

Wenxin Yiyan 4.0 showed striking anomalies in this week's evaluation: its programming capability score surged 41.4 points while its stability score plummeted from 52.1 to 30.0, a 22.1-point drop that points to potential deep-seated issues in the model upgrade process.

Doubao Pro Scores Zero on a Previously Perfect Question: Why AI Models Collectively Fall Silent During Real Security Incidents

Doubao Pro failed catastrophically on a previously perfect security response question, exposing a fatal flaw in AI decision-making during critical moments. The model prioritized evidence preservation over damage control during a simulated breach, revealing systemic issues in how AI handles real-world emergency scenarios.
