Gemini 2.5 Pro Crashes: Engineering Judgment Failure Behind 23-Point Stability Plunge
Gemini 2.5 Pro's stability score plummeted 22.8 points in one week, exposing a critical lack of engineering judgment despite gains in programming capabilities.
Gemini 2.5 Pro's stability score plummeted 22.8 points in one week, exposing a critical lack of engineering judgment despite gains in programming capabilities.
Wenxin 4.0's stability score crashed from 52.1 to 30 points while programming ability soared by 41.4 points, exposing Baidu's critical engineering shortcomings and raising serious concerns about China's AI industrialization approach.
Qwen Max exhibits extreme duality in this week's evaluation, with significant improvements in programming and long-context tasks, but a catastrophic decline in stability metrics. This "fire and ice" performance warrants in-depth analysis.
This week's evaluation data reveals Gemini 2.5 Pro's stability score plummeted from 54.0 to 31.2, a 42.2% drop, exposing serious issues in maintaining consistent output quality while other metrics improved.
DeepSeek R1's stability score crashed from 53.7 to 31.6 points this week, with the model failing basic judgment questions like whether water can boil at 101°C under standard pressure, raising serious concerns about its reliability.
While everyone celebrates Claude's 38.3-point programming improvement, a more dangerous signal has been masked: stability plummeted from 54.2 to 31.2 points, revealing a systemic algorithmic collapse rather than normal performance fluctuation.
Wenxin Yiyan 4.0 showed remarkable anomalies in this week's evaluation, with programming capability surging 41.4 points but stability plummeting from 52.1 to 30.0 points, revealing potential deep-seated issues in the model upgrade process.
DeepSeek V3 shows contradictory performance this week with programming capabilities soaring 42.6 points while stability metrics collapse from 53.4 to 32.0 points, revealing critical trade-offs in AI model optimization.
A massive collective surge in AI model programming scores reveals hidden signals about the industry, including Chinese models dominating rankings for the first time and OpenAI's concerning decline.
DeepSeek R1 shows extreme performance polarization in this week's evaluation: programming capability soared 47.4 points while stability plummeted 22.1 points, revealing critical trade-offs in model optimization.
Claude 3.5 Sonnet (version 4.6) experienced a dramatic 42% drop in stability scores from 54.2 to 31.2, while simultaneously achieving significant improvements in programming capabilities and other dimensions, suggesting aggressive optimization strategies that may have compromised output consistency.
Claude Opus 4.6's stability score crashed from 53.5 to 31.0 points this week, a 42.1% decline, while programming capabilities surged 208%, highlighting the complex trade-offs in AI model optimization.
Testing 11 mainstream AI models with a real debugging scenario revealed that 45% couldn't even pass, including the newly released DeepSeek V3. The test exposed three critical blind spots in current AI models when handling engineering problems.
A simple time zone calculation that elementary school students can solve exposed the shocking reality: over half of top AI models failed completely, and none recognized that March 15th falls during US Daylight Saving Time.
A simple logic puzzle involving 5 people's rankings stumped 3 out of 11 AI models, including DeepSeek V3 and Grok 3, revealing fundamental weaknesses in current AI reasoning capabilities despite their acclaimed performance on complex tasks.
When given the same engineering judgment question, Doubao Pro scored perfect 100 while 8 major AI models including Claude and GPT-4o scored 0, revealing a stark divide in practical problem-solving abilities.
A Python code that ran smoothly for 6 months suddenly threw an error. When 11 top AI models were asked to find the bug, only one discovered the truth: there was no bug in the code at all.
A simple server configuration verification test revealed that 10 out of 11 leading AI models, including GPT-4o and Claude, gave perfunctory responses, while only Doubao Pro provided a comprehensive, practical solution that addressed the real workplace scenario.
A real-world engineering scenario exposed that over 60% of top AI models prioritize reporting over immediate action during data breaches, with Chinese models surprisingly outperforming their Western counterparts.
Grok 3's logic reasoning score collapsed from 100 to 0 in the latest YZ Index evaluation, exposing a systemic failure in the model's reasoning capabilities despite improvements in other areas.