Winzheng

Gemini 2.5 Pro Crashes: Engineering Judgment Failure Behind 23-Point Stability Plunge

Gemini 2.5 Pro's stability score plummeted 22.8 points in one week, exposing a critical lack of engineering judgment despite gains in programming capabilities.

Gemini 2.5 Pro, Model Stability, Google AI
777 03-22

Wenxin 4.0 Stability Plummets 22 Points: Why Does Baidu AI Always Drop the Ball at Critical Moments?

Wenxin 4.0's stability score crashed from 52.1 to 30 points while programming ability soared by 41.4 points, exposing Baidu's critical engineering shortcomings and raising serious concerns about China's AI industrialization approach.

Wenxin Yiyan 4.0, Stability Testing, Baidu AI
2,196 03-22

Qwen Max Stability Plummets by 22.8 Points: Model Update Triggers Output Quality Volatility

Qwen Max exhibits extreme duality in this week's evaluation, with significant improvements in programming and long-context tasks, but a catastrophic decline in stability metrics. This "fire and ice" performance warrants in-depth analysis.

Qwen Max, Stability Testing, AI Evaluation
523 03-22

Technical Concerns Behind Gemini 2.5 Pro's Dramatic Stability Decline

This week's evaluation data reveals Gemini 2.5 Pro's stability score plummeted from 54.0 to 31.2, a 42.2% drop, exposing serious issues in maintaining consistent output quality while other metrics improved.

Gemini, Model Stability, Performance Evaluation
1,189 03-22
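
The point and percentage figures quoted in these stability summaries can be checked with simple arithmetic. The sketch below is only an illustration: the helper name `score_drop` is hypothetical, and the one-decimal rounding is an assumption about how the listing reports its numbers, not a description of how the YZ Index computes them.

```python
def score_drop(before: float, after: float) -> tuple[float, float]:
    """Return (points lost, percent lost), each rounded to one decimal.

    Hypothetical helper for checking quoted figures; the rounding
    convention is an assumption, not the benchmark's documented method.
    """
    points = round(before - after, 1)
    percent = round((before - after) / before * 100, 1)
    return points, percent


# Gemini 2.5 Pro figures quoted above: 54.0 -> 31.2
print(score_drop(54.0, 31.2))  # (22.8, 42.2)
```

Under the same assumption, the 53.5 to 31.0 decline quoted for Claude Opus 4.6 works out to 22.5 points and 42.1%.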

DeepSeek R1 Stability Plummets 22 Points: The Truth Behind Complete Failure on Simple Judgment Questions

DeepSeek R1's stability score crashed from 53.7 to 31.6 points this week, with the model failing basic judgment questions like whether water can boil at 101°C under standard pressure, raising serious concerns about its reliability.

DeepSeek R1, Stability Testing, AI Reasoning Failure
463 03-22

Claude 4.6 Version Crashes: The Algorithmic Black Hole Behind a 23-Point Plunge

While everyone celebrates Claude's 38.3-point programming improvement, a more dangerous signal has been masked: stability plummeted from 54.2 to 31.2 points, revealing a systemic algorithmic collapse rather than normal performance fluctuation.

Claude, Stability Testing, Model Degradation
630 03-22

Technical Risks Behind Wenxin Yiyan 4.0's 22-Point Stability Plunge

Wenxin Yiyan 4.0 showed remarkable anomalies in this week's evaluation, with programming capability surging 41.4 points but stability plummeting from 52.1 to 30.0 points, revealing potential deep-seated issues in the model upgrade process.

ERNIE Bot, Model Stability, Performance Evaluation
434 03-22

Technical Analysis: DeepSeek V3's Stability Plunges by 21.4 Points

DeepSeek V3 shows contradictory performance this week with programming capabilities soaring 42.6 points while stability metrics collapse from 53.4 to 32.0 points, revealing critical trade-offs in AI model optimization.

DeepSeek V3, Stability Testing, Model Evaluation
445 03-22

11 AI Models Surge 40 Points in Programming Tests: What Really Happened?

A massive collective surge in AI model programming scores reveals hidden signals about the industry, including Chinese models dominating rankings for the first time and OpenAI's concerning decline.

DeepSeek, GPT-o3, Programming Ability Test
489 03-22

Technical Concerns Behind DeepSeek R1's 22-Point Stability Plunge

DeepSeek R1 shows extreme performance polarization in this week's evaluation: programming capability soared 47.4 points while stability plummeted 22.1 points, revealing critical trade-offs in model optimization.

DeepSeek R1, Stability Testing, Model Evaluation
457 03-22

The Technical Truth Behind Claude 3.5 Sonnet's 23-Point Stability Plunge

Claude 3.5 Sonnet (version 4.6) experienced a dramatic 42% drop in stability scores from 54.2 to 31.2, while simultaneously achieving significant improvements in programming capabilities and other dimensions, suggesting aggressive optimization strategies that may have compromised output consistency.

Claude, Stability Testing, AI Benchmarks
503 03-22

Claude Opus 4.6 Stability Plummets 22.5 Points: Output Format Chaos Raises Concerns

Claude Opus 4.6's stability score crashed from 53.5 to 31.0 points this week, a 42.1% decline, while programming capabilities surged 208%, highlighting the complex trade-offs in AI model optimization.

Claude, Stability Testing, AI Evaluation
543 03-22

11 AIs Answer the Same Debugging Question: 5 Score Zero, Where's the Fatal Gap?

Testing 11 mainstream AI models with a real debugging scenario revealed that 5 of them (45%) failed outright, including the newly released DeepSeek V3. The test exposed three critical blind spots in current AI models when handling engineering problems.

Doubao Pro, Claude, Engineering Debugging
643 03-21

11 AIs Answer the Same Question, 6 Get Even the Day of the Week Wrong

A simple time zone calculation that elementary school students can solve exposed the shocking reality: over half of top AI models failed completely, and none recognized that March 15th falls during US Daylight Saving Time.

DeepSeek, GPT-4o, Time Zone Calculation
583 03-21
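
The Daylight Saving Time point in the time zone item above can be checked with date arithmetic. The following is a minimal sketch, assuming the current US rule (in effect since 2007) that DST begins on the second Sunday in March; the function names are hypothetical, and the actual test question is not reproduced in this listing.

```python
from datetime import date, timedelta


def us_dst_start(year: int) -> date:
    """Second Sunday in March, the US DST start date under the post-2007 rule."""
    march_first = date(year, 3, 1)
    # date.weekday(): Monday is 0, Sunday is 6
    days_to_first_sunday = (6 - march_first.weekday()) % 7
    return march_first + timedelta(days=days_to_first_sunday + 7)


def march_15_is_in_dst(year: int) -> bool:
    """True if March 15 falls after the DST start date of that year."""
    return date(year, 3, 15) > us_dst_start(year)


print(us_dst_start(2026))        # 2026-03-08
print(march_15_is_in_dst(2026))  # True
```

Because the second Sunday in March can fall no later than March 14, March 15 is inside DST in every year covered by this rule, so any conversion involving US local time on that date has to use the daylight offset.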

11 AIs Tackle the Same Logic Puzzle, 3 Failures Expose Reasoning Black Holes

A simple logic puzzle involving 5 people's rankings stumped 3 out of 11 AI models, including DeepSeek V3 and Grok 3, revealing fundamental weaknesses in current AI reasoning capabilities despite their acclaimed performance on complex tasks.

DeepSeek, Grok, Logical Reasoning
876 03-21

11 AIs Answer Same Question: Doubao Scores 100, 8 Models Score 0

When given the same engineering judgment question, Doubao Pro scored a perfect 100 while 8 major AI models, including Claude and GPT-4o, scored 0, revealing a stark divide in practical problem-solving abilities.

Doubao Pro, Engineering Judgment, Bulk Messaging Feature Debugging
660 03-21

When 11 AIs Answer the Same Question, Only 1 Discovers the Truth: The Code Has No Bug

A Python script that had run smoothly for 6 months suddenly threw an error. When 11 top AI models were asked to find the bug, only one discovered the truth: there was no bug in the code at all.

GPT-o3, Claude, AI Testing
495 03-21

11 AIs Answer the Same Question, 10 Are Playing Dumb: Why Did Doubao Get a Perfect Score?

A simple server configuration verification test revealed that 10 out of 11 leading AI models, including GPT-4o and Claude, gave perfunctory responses, while only Doubao Pro provided a comprehensive, practical solution that addressed the real workplace scenario.

Doubao, DeepSeek, Engineering Mindset
443 03-21

11 AIs Answer the Same Question, 7 Fail: Who's Pretending to Be Smart?

A real-world engineering scenario exposed that over 60% of top AI models prioritize reporting over immediate action during data breaches, with Chinese models surprisingly outperforming their Western counterparts.

DeepSeek, Claude, Security Incident Response
521 03-21

Grok 3's Logic Score Plummets to Zero: Five Letters Expose Fatal Algorithm Flaw

Grok 3's logical reasoning score collapsed from 100 to 0 in the latest YZ Index evaluation, exposing a systemic failure in the model's reasoning capabilities despite improvements in other areas.

Grok 3, Logical Reasoning, Model Evaluation
442 03-21

© 1998-2026 Winzheng All rights reserved.

Founded in 1998, relaunched in 2025. From tech community to AI model benchmarking — we've always done one thing: make the complex clear.

This benchmark operates independently and accepts no sponsorship from AI model vendors. Every score in the YZ Index is produced by automated evaluation.

Citation format: YZ Index (2026). AI Model Comprehensive Rankings. https://www.winzheng.com/yz-index/

Data License: CC BY-NC 4.0