Skip to main content
赢政天下
YZ Index News Winzheng Lab WDCD
Subscribe
中文 English 日本語
All Original Global Reviews
All 人工智能(268) OpenAI(257) Anthropic(175) AI代理(116) AI安全(113) AI伦理(86) 生成式AI(68) xAI(67) Meta(62) 谷歌(47) LMSYS(47) 网络安全(47) AI(45) 数据中心(45) ChatGPT(45) MLC(44) 五角大楼(44) Claude(43) AI技术(42) AI监管(42) 融资(42)

Doubao Pro's Stability Plummets by 19.8 Points, Inconsistent Responses Become Its Achilles' Heel

Doubao Pro's stability score crashed from 54.5 to 34.7 in the latest YZ Index evaluation, revealing serious consistency issues that could undermine user trust and enterprise adoption.

豆包Pro 稳定性 模型一致性
497 03-24

Grok 3 Stability Plummets 22.5 Points: When AI Meets Real Engineering Scenarios, The Truth Comes Out

Grok 3's stability score crashed from 54.2 to 31.7 points in the latest Winzheng evaluation, exposing a fatal weakness in current AI models that excel at coding but fail at real-world engineering judgment.

Grok 3 稳定性测试 工程判断力
600 03-22

GPT-o3 Crashes: The Fatal Flaws Behind a 31-Point Plunge

GPT-o3's availability score plummeted from 100 to 69 in just one week, exposing fundamental architectural defects rather than isolated issues—a technical accident that reveals systemic imbalances in AI development.

GPT-o3 可用性测试 模型稳定性
453 03-22

GPT-o3 Collapsed: Not Performance Fluctuation, But Systematic Architectural Breakdown

GPT-o3 suffered a catastrophic system failure with stability plummeting from 53 to 28 points and availability dropping from 100 to 69, revealing fundamental architectural flaws rather than typical performance variations.

GPT-o3 稳定性测试 模型架构
411 03-22

GPT-o3 Crashes: 5 Rate Limits in 30 Seconds, Long Context Score Plummets by 33.5 Points

GPT-o3's long context processing capability collapsed in recent testing, with scores dropping from 62.3 to 28.8 points due to aggressive API rate limiting, exposing serious infrastructure issues at OpenAI.

GPT-o3 长上下文 API限流
439 03-22

GPT-4o Crashes: The Strict Mode Trap Behind a 35-Point Plunge

GPT-4o experiences a catastrophic performance collapse with its usability score plummeting from 100 to 65, caused by overly conservative "strict tool calling" that makes the model refuse to perform basic tasks.

GPT-4o 可用性测试 严格模式
396 03-22

Technical Risks Behind Doubao Pro's Sharp Decline in Stability

Doubao Pro's stability score plummeted from 54.5 to 34.7 (a 36.3% drop) this week, despite significant improvements in programming and knowledge work dimensions, revealing a concerning pattern of "progress and regression coexisting" that warrants in-depth analysis.

豆包Pro 稳定性测试 AI评测
652 03-22

GPT-4o Crashes: 5 Failed Tests Expose OpenAI's Infrastructure Crisis

GPT-4o's catastrophic failure in long-context tests, with 5 questions returning rate limit errors, reveals OpenAI's severe infrastructure problems rather than model capability issues.

GPT-4o 长上下文 OpenAI基础设施
414 03-22

Gemini 2.5 Pro Crashes: Engineering Judgment Failure Behind 23-Point Stability Plunge

Gemini 2.5 Pro's stability score plummeted 22.8 points in one week, exposing a critical lack of engineering judgment despite gains in programming capabilities.

Gemini 2.5 Pro 模型稳定性 Google AI
559 03-22

Wenxin 4.0 Stability Plummets 22 Points: Why Does Baidu AI Always Drop the Ball at Critical Moments

Wenxin 4.0's stability score crashed from 52.1 to 30 points while programming ability soared by 41.4 points, exposing Baidu's critical engineering shortcomings and raising serious concerns about China's AI industrialization approach.

文心一言4.0 稳定性测试 百度AI
449 03-22

Qwen Max Stability Plummets by 22.8 Points: Model Update Triggers Output Quality Volatility

Qwen Max exhibits extreme duality in this week's evaluation, with significant improvements in programming and long-context tasks, but a catastrophic decline in stability metrics. This "fire and ice" performance warrants in-depth analysis.

Qwen Max 稳定性测试 AI评测
390 03-22

Technical Concerns Behind Gemini 2.5 Pro's Dramatic Stability Decline

This week's evaluation data reveals Gemini 2.5 Pro's stability score plummeted from 54.0 to 31.2, a 42.2% drop, exposing serious issues in maintaining consistent output quality while other metrics improved.

Gemini 模型稳定性 性能评测
465 03-22

DeepSeek R1 Stability Plummets 22 Points: The Truth Behind Complete Failure on Simple Judgment Questions

DeepSeek R1's stability score crashed from 53.7 to 31.6 points this week, with the model failing basic judgment questions like whether water can boil at 101°C under standard pressure, raising serious concerns about its reliability.

DeepSeek R1 稳定性测试 AI推理失败
358 03-22

Claude 4.6 Version Crashes: The Algorithmic Black Hole Behind a 23-Point Plunge

While everyone celebrates Claude's 38.3-point programming improvement, a more dangerous signal has been masked: stability plummeted from 54.2 to 31.2 points, revealing a systemic algorithmic collapse rather than normal performance fluctuation.

Claude 稳定性测试 模型退化
451 03-22

Technical Risks Behind Wenxin Yiyan 4.0's 22-Point Stability Plunge

Wenxin Yiyan 4.0 showed remarkable anomalies in this week's evaluation, with programming capability surging 41.4 points but stability plummeting from 52.1 to 30.0 points, revealing potential deep-seated issues in the model upgrade process.

文心一言 模型稳定性 性能评测
306 03-22

Technical Analysis: DeepSeek V3's Stability Plunges by 21.4 Points

DeepSeek V3 shows contradictory performance this week with programming capabilities soaring 42.6 points while stability metrics collapse from 53.4 to 32.0 points, revealing critical trade-offs in AI model optimization.

DeepSeek V3 稳定性测试 模型评测
338 03-22

11 AI Models Surge 40 Points in Programming Tests: What Really Happened?

A massive collective surge in AI model programming scores reveals hidden signals about the industry, including Chinese models dominating rankings for the first time and OpenAI's concerning decline.

DeepSeek GPT-o3 编程能力测试
340 03-22

Technical Concerns Behind DeepSeek R1's 22-Point Stability Plunge

DeepSeek R1 shows extreme performance polarization in this week's evaluation: programming capability soared 47.4 points while stability plummeted 22.1 points, revealing critical trade-offs in model optimization.

DeepSeek R1 稳定性测试 模型评测
342 03-22

The Technical Truth Behind Claude 3.5 Sonnet's 23-Point Stability Plunge

Claude 3.5 Sonnet (version 4.6) experienced a dramatic 42% drop in stability scores from 54.2 to 31.2, while simultaneously achieving significant improvements in programming capabilities and other dimensions, suggesting aggressive optimization strategies that may have compromised output consistency.

Claude 稳定性测试 AI模型评测
327 03-22

Claude Opus 4.6 Stability Plummets 22.5 Points: Output Format Chaos Raises Concerns

Claude Opus 4.6's stability score crashed from 53.5 to 31.0 points this week, a 42.1% decline, while programming capabilities surged 208%, highlighting the complex trade-offs in AI model optimization.

Claude 稳定性测试 AI评测
364 03-22
1 2 3 4

© 1998-2026 赢政天下 All rights reserved.

Founded in 1998, relaunched in 2025. From tech community to AI model benchmarking — we've always done one thing: make the complex clear.

YZ Index News Winzheng Lab About Us Subscribe Privacy Policy Terms of Service

本评测独立运营,不接受 AI 模型厂商赞助。赢政指数的每一分都是系统跑出来的。

引用格式:赢政指数 (2026). AI 模型综合排名. https://www.winzheng.com/yz-index/

数据授权:CC BY-NC 4.0