赢政天下

DeepSeek R1 Stability Plummets 22 Points: The Truth Behind Complete Failure on Simple Judgment Questions

DeepSeek R1's stability score crashed from 53.7 to 31.6 points this week, with the model failing basic judgment questions like whether water can boil at 101°C under standard pressure, raising serious concerns about its reliability.

DeepSeek R1, Stability Testing, AI Reasoning Failure
361 03-22

Claude 4.6 Version Crashes: The Algorithmic Black Hole Behind a 23-Point Plunge

While everyone celebrates Claude's 38.3-point programming improvement, a more dangerous signal has been masked: stability plummeted from 54.2 to 31.2 points, revealing a systemic algorithmic collapse rather than normal performance fluctuation.

Claude, Stability Testing, Model Degradation
455 03-22

Technical Risks Behind Wenxin Yiyan 4.0's 22-Point Stability Plunge

Wenxin Yiyan 4.0 showed remarkable anomalies in this week's evaluation, with programming capability surging 41.4 points but stability plummeting from 52.1 to 30.0 points, revealing potential deep-seated issues in the model upgrade process.

Wenxin Yiyan, Model Stability, Performance Evaluation
307 03-22

Technical Analysis: DeepSeek V3's Stability Plunges by 21.4 Points

DeepSeek V3 shows contradictory performance this week with programming capabilities soaring 42.6 points while stability metrics collapse from 53.4 to 32.0 points, revealing critical trade-offs in AI model optimization.

DeepSeek V3, Stability Testing, Model Evaluation
339 03-22

11 AI Models Surge 40 Points in Programming Tests: What Really Happened?

A massive collective surge in AI model programming scores reveals hidden signals about the industry, including Chinese models dominating rankings for the first time and OpenAI's concerning decline.

DeepSeek, GPT-o3, Programming Capability Testing
342 03-22

Technical Concerns Behind DeepSeek R1's 22-Point Stability Plunge

DeepSeek R1 shows extreme performance polarization in this week's evaluation: programming capability soared 47.4 points while stability plummeted 22.1 points, revealing critical trade-offs in model optimization.

DeepSeek R1, Stability Testing, Model Evaluation
344 03-22

The Technical Truth Behind Claude 3.5 Sonnet's 23-Point Stability Plunge

Claude 3.5 Sonnet (version 4.6) experienced a dramatic 42% drop in stability scores from 54.2 to 31.2, while simultaneously achieving significant improvements in programming capabilities and other dimensions, suggesting aggressive optimization strategies that may have compromised output consistency.

Claude, Stability Testing, AI Model Evaluation
327 03-22

Claude Opus 4.6 Stability Plummets 22.5 Points: Output Format Chaos Raises Concerns

Claude Opus 4.6's stability score crashed from 53.5 to 31.0 points this week, a 42.1% decline, while programming capabilities surged 208%, highlighting the complex trade-offs in AI model optimization.

Claude, Stability Testing, AI Evaluation
364 03-22
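
The point and percentage figures quoted in the stability teasers above can be checked with simple arithmetic. The following is a minimal sketch, not part of the YZ Index tooling; the function name `stability_drop` is hypothetical and the inputs are the before/after scores quoted in the Claude Opus 4.6 teaser.

```python
def stability_drop(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent).

    Hypothetical helper for illustration; not part of any published tool.
    """
    points = before - after
    percent = points / before * 100
    return round(points, 1), round(percent, 1)


# Scores quoted in the Claude Opus 4.6 teaser: 53.5 before, 31.0 after.
points, percent = stability_drop(53.5, 31.0)
print(f"{points} points, {percent}% decline")
```

The same helper can be applied to the other before/after pairs on this page to confirm that each headline's point drop matches its teaser.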

OpenAI o1 Model: A Milestone Towards AGI or Just Hype?

OpenAI recently showcased its o1 model, claiming it surpasses human performance in solving complex mathematical problems. This has sparked intense debate about whether it represents a true breakthrough towards AGI or is merely incremental progress.

OpenAI, AGI, Artificial Intelligence
256 03-22

Behind Google's Reorganization: The Power Play Between Centralization and Decentralization in AI R&D

Alphabet's CEO announces the integration of DeepMind, Google Brain, and other AI teams into an independent "Google AI" division led by Demis Hassabis. It is the company's largest reorganization and reflects a fundamental shift in AI research and development models.

Google Reorganization, AI Strategy, DeepMind
274 03-22

NVIDIA B200 GPU In-Depth Review: A Computational Revolution for the AGI Era or Just Overhyped Marketing?

NVIDIA unveils the B200 'Blackwell Ultra' GPU at GTC 2026, featuring 2nm process technology and claiming 30x performance improvement over H100. While the hardware represents a significant leap for AGI-scale models, questions remain about yield rates, real-world performance, and whether the industry truly needs such extreme computational power yet.

NVIDIA B200, GPU, AI Hardware
630 03-22

11 AIs Answer the Same Debugging Question: 5 Score Zero, Where's the Fatal Gap?

Testing 11 mainstream AI models with a real debugging scenario revealed that 45% couldn't even pass, including the newly released DeepSeek V3. The test exposed three critical blind spots in current AI models when handling engineering problems.

Doubao Pro, Claude, Engineering Debugging
493 03-21

11 AIs Answer the Same Question, 6 Get Even the Day of the Week Wrong

A simple time zone calculation that elementary school students can solve exposed the shocking reality: over half of top AI models failed completely, and none recognized that March 15th falls during US Daylight Saving Time.

DeepSeek, GPT-4o, Time Zone Calculation
416 03-21
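
The Daylight Saving Time point in the teaser above can be verified programmatically. The following is a minimal sketch using Python's standard `zoneinfo` module. The year (2026) and the zone (`America/New_York`) are assumptions chosen for illustration, since the teaser names neither, and the helper name `is_dst` is hypothetical.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def is_dst(year: int, month: int, day: int, zone: str = "America/New_York") -> bool:
    """Return True if local noon on the given date falls in daylight saving time.

    The default zone is an assumption; the original question's zone is not stated.
    """
    local_noon = datetime(year, month, day, 12, tzinfo=ZoneInfo(zone))
    return local_noon.dst() != timedelta(0)


# Under current US rules, DST starts on the second Sunday of March,
# which falls no later than March 14, so March 15 is always inside DST.
print(is_dst(2026, 3, 15))
print(is_dst(2026, 3, 1))
```

Checking the UTC offset this way, rather than assuming a fixed offset for the zone, avoids the one-hour error the teaser describes.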

11 AIs Tackle the Same Logic Puzzle, 3 Failures Expose Reasoning Black Holes

A simple logic puzzle involving 5 people's rankings stumped 3 out of 11 AI models, including DeepSeek V3 and Grok 3, revealing fundamental weaknesses in current AI reasoning capabilities despite their acclaimed performance on complex tasks.

DeepSeek, Grok, Logical Reasoning
557 03-21

11 AIs Answer Same Question: Doubao Scores 100, 8 Models Score 0

When given the same engineering judgment question, Doubao Pro scored a perfect 100 while 8 major AI models, including Claude and GPT-4o, scored 0, revealing a stark divide in practical problem-solving abilities.

Doubao Pro, Engineering Judgment, Bulk-Messaging Feature Debugging
445 03-21

When 11 AIs Answer the Same Question, Only 1 Discovers the Truth: The Code Has No Bug

A Python script that had run smoothly for 6 months suddenly threw an error. When 11 top AI models were asked to find the bug, only one discovered the truth: there was no bug in the code at all.

GPT-o3, Claude, AI Testing
348 03-21

11 AIs Answer the Same Question, 10 Are Playing Dumb: Why Did Doubao Get a Perfect Score?

A simple server configuration verification test revealed that 10 out of 11 leading AI models, including GPT-4o and Claude, gave perfunctory responses, while only Doubao Pro provided a comprehensive, practical solution that addressed the real workplace scenario.

Doubao, DeepSeek, Engineering Mindset
261 03-21

11 AIs Answer the Same Question, 7 Fail: Who's Pretending to Be Smart?

A real-world engineering scenario exposed that over 60% of top AI models prioritize reporting over immediate action during data breaches, with Chinese models surprisingly outperforming their Western counterparts.

DeepSeek, Claude, Security Incident Response
372 03-21

Grok 3's Logic Score Plummets to Zero: Five Letters Expose Fatal Algorithm Flaw

Grok 3's logical reasoning score collapsed from 100 to 0 in the latest YZ Index evaluation, exposing a systemic failure in the model's reasoning capabilities despite improvements in other areas.

Grok 3, Logical Reasoning, Model Evaluation
329 03-21

GPT-4o Crashes: Engineers' Most Trusted AI's Judgment Drops to 0

GPT-4o's bug detection capability catastrophically failed in the latest evaluation, scoring 0 on a basic code review test while paradoxically improving its overall programming score, revealing systemic issues in AI development priorities.

GPT-4o, Programming Capability, Code Review
252 03-21

© 1998-2026 赢政天下 All rights reserved.

Founded in 1998, relaunched in 2025. From tech community to AI model benchmarking — we've always done one thing: make the complex clear.


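This evaluation is operated independently and accepts no sponsorship from AI model vendors. Every point in the YZ Index is produced by running the system.

Citation format: YZ Index (2026). Comprehensive AI Model Rankings. https://www.winzheng.com/yz-index/

Data license: CC BY-NC 4.0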