YZ Index Weekly Report: Collective Decline in Knowledge Work Capabilities, Claude Remains Stable Against the Trend

Mar 20, 2026 | Source: winzheng.com

YZ Index Weekly Report AI Evaluation 2026-W12

This week's (2026-W12) YZ Index evaluation shows a rare collective decline in knowledge work capability across mainstream AI models. Of the 8 models evaluated, 6 lost ground in the knowledge work dimension. GPT-o3 fell 12.1 points, the largest single-metric drop in recent evaluations.

Core Finding: Widespread Degradation in Knowledge Work Capabilities

The data shows that this week's declines in knowledge work capability vary widely in size. Ordered from largest to smallest drop: GPT-o3 (-12.1) > Qwen Max (-9.8) > DeepSeek V3 (-7.1) > GPT-4o (-6.1) > Claude Opus 4.6 (-1.7). The widespread degradation may stem from vendors' recent adjustments to their model update strategies, or from trading away some knowledge retrieval capability to reduce inference costs.
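As a rough illustration of how a team might act on these deltas, the sketch below flags every model whose knowledge work score fell by at least a chosen amount. The delta values are the ones reported above; the 5-point alert threshold and the function name `flag_regressions` are assumptions made for this example and are not part of the YZ Index methodology.

```python
# Week-over-week changes in the knowledge work dimension, as reported above.
knowledge_work_delta = {
    "GPT-o3": -12.1,
    "Qwen Max": -9.8,
    "DeepSeek V3": -7.1,
    "GPT-4o": -6.1,
    "Claude Opus 4.6": -1.7,
}

# Hypothetical alert threshold in points (an assumption for this sketch).
ALERT_DROP = 5.0


def flag_regressions(deltas, alert_drop=ALERT_DROP):
    """Return (model, delta) pairs that dropped by at least alert_drop points,
    ordered from the largest drop to the smallest."""
    flagged = [(model, d) for model, d in deltas.items() if d <= -alert_drop]
    return sorted(flagged, key=lambda pair: pair[1])


flagged = flag_regressions(knowledge_work_delta)
for model, delta in flagged:
    print(f"{model}: {delta:+.1f}")
```

With a 5-point threshold, four of the five listed models are flagged and only Claude Opus 4.6 stays below the alert level. A stricter or looser threshold would be a per-application choice.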

Notably, Claude Sonnet 4.6 is the only model to post a gain this week, with its stability score improving by 3.8 points. Against an overall downward trend, Anthropic's focus on stability appears to be paying off.

Ranking Landscape: DeepSeek Duo Leads, But Advantage Shrinking

Although DeepSeek V3 and R1 still hold the top two positions, their lead is eroding. After DeepSeek V3's knowledge work score dropped 7.1 points to just 75.5 points, its gap over third-place Claude Sonnet 4.6 narrowed from 5 points last week to 2.3 points. DeepSeek R1's 7-point drop in stability is particularly concerning, a worrying signal for a model known for its reasoning prowess.

GPT-o3 disappointed this week: its overall score is only 65.7 points, which drops it out of the first tier. Its knowledge work score fell from 82.4 to 70.3 points, below even sixth-ranked Qwen Max (71.8 points).

Developer Selection Recommendations

1. Top Choice for Programming Tasks: Gemini 2.5 Pro (90.7 points) and Claude Sonnet 4.6 (88.5 points) perform best in the programming dimension and remain relatively stable.

2. Knowledge-Intensive Applications: Use Claude Opus 4.6 (91.0 points) or Claude Sonnet 4.6 (89.8 points), which lead the knowledge work dimension and degraded only minimally this week.

3. Overall Value: DeepSeek V3 remains a decent choice, but monitor its upcoming updates closely so that any further decline does not reach production environments.

4. Pitfall Alert: GPT-o3 and Qwen Max are not currently recommended for production environments, as their significant performance drops may cause user experience issues.

This week's results are a reminder that AI model performance does not increase monotonically. Regular evaluation and dynamic model selection are necessary to maintain application quality.
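One way to picture "dynamic selection" is a rule that picks the highest-scoring model among those that did not regress too sharply in the latest run. The sketch below is illustrative only: the `WeeklyScore` type, the `pick_model` function, and the 5-point `max_drop` default are hypothetical, and the data is limited to the models for which this report gives both a knowledge work score and a weekly change.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class WeeklyScore:
    model: str
    score: float  # this week's knowledge work score
    delta: float  # change versus last week, in points


# Knowledge work figures quoted in this report.
scores = [
    WeeklyScore("Claude Opus 4.6", 91.0, -1.7),
    WeeklyScore("DeepSeek V3", 75.5, -7.1),
    WeeklyScore("Qwen Max", 71.8, -9.8),
    WeeklyScore("GPT-o3", 70.3, -12.1),
]


def pick_model(candidates: Sequence[WeeklyScore],
               max_drop: float = 5.0) -> Optional[str]:
    """Pick the highest-scoring model whose weekly drop is at most max_drop.

    Returns None when every candidate regressed by more than max_drop,
    so the caller can keep its current model instead of switching.
    """
    stable = [c for c in candidates if c.delta >= -max_drop]
    if not stable:
        return None
    return max(stable, key=lambda c: c.score).model


choice = pick_model(scores)
print(choice)
```

Rerunning such a rule after each weekly evaluation would let an application switch models only when a candidate is both strong and stable. A real deployment would likely weigh cost, latency, and other dimensions as well.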

Data Source: YZ Index | Run #20
