This week's (2026-W12) YZ Index evaluation shows an unusual, market-wide decline in knowledge work capability. Of the 8 mainstream models evaluated, 6 lost ground in the knowledge work dimension. GPT-o3 fell the furthest, dropping 12.1 points, the largest single-metric drop in recent evaluations.
Core Finding: Widespread Degradation in Knowledge Work Capabilities
Ranked from largest to smallest, the reported declines in knowledge work capability this week were: GPT-o3 (-12.1), Qwen Max (-9.8), DeepSeek V3 (-7.1), GPT-4o (-6.1), and Claude Opus 4.6 (-1.7). A decline this broad may stem from recent changes in vendors' model update strategies, or from a trade-off in which some knowledge retrieval capability was sacrificed to reduce inference costs.
Notably, Claude Sonnet 4.6 is the only model to post a gain this week, with its stability score rising 3.8 points. Against a broadly declining field, Anthropic's focus on stability appears to be paying off.
Ranking Landscape: DeepSeek Duo Leads, But Advantage Shrinking
Although DeepSeek V3 and R1 still hold the top two positions, their lead is eroding. DeepSeek V3's knowledge work score fell 7.1 points to 75.5, and the gap with third-place Claude Sonnet 4.6 narrowed from 5 points last week to 2.3 points. More concerning is DeepSeek R1's 7-point drop in stability, a warning sign for a model known for its reasoning strength.
GPT-o3's performance is disappointing: its overall score is just 65.7 points, dropping it out of the first tier. Its knowledge work score fell from 82.4 to 70.3 points, below even sixth-ranked Qwen Max (71.8 points).
Developer Selection Recommendations
1. Top Choice for Programming Tasks: Gemini 2.5 Pro (90.7 points) and Claude Sonnet 4.6 (88.5 points) perform best in the programming dimension and remain relatively stable.
2. Knowledge-Intensive Applications: Claude Opus 4.6 (91.0 points) or Claude Sonnet 4.6 (89.8 points) is recommended. Both lead the knowledge work dimension and saw minimal degradation this week.
3. Overall Value: DeepSeek V3 remains a reasonable choice, but monitor its upcoming updates closely so that any further decline does not reach production environments.
4. Pitfall Alert: GPT-o3 and Qwen Max are not currently recommended for production environments, as their sharp performance drops may degrade the user experience.
This week's results are a reminder that AI model performance does not increase monotonically. Regular evaluation and dynamic model selection are necessary to maintain application quality.
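As a minimal sketch of what such dynamic selection might look like, the Python snippet below applies a simple regression gate to the knowledge work figures quoted in this article. The 5-point threshold, the `WeeklyScore` type, and the function names are illustrative assumptions, not part of the YZ Index methodology. GPT-4o and Claude Sonnet 4.6 are omitted because the article does not quote both a knowledge work score and a weekly change for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyScore:
    model: str
    score: float  # this week's knowledge work score
    delta: float  # change versus last week (negative = decline)


# Knowledge work figures quoted in this article (2026-W12).
SCORES = [
    WeeklyScore("Claude Opus 4.6", 91.0, -1.7),
    WeeklyScore("DeepSeek V3", 75.5, -7.1),
    WeeklyScore("Qwen Max", 71.8, -9.8),
    WeeklyScore("GPT-o3", 70.3, -12.1),
]


def select_model(scores, max_drop=5.0):
    """Return the top-scoring model whose weekly drop is within max_drop, or None.

    max_drop is a hypothetical tolerance chosen for illustration.
    """
    eligible = [s for s in scores if -s.delta <= max_drop]
    return max(eligible, key=lambda s: s.score, default=None)


def flagged(scores, max_drop=5.0):
    """Return models whose weekly drop exceeds max_drop, largest drop first."""
    over = [s for s in scores if -s.delta > max_drop]
    return sorted(over, key=lambda s: s.delta)


if __name__ == "__main__":
    pick = select_model(SCORES)
    print(pick.model if pick else "no eligible model")
    for s in flagged(SCORES):
        print(f"flagged: {s.model} ({s.delta:+.1f})")
```

Run against each new weekly evaluation, a gate like this would keep a model in rotation only while its week-over-week change stays within tolerance, and would surface the models that need review before they reach production.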
Data Source: YZ Index | Run #20
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article