YZ Index Weekly Report: Collective Leap in Task Expression Capabilities, Claude Series Pioneers Material Constraint Track

This week, the YZ Index evaluation system captured a rare phenomenon: 10 out of 11 mainstream AI models showed synchronous improvements in the "task expression" (communication_raw) dimension, such large-scale unidirectional changes are extremely rare in previous evaluations. At the same time, Claude Opus 4.6 stands out, becoming the only model to achieve a breakthrough in the "material constraint" (grounding_raw) dimension.

Collective Evolution of Task Expression Capabilities

Data shows that Wenxin Yiyan 4.0, GPT-4o, GPT-o3, and Qwen Max four models all improved by 15 points in the task expression dimension, leading the field in gains. Following closely are Claude Sonnet 4.6, DeepSeek twin stars (R1/V3), 豆包 Pro, Gemini 2.5 Pro, and Grok 3, all recording a 10-point increase.

This synchronicity suggests two possibilities: one is that the evaluation system itself has adjusted the examination standards for task expression, and the other is that major vendors have coincidentally optimized the models' instruction understanding capabilities. From a technical perspective, the latter is more likely—with the maturity of RLHF (Reinforcement Learning from Human Feedback) technology and the expansion of instruction fine-tuning datasets, models are indeed rapidly progressing in understanding user intentions and accurately executing complex instructions.

Claude's Differentiated Breakthrough

While the majority of models focused on optimizing task expression, Claude Opus 4.6 chose a unique path: improving material constraint capability by 13.3 points. This capability examines whether the model can strictly follow given materials for reasoning and generation, without arbitrarily adding external information—this is precisely a key requirement in enterprise-level applications.

From the comprehensive ranking perspective, the Claude series (Opus 62.8 points, Sonnet 66.2 points) remains in the mid-tier position, but its focus on material constraints may indicate that Anthropic is building differentiated advantages for specific vertical scenarios (such as legal document processing and financial report analysis).

Developer Selection Recommendations

1. Code Development Scenarios: 豆包 Pro (96.1) and Gemini 2.5 Pro (96.1) tie for first in the code execution dimension, with Grok 3 (95.5) following closely. These three are preferred choices for code generation and debugging.

2. Knowledge-Intensive Applications: Despite overall low scores, 豆包 Pro (54.7) and Gemini 2.5 Pro (53.8) lead relatively in the knowledge synthesis dimension, suitable for building question-answering systems or knowledge base applications.

3. Compliance-Sensitive Scenarios: If your application involves strongly regulated fields such as finance or law, the advantage of Claude Opus 4.6 in material constraints is worth key consideration—it is less likely to "hallucinate" content beyond the original materials.

4. Cost-Effective Choice: DeepSeek V3 has a comprehensive score of 74.8, second only to 豆包 Pro, but its open-source nature and relatively low deployment costs make it an ideal choice for teams with limited budgets.

It is worth being vigilant that the GPT series (GPT-4o ranked 10th, GPT-o3 at the bottom) performed weakly in this round of evaluation, which may reflect that OpenAI, while pursuing general capabilities, has been surpassed by later entrants in certain specialized metrics. Developers should make decisions based on specific needs rather than brand aura when selecting models.


Data source: YZ Index (YZ Index) | Raw data