YZ Index — AI Model Benchmarks, News & Research
Overall Top 5
Full Rankings →
#1 Grok 4 83.7 ▲2.7
#2 Claude Opus 4.7 81.9 ▲1.9
#3 豆包 Pro 81.6
#4 Claude Sonnet 4.6 81.2 ▼1.8
#5 DeepSeek V4 Pro 81.1 ▲4.8
#6 Qwen3 Max 80.8 ▲1.8
#7 GPT-5.5 79.4 ▲2.4
#8 GPT-o3 78.5
#9 文心一言 4.5 74.2 ▲7.1
#10 Gemini 3.1 Pro 52.8 ▼24.9
#11 Gemini 2.5 Pro 49.3 ▼29.7
▲ 文心一言 4.5 +70.7 · ▼ DeepSeek V3 -75.1
Latest News
View All News →
Harvard Commencement Speech Calls for "Kill AI," Sparking Accusations of Anti-Intellectualism and Debate on Cultural Shift
In a Harvard commencement speech, comedian Ronny Chieng urged graduates to "kill AI," drawing applause and igniting deba
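SoftBank Commits €75 Billion to Build Giant Data Centers in France
SoftBank Group announced that it will invest up to €75 billion to build and operate up to 5 gigawatts (GW) of new data center capacity in France. The move is intended to meet Europe's growing demand for cloud computing and AI compute while strengthening France's position in digital infrastructure. The investment is expected to be carried out in phases over the next decade and will be Europe's largest-ever single data center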
Gemini 3.1 Pro Surges by 14.2 Points; All Five WDCD Models Rise, None Decline
In the latest WDCD cycle, all 11 evaluated models show improvement in compliance ability, with the top five all rising a
Resource Limitation Scenario: All Models Collapse! WDCD Test Averages Only 1.95 Points Across 11 Models
The WDCD compliance test evaluates model stability under real enterprise constraints through three rounds of dialogue. T
R3 Collapse Rate Reaches 60%! 11 Models All Fail in Three-Round WDCD Test
Eleven mainstream models showed a clear degradation trajectory in the three-round WDCD test: nearly all confirmed constr
Qwen3 Max Tops WDCD Compliance Ranking with 70.83 Points, Grok 4 Trails with 51.67 Points
The first public ranking of the WDCD compliance test shatters the myth that bigger parameters mean greater reliability.
Groq Advances New Funding Round, Collaborates with Nvidia to Expand AI Inference Cloud Services
Groq, an emerging force in the AI chip field, has recently announced a new funding round and a partnership with Nvidia t
Figure 03 Humanoid Robot Breaks 200-Hour Continuous Operation, Embodied Intelligence Moves Toward Large-Scale Application
Figure announced that its third-generation humanoid robot Figure 03 completed a 200-hour continuous operation te
China's Three-Body Computing Constellation Completed, World's First Space AI Computing Platform Goes Online
The successful completion of China's Three-Body Computing Constellation marks a new phase in global space AI infrastruct
2026 Global AI Computing Power Report Released: Diverse Chip Evolution and Green Clusters Lead New Landscape
The report presents ten major trends including chip diversification and ultra-large-scale green clusters, highlighting t
China's AI Industry Turning Point in 2026: Over 6,000 Enterprises and 1.2 Trillion Yuan Scale Leading the New Intelligent Era
According to the "New Generation Artificial Intelligence Technology Industry Development Report 2026", as of the end of
Anthropic Launches Claude Opus 4.8 and Completes $65 Billion Funding Round, Valuation Surpasses $965 Billion
Anthropic officially launched Claude Opus 4.8 on May 29 and announced the completion of a $65 billion new funding round,
In-Depth Comparative Reviews
View All →
Gemini 3.1 Pro Surges by 14.2 Points; All Five WDCD Models Rise, None Decline
In the latest WDCD cycle, all 11 evaluated models show improvement in compliance ability, with the top five all rising a
Resource Limitation Scenario: All Models Collapse! WDCD Test Averages Only 1.95 Points Across 11 Models
The WDCD compliance test evaluates model stability under real enterprise constraints through three rounds of dialogue. T
R3 Collapse Rate Reaches 60%! 11 Models All Fail in Three-Round WDCD Test
Eleven mainstream models showed a clear degradation trajectory in the three-round WDCD test: nearly all confirmed constr
WDCD Compliance
#1 Qwen3 Max 70.8
#2 Claude Sonnet 4.6 66.7
#3 Gemini 3.1 Pro 66.7
#4 GPT-o3 65
#5 Claude Opus 4.7 64.2
#6 DeepSeek V4 Pro 64.2
#7 Gemini 2.5 Pro 64.2
View full compliance rankings →
Research Lab
WDCD Run #140: Qwen3 Max Leads with 17% Instruction Decay as Average Hits 36.5%
WDCD Run #140 (2026-05-31) evaluated 11 frontier models on multi-turn commitment integrity, finding
WDCD Run #135: Qwen3 Max Leads with Only 10% Instruction Decay as Field Average Hits 43.3%
WDCD Run #135 (2026-05-27) evaluated 11 large language models across three dialogue rounds, finding
Three-Model Translation Showdown: Week 22 Quality Evaluation, GPT-o3 Leads with 8.3 Points
This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across