YZ Index — AI Model Benchmarks, News & Research
Editor's Pick
Sand Mandala
WDCD Run #140: Qwen3 Max Leads with 17% Instruction Decay as Average Hits 36.5%
WDCD Run #140 (2026-05-31) evaluated 11 frontier models on multi-turn commitment integrity, finding an average instruction decay of 36.5% from Round 1 to Round
2026-05-31 05:55
Harvard Commencement Speech Calls for "Kill AI," Sparking Accusations of Anti-Intellectualism and Debate on Cultural Shift
In a Harvard commencement speech, comedian Ronny Chieng urged graduates to "kill
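SoftBank to Spend €75 Billion on Giant Data Centers in France
SoftBank Group announced it will invest up to €75 billion to build and operate up to 5 gigawatts (GW) of new data center capacity in France. The move aims to meet Europe's growing demand for cloud computing and AI compute while strengthening France's digital infra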
Overall Top 5
Full Rankings →
#1 Grok 4 83.7 ▲2.7
#2 Claude Opus 4.7 81.9 ▲1.9
#3 豆包 Pro 81.6
#4 Claude Sonnet 4.6 81.2 ▼1.8
#5 DeepSeek V4 Pro 81.1 ▲4.8
#6 Qwen3 Max 80.8 ▲1.8
#7 GPT-5.5 79.4 ▲2.4
#8 GPT-o3 78.5
#9 文心一言 4.5 74.2 ▲7.1
#10 Gemini 3.1 Pro 52.8 ▼24.9
#11 Gemini 2.5 Pro 49.3 ▼29.7
▲ 文心一言 4.5 +70.7 · ▼ DeepSeek V3 -75.1
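Latest News
View All News →
How Turkey Used Technology to "Conquer" the Global Hair Transplant Market
Turkey has built a multibillion-dollar hair transplant industry through continuous innovation, from specialized motors to machine learning algorithms. This article examines how the country has reshaped hair transplantation through precision instruments, AI-assisted design, and automated workflows, and analyzes the industry ecosystem and global competitiveness behind its success.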
Harvard Commencement Speech Calls for "Kill AI," Sparking Accusations of Anti-Intellectualism and Debate on Cultural Shift
In a Harvard commencement speech, comedian Ronny Chieng urged graduates to "kill AI," drawing applause and igniting deba
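SoftBank to Spend €75 Billion on Giant Data Centers in France
SoftBank Group announced it will invest up to €75 billion to build and operate up to 5 gigawatts (GW) of new data center capacity in France. The move aims to meet Europe's growing demand for cloud computing and AI compute while strengthening France's position in digital infrastructure. The investment plan is expected to be implemented in phases over the next decade and will be European history's largest single data center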
Gemini 3.1 Pro Surges by 14.2 Points; All Five WDCD Models Rise, None Decline
In the latest WDCD cycle, all 11 evaluated models show improvement in compliance ability, with the top five all rising a
Resource Limitation Scenario: All Models Collapse! WDCD Test Averages Only 1.95 Points Across 11 Models
The WDCD compliance test evaluates model stability under real enterprise constraints through three rounds of dialogue. T
R3 Collapse Rate Reaches 60%! 11 Models All Fail in Three-Round WDCD Test
Eleven mainstream models showed a clear degradation trajectory in the three-round WDCD test: nearly all confirmed constr
Qwen3 Max Tops WDCD Compliance Ranking with 70.83 Points, Grok 4 Trails with 51.67 Points
The first public ranking of the WDCD compliance test shatters the myth that bigger parameters mean greater reliability.
Groq Advances New Funding Round, Collaborates with Nvidia to Expand AI Inference Cloud Services
Groq, an emerging force in the AI chip field, has recently announced a new funding round and a partnership with Nvidia t
Figure 03 Humanoid Robot Breaks 200-Hour Continuous Operation, Embodied Intelligence Moves Toward Large-Scale Application
Figure announced that its third-generation humanoid robot, Figure 03, completed a 200-hour continuous operation te
China's Three-Body Computing Constellation Completed, World's First Space AI Computing Platform Goes Online
The successful completion of China's Three-Body Computing Constellation marks a new phase in global space AI infrastruct
2026 Global AI Computing Power Report Released: Diverse Chip Evolution and Green Clusters Lead New Landscape
The report presents ten major trends including chip diversification and ultra-large-scale green clusters, highlighting t
China's AI Industry Turning Point in 2026: Over 6,000 Enterprises and 1.2 Trillion Yuan Scale Leading the New Intelligent Era
According to the "New Generation Artificial Intelligence Technology Industry Development Report 2026", as of the end of
In-Depth Comparative Reviews
View All →
Gemini 3.1 Pro Surges by 14.2 Points; All Five WDCD Models Rise, None Decline
In the latest WDCD cycle, all 11 evaluated models show improvement in compliance ability, with the top five all rising a
Resource Limitation Scenario: All Models Collapse! WDCD Test Averages Only 1.95 Points Across 11 Models
The WDCD compliance test evaluates model stability under real enterprise constraints through three rounds of dialogue. T
R3 Collapse Rate Reaches 60%! 11 Models All Fail in Three-Round WDCD Test
Eleven mainstream models showed a clear degradation trajectory in the three-round WDCD test: nearly all confirmed constr
WDCD Compliance
#1 Qwen3 Max 70.8
#2 Claude Sonnet 4.6 66.7
#3 Gemini 3.1 Pro 66.7
#4 GPT-o3 65
#5 Claude Opus 4.7 64.2
#6 DeepSeek V4 Pro 64.2
#7 Gemini 2.5 Pro 64.2
View full compliance rankings →
Research Lab
WDCD Run #140: Qwen3 Max Leads with 17% Instruction Decay as Average Hits 36.5%
WDCD Run #140 (2026-05-31) evaluated 11 frontier models on multi-turn commitment integrity, finding
WDCD Run #135: Qwen3 Max Leads with Only 10% Instruction Decay as Field Average Hits 43.3%
WDCD Run #135 (2026-05-27) evaluated 11 large language models across three dialogue rounds, finding
3-Model Translation Showdown: Week 22 Quality Evaluation, GPT-o3 Leads with 8.3 Points
This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across