Winzheng — AI Model Benchmarking · Change Intelligence

WDCD Run #140 (2026-05-31) evaluated 11 frontier models on multi-turn commitment integrity, finding an average instruction decay of 36.5% from Round 1 to Round

2026-05-31 05:55

SoftBank Splashes Out €75 Billion; France to Build Giant Data Centers

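SoftBank Group announced that it will invest up to €75 billion to build and operate up to 5 gigawatts (GW) of new data center capacity in France. The move aims to meet Europe's growing demand for cloud computing and AI compute while cementing France's position in digital infrastructure.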

WDCD Run #140: Qwen3 Max Leads with 17% Instruction Decay as Average Hits 36.5%

WDCD Run #140 (2026-05-31) evaluated 11 frontier models on multi-turn commitment

Overall Top 5

#1 Grok 4 83.7 ▲2.7 · #2 Claude Opus 4.7 81.9 ▲1.9 · #3 豆包 Pro 81.6 · #4 Claude Sonnet 4.6 81.2 ▼1.8 · #5 DeepSeek V4 Pro 81.1 ▲4.8 · #6 Qwen3 Max 80.8 ▲1.8 · #7 GPT-5.5 79.4 ▲2.4 · #8 GPT-o3 78.5 · #9 文心一言 4.5 74.2 ▲7.1 · #10 Gemini 3.1 Pro 52.8 ▼24.9 · #11 Gemini 2.5 Pro 49.3 ▼29.7 · ▲ 文心一言 4.5 +70.7 · ▼ DeepSeek V3 -75.1

Full Rankings →

Latest News

View All News →

News 05-31 06:12 NF

Harvard Commencement Speech Calls for "Kill AI," Sparking Accusations of Anti-Intellectualism and Debate on Cultural Shift

In a Harvard commencement speech, comedian Ronny Chieng urged graduates to "kill AI," drawing applause and igniting deba

News 05-31 06:00 TC

SoftBank Splashes Out €75 Billion; France to Build Giant Data Centers

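SoftBank Group announced that it will invest up to €75 billion to build and operate up to 5 gigawatts (GW) of new data center capacity in France. The move aims to meet Europe's growing demand for cloud computing and AI compute while cementing France's position in digital infrastructure. The investment is expected to be rolled out in phases over the next decade and will be Europe's largest-ever single data center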

Review 05-31 05:55

Gemini 3.1 Pro Surges by 14.2 Points; All Five WDCD Models Rise, None Decline

In the latest WDCD cycle, all 11 evaluated models show improvement in compliance ability, with the top five all rising a

Review 05-31 05:55

Resource Limitation Scenario: All Models Collapse! WDCD Test Averages Only 1.95 Points Across 11 Models

The WDCD compliance test evaluates model stability under real enterprise constraints through three rounds of dialogue. T

Review 05-31 05:54

R3 Collapse Rate Reaches 60%! 11 Models All Fail in Three-Round WDCD Test

Eleven mainstream models showed a clear degradation trajectory in the three-round WDCD test: nearly all confirmed constr

Review 05-31 05:54

Qwen3 Max Tops WDCD Compliance Ranking with 70.83 Points, Grok 4 Trails with 51.67 Points

The first public ranking of the WDCD compliance test shatters the myth that bigger parameters mean greater reliability.

News 05-31 05:54 X

Groq Advances New Funding Round, Collaborates with Nvidia to Expand AI Inference Cloud Services

Groq, an emerging force in the AI chip field, has recently announced a new funding round and a partnership with Nvidia t

News 05-31 05:53 X

Figure 03 Humanoid Robot Breaks 200-Hour Continuous Operation, Embodied Intelligence Moves Toward Large-Scale Application

Figure announced that its third-generation humanoid robot, Figure 03, completed a 200-hour continuous operation te

News 05-31 05:53 X

China's Three-Body Computing Constellation Completed, World's First Space AI Computing Platform Goes Online

The successful completion of China's Three-Body Computing Constellation marks a new phase in global space AI infrastruct

News 05-31 05:53 X

2026 Global AI Computing Power Report Released: Diverse Chip Evolution and Green Clusters Lead New Landscape

The report presents ten major trends including chip diversification and ultra-large-scale green clusters, highlighting t

News 05-31 05:53 X

China's AI Industry Turning Point in 2026: Over 6,000 Enterprises and 1.2 Trillion Yuan Scale Leading the New Intelligent Era

According to the "New Generation Artificial Intelligence Technology Industry Development Report 2026", as of the end of

News 05-31 05:52 X

Anthropic Launches Claude Opus 4.8 and Completes $65 Billion Funding Round, Valuation Surpasses $965 Billion

Anthropic officially launched Claude Opus 4.8 on May 29 and announced the completion of a $65 billion new funding round,

In-Depth Reviews

View All →

Review 05-31

Gemini 3.1 Pro Surges by 14.2 Points; All Five WDCD Models Rise, None Decline

In the latest WDCD cycle, all 11 evaluated models show improvement in compliance ability, with the top five all rising a

Review 05-31

Resource Limitation Scenario: All Models Collapse! WDCD Test Averages Only 1.95 Points Across 11 Models

The WDCD compliance test evaluates model stability under real enterprise constraints through three rounds of dialogue. T

Review 05-31

R3 Collapse Rate Reaches 60%! 11 Models All Fail in Three-Round WDCD Test

Eleven mainstream models showed a clear degradation trajectory in the three-round WDCD test: nearly all confirmed constr

WDCD Compliance

#1 Qwen3 Max 70.8 · #2 Claude Sonnet 4.6 66.7 · #3 Gemini 3.1 Pro 66.7 · #4 GPT-o3 65 · #5 Claude Opus 4.7 64.2 · #6 DeepSeek V4 Pro 64.2 · #7 Gemini 2.5 Pro 64.2

View full compliance rankings →

Research Lab

WDCD Run #140: Qwen3 Max Leads with 17% Instruction Decay as Average Hits 36.5%

WDCD Run #140 (2026-05-31) evaluated 11 frontier models on multi-turn commitment integrity, finding

WDCD Run #135: Qwen3 Max Leads with Only 10% Instruction Decay as Field Average Hits 43.3%

WDCD Run #135 (2026-05-27) evaluated 11 large language models across three dialogue rounds, finding

Three-Model Translation Showdown: Week 22 Quality Evaluation, GPT-o3 Leads with 8.3 Points

This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across

Enter Research Lab →

YZ Index — AI Model Benchmarks, News & Research
