YZ Index — AI Model Benchmarks, News & Research
Editor's Pick
AI Revelations
Anthropic Files for IPO, Marking AI's Shift from Hype to Enterprise Utility
In 2025, Anthropic filed an IPO prospectus with the SEC, marking the AI industry's transition from early hype to mature commercialization. The move is seen as a
2026-06-04 12:12
How some data center operators are tackling their water use problems
Hyperscalers have come under scrutiny for their impact on water quality and availability.
Is Silicon Valley ready to put robots in people’s homes? Hello Robot is.
The California startup released the fourth generation of its home assistance robot, Stretch.
Overall Top 5
Full Rankings →
#1 Gemini 2.5 Pro 79 ▲29.7
#2 Claude Opus 4.7 78.8 ▼3.1
#3 豆包 Pro 78.8 ▼2.8
#4 Grok 4 78.4 ▼5.3
#5 GPT-5.5 78.2 ▼1.2
#6 Claude Sonnet 4.6 78 ▼3.2
#7 Qwen3 Max 77.7 ▼3.1
#8 Gemini 3.1 Pro 77.1 ▲24.3
#9 DeepSeek V4 Pro 76.9 ▼4.2
#10 GPT-o3 75.9 ▼2.6
#11 文心一言 4.5 61.7 ▼12.5
▲ Qwen3 Max +66.5 · ▼ DeepSeek V3 -75.1
Latest News
View All News →
Apple touts $1.4 trillion in App Store billings and sales, 90% without a commission
Apple's App Store generated $1.4 trillion in sales, up from $1.3 trillion last year, with $149 billion in sales for digi
How some data center operators are tackling their water use problems
Hyperscalers have come under scrutiny for their impact on water quality and availability.
Is Silicon Valley ready to put robots in people’s homes? Hello Robot is.
The California startup released the fourth generation of its home assistance robot, Stretch.
The Download: AI-generated lawsuits and virtual power plants for data centers
This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going o
Alpha School’s Ritzy New York City Campus Costs $65,000 a Year—but Isn’t Actually a School
A homeschooling center in Manhattan is part of the company’s nationwide expansion. Internal documents reveal its strateg
Jeff Bezos Is Funding a Wild Hunt for the Brain’s ‘Core Algorithm’
With $500 million in funding and a reported $2.5 billion valuation, Flourish wants to reinvent AI by putting real neuron
How courts are coping with a flood of AI-generated lawsuits
Most days in her chambers, Judge Maritza Braswell, a federal magistrate judge in Colorado, sifts through stacks of docum
Quantum Computing Is Having Its Public Market Moment
Quantinuum, a quantum computing startup, is losing millions. Investors want in anyway.
AI Agents Become a Hot Topic in Tech: The Gap Between Excitement and Reality, from Multimodality to Enterprise Automation
Discussions around AI agents have surged on X (formerly Twitter), with participants including developers, investors, and
Alphabet Raises $85 Billion to Expand Google's AI Business, Setting a New Funding High
Alphabet recently announced a massive $85 billion financing round to expand its Google AI business, setting a company re
xAI Sued by UK MP Over Grok-Generated Sexualized Images, Sparking AI Content Safety Controversy
A British MP has filed a lawsuit against xAI, alleging that its chatbot Grok generated sexualized images, igniting inten
TSMC CEO Optimistic About AI Chip Demand as Semiconductor Industry Enters Strong Growth Cycle
TSMC's CEO publicly stated that AI chip demand remains robust, driving the company's performance growth and boosting the
Reviews
View All →
Smoke Quick Test: 文心一言 4.5 and Grok 4 Tie at 99.24, GPT-5.5's Execution Score Only 50
Smoke's quick test results today clearly show that the code execution dimension is nearly saturated. Ten out of eleven m
Grok 4 Surges 10.8 Points to Dominate, Qwen3 Max Plunges 10.8 Points – Major Shuffle in WDCD Cycle
Run #141 data shows that Grok 4 improved by 10.8 points in a single round, GPT-5.5 improved by 9.2 points, while Qwen3 M
WDCD Review Reveals: Resource Constraints Become the Achilles' Heel of 11 Models, Average Score Only 1.7
The most brutal finding of the WDCD compliance test is that resource constraints crippled all models, with an average sc
WDCD Compliance
#1 Claude Opus 4.7 70
#2 GPT-5.5 70
#3 GPT-o3 70
#4 Claude Sonnet 4.6 67.5
#5 Gemini 2.5 Pro 67.5
#6 豆包 Pro 62.5
#7 Gemini 3.1 Pro 62.5
View full compliance rankings →
Research Lab
WDCD Run #146: Average Instruction Decay Hits 24.7% Across 11 Models, Claude Opus 4.7 and GPT-5.5 Tie at Top
WDCD Run #146 (2026-06-03) tested 11 frontier models on multi-turn commitment integrity, recording a
Three-Model Translation Showdown: Week 23 Quality Evaluation, GPT-o3 Leads with a Score of 9
This week, 270 translation tasks were completed by 3 models. Two samples were selected for multi-mod
WDCD Run #140: Qwen3 Max Leads with 17% Instruction Decay as Average Hits 36.5%
WDCD Run #140 (2026-05-31) evaluated 11 frontier models on multi-turn commitment integrity, finding