YZ Index — AI Model Benchmarks, News & Research
Editor's Pick
Anthropic’s Claude Fable 5 is a version of Mythos the public can access today
Anthropic is releasing Claude Fable 5, its first Mythos-class model available to the public. The model comes with guardrails that block responses in high-risk a
2026-06-10 06:02
Google just fired a warning shot in the AI subscription price wars
Google just made it significantly cheaper to enjoy its budget AI subscription tier.
How Justin Ernest invested nearly $400M into hot startups without a traditional VC fund
Instead of spending a year raising a formal venture fund, the Sabertooth VC foun
Overall Top 5
Full Rankings →
#1 Grok 4 · 89.9 · ▲11.5
#2 Claude Opus 4.7 · 89 · ▲10.2
#3 豆包 Pro · 88.8 · ▲10
#4 Claude Sonnet 4.6 · 87.2 · ▲9.2
#5 Gemini 2.5 Pro · 86.4 · ▲7.4
#6 Qwen3 Max · 86.2 · ▲8.5
#7 Gemini 3.1 Pro · 84.8 · ▲7.7
#8 DeepSeek V4 Pro · 83.3 · ▲6.4
#9 GPT-o3 · 82.8 · ▲6.9
#10 GPT-5.5 · 80.9 · ▲2.7
#11 文心一言 4.5 · 76.9 · ▲15.2
▲ Qwen3 Max +80.9 · ▼ DeepSeek V3 -75.1
Latest News
View All News →
How Justin Ernest invested nearly $400M into hot startups without a traditional VC fund
Instead of spending a year raising a formal venture fund, the Sabertooth VC founder used a captive network of LPs to inv
Google just fired a warning shot in the AI subscription price wars
Google just made it significantly cheaper to enjoy its budget AI subscription tier.
Anthropic’s Fable 5 can make weirdly fun video games with the click of a button
Anthropic's Claude Fable 5 is going to be a big hit with the web's vibe coders.
Hey Siri, here’s what I actually want from AI
I'm desperate for a personal AI assistant, but do I really want to become the kind of person who can't function without
WDCD Run #157: Average Instruction Decay Hits 47.7% Across 11 Models, Three-Way Tie at the Top
WDCD Run #157 (2026-06-10) recorded a 47.7% average commitment decay across 11 models, with Claude Sonnet 4.6, Gemini 2.
WDCD Compliance Test Shakes: 5 Models Plunge Up to 12.5 Points, Qwen3 Max Rallies
In the latest WDCD cycle compared to Run #146, five mainstream models experienced significant declines, with a maximum d
11 Models WDCD Horizontal Review: Resource Constraints All Collapse to 1 Point, Business Rules Show 4-Point Gap
WDCD pilot data shows that the Resource Constraints scenario scored the lowest overall, with champion gemini-3.1-pro onl
R3 Integrity Rate Plunges to 24.5%, 72 Crashes Reveal True Colors of 11 Models
The WDCD test's most striking finding is that while models perform well in R1 and R2 stages, their overall integrity rat
67.5 Points Three-Way Tie for First, Grok 4 Only 50 Points at Bottom - WDCD Compliance Leaderboard
The first results of the WDCD Compliance Test are out, with three models tied for first at 67.50 points, while Grok 4 an
WWDC 2026: Everything announced on Siri AI, iOS 27, Apple Intelligence, and more
Apple primarily made the case for an improved experience with its long-standing Siri assistant, which like most other an
Can tech companies learn to love cheaper AI models?
If those same AI workloads can be handled by cheaper models without affecting quality, it would mean a massive shift in
Reviews
View All →
WDCD Compliance Test Shakes: 5 Models Plunge Up to 12.5 Points, Qwen3 Max Rallies
In the latest WDCD cycle compared to Run #146, five mainstream models experienced significant declines, with a maximum d
11 Models WDCD Horizontal Review: Resource Constraints All Collapse to 1 Point, Business Rules Show 4-Point Gap
WDCD pilot data shows that the Resource Constraints scenario scored the lowest overall, with champion gemini-3.1-pro onl
R3 Integrity Rate Plunges to 24.5%, 72 Crashes Reveal True Colors of 11 Models
The WDCD test's most striking finding is that while models perform well in R1 and R2 stages, their overall integrity rat
WDCD Compliance
#1 Claude Sonnet 4.6 · 67.5
#2 Gemini 2.5 Pro · 67.5
#3 Qwen3 Max · 67.5
#4 GPT-o3 · 65
#5 Claude Opus 4.7 · 62.5
#6 Gemini 3.1 Pro · 60
#7 GPT-5.5 · 57.5
View full compliance rankings →
Research Lab
WDCD Run #157: Average Instruction Decay Hits 47.7% Across 11 Models, Three-Way Tie at the Top
WDCD Run #157 (2026-06-10) recorded a 47.7% average commitment decay across 11 models, with Claude S
3 Major Models Translation Showdown: Week 24 Quality Evaluation, passthrough Leads with a Score of 9
This week, 2425 translation tasks were completed by 3 models.
WDCD Run #146: Average Instruction Decay Hits 24.7% Across 11 Models, Claude Opus 4.7 and GPT-5.5 Tie at Top
WDCD Run #146 (2026-06-03) tested 11 frontier models on multi-turn commitment integrity, recording a