YZ Index — AI Model Benchmarks, News & Research
Overall Top 5
Full Rankings →
#1 · Gemini 2.5 Pro · 79 · ▲29.7
#2 · Claude Opus 4.7 · 78.8 · ▼3.1
#3 · 豆包 Pro · 78.8 · ▼2.8
#4 · Grok 4 · 78.4 · ▼5.3
#5 · GPT-5.5 · 78.2 · ▼1.2
#6 · Claude Sonnet 4.6 · 78 · ▼3.2
#7 · Qwen3 Max · 77.7 · ▼3.1
#8 · Gemini 3.1 Pro · 77.1 · ▲24.3
#9 · DeepSeek V4 Pro · 76.9 · ▼4.2
#10 · GPT-o3 · 75.9 · ▼2.6
#11 · 文心一言 4.5 · 61.7 · ▼12.5
▲ Qwen3 Max +66.5 · ▼ DeepSeek V3 -75.1
Latest News
View All News →
Startup Battlefield 200 applications officially close in 3 days
Applications for Startup Battlefield 200 officially close on June 8, 11:59 p.m. PT. Don't wait any longer. Secure your s
The ‘together tech’ wave might be the most intriguing startup bet of 2026
While the AI fundraising machine keeps breaking its own records, some founders are building in the other direction.
S&P 500 rejects SpaceX, also blocking entry for OpenAI and Anthropic
SpaceX won’t get easy access to billions of dollars from passive investors.
Google will pay SpaceX $920M per month for compute
The companies announced the deal on Friday, just one week ahead of SpaceX's historic IPO.
The most interesting startups right now want to get you off your phone
While the AI fundraising machine keeps breaking its own records, some founders are building in the other direction.
The token bill comes due: Inside the industry scramble to manage AI’s runaway costs
"The whole conversation shifted from tokenmaxxing and 'go fast' to 'we need guardrails, how do we control this?'"
Has Microsoft Lost Its Mojo (Again)?
Microsoft’s AI products aren’t selling and GitHub’s been plagued with troubles. WIRED spoke with VP Scott Hanselman abou
The Fitbit Air is a good wearable weighed down by a chatty AI "coach"
The Air succeeds as a minimalist, reliable fitness tracker, but Google's AI Health Coach feels unnecessary.
The Download: AI hacking beyond Mythos, and chatbots’ impact on our brains
This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going o
AirTrunk commits $30B to build 5GW of AI data centers in India
The Australian data center operator plans to set up 5GW of capacity in India.
The Meta hack shows there’s more to AI security than Mythos
On June 5, 404 Media reported that attackers had been using Meta’s AI customer support agent to steal Instagram accounts
Why Apple Might Put Cameras Into Its Next AirPods
From battery life to privacy, there are many hurdles to the idea taking off.
Reviews
View All →
9 Models Tie at 77.5 on Main Leaderboard: Full Marks on Code Execution, but Only 50 on Material Constraint
The results of the Smoke Lite evaluation on June 5, 2026, show that 9 out of 11 models tied at 77.5 on the main leaderbo
Smoke Quick Test: 文心一言 4.5 and Grok 4 Tie at 99.24, GPT-5.5 Scores Only 50 on Execution
Smoke's quick test results today clearly show that the code execution dimension is nearly saturated. Ten out of eleven m
Grok 4 Surges 10.8 Points to Dominate, Qwen3 Max Plunges 10.8 Points – Major Shuffle in WDCD Cycle
Run #141 data shows that Grok 4 improved by 10.8 points in a single round, GPT-5.5 improved by 9.2 points, while Qwen3 M
WDCD Compliance
#1 · Claude Opus 4.7 · 70
#2 · GPT-5.5 · 70
#3 · GPT-o3 · 70
#4 · Claude Sonnet 4.6 · 67.5
#5 · Gemini 2.5 Pro · 67.5
#6 · 豆包 Pro · 62.5
#7 · Gemini 3.1 Pro · 62.5
View full compliance rankings →
Research Lab
WDCD Run #146: Average Instruction Decay Hits 24.7% Across 11 Models, Claude Opus 4.7 and GPT-5.5 Tie at Top
WDCD Run #146 (2026-06-03) tested 11 frontier models on multi-turn commitment integrity, recording a
Three-Model Translation Showdown: Week 23 Quality Evaluation, GPT-o3 Leads with a Score of 9
This week, 270 translation tasks were completed by 3 models. Two samples were selected for multi-mod
WDCD Run #140: Qwen3 Max Leads with 17% Instruction Decay as Average Hits 36.5%
WDCD Run #140 (2026-05-31) evaluated 11 frontier models on multi-turn commitment integrity, finding