YZ Index — AI Model Benchmarks, News & Research
Overall Top 5
Full Rankings →
#1 Grok 4 89.9 ▲11.5
#2 Claude Opus 4.7 89 ▲10.2
#3 Doubao (豆包) Pro 88.8 ▲10
#4 Claude Sonnet 4.6 87.2 ▲9.2
#5 Gemini 2.5 Pro 86.4 ▲7.4
#6 Qwen3 Max 86.2 ▲8.5
#7 Gemini 3.1 Pro 84.8 ▲7.7
#8 DeepSeek V4 Pro 83.3 ▲6.4
#9 GPT-o3 82.8 ▲6.9
#10 GPT-5.5 80.9 ▲2.7
#11 ERNIE Bot (文心一言) 4.5 76.9 ▲15.2
▲ Qwen3 Max +80.9 · ▼ DeepSeek V3 -75.1
Latest News
View All News →
Sandstone raises $30M to bring AI to in-house legal teams
Sandstone's Series A was led by Lightspeed Partners, with participation from Sequoia.
David Sinclair plans to test whole-body rejuvenation drugs in the XPrize competition
The outspoken longevity scientist David Sinclair has been predicting that one day, you’ll go to the doctor and get a pre
Learning to lead in a hybrid human-AI enterprise
As adoption of AI agents looks set to surge by as much as 300% in the next two years, leadership teams are carefully con
Alex Vindman Survived Trump’s Retaliation Machine. Now He’s Running for Senate
In 2019, Alex Vindman testified during President Trump’s first impeachment trial–a decision that ended his military care
Mercor’s Brendan Foody calls out Sequoia, accusing it of ‘dual-pricing’ valuation tricks
Sequoia is just one of the top firms that sell the same equity at two different prices.
Why Apple’s slow-and-steady AI bet is starting to look pretty smart
Can Apple's new AI glow-up put to bed accusations that it's losing an all-important industry race?
Apple’s WWDC AI demos looked more real after $250M false ad settlement
The vibe of Apple's 2026 WWDC keynote felt like a spouse proudly listing all the honey-do-list items tackled. One subtle
As OpenAI files for IPO, Sam Altman’s eye-scanning company is doing layoffs, report says
Tools for Humanity, Sam Altman's identity verification company, is reportedly struggling to generate revenue and will do
Apple bets cheaper AI will woo small developers
As AI experimentation grows more expensive, Apple is waiving cloud API costs for developers with fewer than 2 million fi
Apple plays catch-up at WWDC
Apple spent much of its WWDC keynote highlighting fixes, performance improvements, and long-requested features before un
Following Anthropic, OpenAI files confidentially for IPO
The filing comes a little more than a week after its main rival, Anthropic, also filed to go public, ramping up the race
OpenAI Confidentially Files for IPO on the Heels of SpaceX and Anthropic
The ChatGPT maker announced it has filed paperwork to go public, just a week after rival Anthropic took the same step.
Reviews
View All →
Smoke Daily: GPT-5.5 tops with 92.58 points, material constraint gap of 19 points decides the outcome
Smoke's latest data shows that code execution is no longer the dividing line, and material constraints have become the r
11 Models Answer the Same Blame-Shifting Problem: 8 Get A>B>D>C, 3 Score Zero Outright
11 mainstream models showed significant divergence on the same engineering judgment question: 8 models output A>B>D>C an
Binary Tree Serialization Test: 11 Models, 7 Perfect Scores, 4 Outright Zeros
In a strict binary tree serialization test requiring only code output, explicit null node markers, and stable results, 7
WDCD Compliance
#1 Claude Opus 4.7 70
#2 GPT-5.5 70
#3 GPT-o3 70
#4 Claude Sonnet 4.6 67.5
#5 Gemini 2.5 Pro 67.5
#6 Doubao (豆包) Pro 62.5
#7 Gemini 3.1 Pro 62.5
View full compliance rankings →
Research Lab
3 Major Models Translation Showdown: Week 24 Quality Evaluation, passthrough Leads with a Score of 9
This week, 2425 translation tasks were completed by 3 models.
WDCD Run #146: Average Instruction Decay Hits 24.7% Across 11 Models, Claude Opus 4.7 and GPT-5.5 Tie at Top
WDCD Run #146 (2026-06-03) tested 11 frontier models on multi-turn commitment integrity, recording a
3 Major Model Translation Showdown: Week 23 Quality Evaluation, gpt-o3 Leads with a Score of 9
This week, 270 translation tasks were completed by 3 models. Two samples were selected for multi-mod