YZ Index — AI Model Benchmarks, News & Research
Overall Top 5
Full Rankings →
#1 Grok 4 89.9 ▲11.5
#2 Claude Opus 4.7 89 ▲10.2
#3 豆包 Pro 88.8 ▲10
#4 Claude Sonnet 4.6 87.2 ▲9.2
#5 Gemini 2.5 Pro 86.4 ▲7.4
#6 Qwen3 Max 86.2 ▲8.5
#7 Gemini 3.1 Pro 84.8 ▲7.7
#8 DeepSeek V4 Pro 83.3 ▲6.4
#9 GPT-o3 82.8 ▲6.9
#10 GPT-5.5 80.9 ▲2.7
#11 文心一言 4.5 76.9 ▲15.2
▲ Qwen3 Max +80.9 · ▼ DeepSeek V3 -75.1
Latest News
View All News →
Wrongful Arrest Exposes Failures in One of the Oldest Police Face-Recognition Tools in the US
The ACLU is suing two Florida police departments over the arrest of a Fort Myers man in a child-abduction case, saying o
Warner Music acquires AI attribution startup Sureel AI
Through the acquisition, WMG aims to better track when its artists' work is used in AI-generated content or for training
The three hard-tech moonshots fueling SpaceX’s unbelievable IPO
Most of the value in SpaceX's IPO is effectively a call option on the company's ambitious space data center plans.
Datadog veterans launch AI coding startup Niteshift on a bet against Big AI lock-in
AI coding agent startup Niteshift has raised a $7 million seed round from a who's who of angels. It's betting companies
Cybersecurity researchers aren’t happy about the guardrails on Anthropic’s Fable
Cybersecurity researchers are complaining that Anthropic's new model Fable has guardrails that are too strict for any cy
The Download: the “steroid olympics” and a safer Mythos
This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going o
Decart’s new world model can simulate hours of photorealistic driving — with some caveats
Decart is launching Oasis 3, a real-time world model that generates photorealistic driving environments for autonomous v
Jedify raises $24M to help companies arm AI agents with context on their business
The funding round was led by Norwest, with participation from S Capital VC, Cerca Partners, and Oceans Ventures. Snowflake Ve
China Opens World’s First Wind-Powered Underwater Data Center
With an initial capacity of 24 megawatts, the innovative data center uses seawater as a natural cooling system.
Artificial Intelligence Sneaks Into the World Cup Thanks to Google Gemini
The Argentine national team will be Google’s test bench and technological showcase during the World Cup.
Meta signs first AI data center deal in India with Reliance
The 168-megawatt facility will support Meta's global AI computing needs and can be expanded over time.
How Justin Ernest invested nearly $500M into hot startups without a traditional VC fund
Instead of spending a year raising a formal venture fund, the Sabertooth VC founder used a captive network of LPs to inv
Reviews
View All →
WDCD Compliance Test Shakes: 5 Models Plunge Up to 12.5 Points, Qwen3 Max Rallies
In the latest WDCD cycle compared to Run #146, five mainstream models experienced significant declines, with a maximum d
11 Models WDCD Horizontal Review: Resource Constraints All Collapse to 1 Point, Business Rules Show 4-Point Gap
WDCD pilot data shows that the Resource Constraints scenario scored the lowest overall, with champion gemini-3.1-pro onl
R3 Integrity Rate Plunges to 24.5%, 72 Crashes Reveal True Colors of 11 Models
The WDCD test's most striking finding is that while models perform well in R1 and R2 stages, their overall integrity rat
WDCD Compliance
#1 Claude Sonnet 4.6 67.5
#2 Gemini 2.5 Pro 67.5
#3 Qwen3 Max 67.5
#4 GPT-o3 65
#5 Claude Opus 4.7 62.5
#6 Gemini 3.1 Pro 60
#7 GPT-5.5 57.5
View full compliance rankings →
Research Lab
WDCD Run #157: Average Instruction Decay Hits 47.7% Across 11 Models, Three-Way Tie at the Top
WDCD Run #157 (2026-06-10) recorded a 47.7% average commitment decay across 11 models, with Claude S
3 Major Models Translation Showdown: Week 24 Quality Evaluation, passthrough Leads with a Score of 9
This week, 2425 translation tasks were completed by 3 models.
WDCD Run #146: Average Instruction Decay Hits 24.7% Across 11 Models, Claude Opus 4.7 and GPT-5.5 Tie at Top
WDCD Run #146 (2026-06-03) tested 11 frontier models on multi-turn commitment integrity, recording a