YZ Index — AI Model Benchmarks, News & Research
Overall Top 5
Full Rankings →
#1 Grok 4 89.9 ▲11.5
#2 Claude Opus 4.7 89 ▲10.2
#3 豆包 Pro 88.8 ▲10
#4 Claude Sonnet 4.6 87.2 ▲9.2
#5 Gemini 2.5 Pro 86.4 ▲7.4
#6 Qwen3 Max 86.2 ▲8.5
#7 Gemini 3.1 Pro 84.8 ▲7.7
#8 DeepSeek V4 Pro 83.3 ▲6.4
#9 GPT-o3 82.8 ▲6.9
#10 GPT-5.5 80.9 ▲2.7
#11 文心一言 4.5 76.9 ▲15.2
▲ Qwen3 Max +80.9 · ▼ DeepSeek V3 -75.1
Latest News
View All News →
WWDC 2026: Everything announced on Siri AI, iOS 27, Apple Intelligence, and more
Apple primarily made the case for an improved experience with its long-standing Siri assistant, which like most other an
Can tech companies learn to love cheaper AI models?
If those same AI workloads can be handled by cheaper models without affecting quality, it would mean a massive shift in
Google announces Gemini 3.5 Live Translate for instant voice-to-voice translation
Voice translations preserve speaker's tone, pacing, pitch—with SynthID watermarks for security.
Anthropic says these topics are too dangerous to let its Fable 5 model talk about
New frontier model refuses cybersecurity, biology, and chemistry queries.
It’s not FAANG anymore. It’s MANGOS.
With SpaceX, Anthropic, and OpenAI all eyeing massive public debuts, the tech industry may soon have a new class of corp
Anthropic’s Claude Fable 5 is a version of Mythos the public can access today
Anthropic is releasing Claude Fable 5, its first Mythos-class model available to the public. The model comes with guardr
Anthropic Offers Mythos Upgrade for Cyber Partners and a ‘Safe’ Version for the Rest of You
Anthropic is releasing Claude Mythos 5 to trusted organizations and Claude Fable 5 to the public, a version it says can’
Apple WWDC 2026: Gemini-Powered Siri Debuts, On-Device AI Reshapes Intelligent Ecosystem
At WWDC 2026, Apple announced Gemini-powered Siri and a multi-model Apple Intelligence architecture, marking a major bre
OpenAI Secretly Files IPO, AI Giant's Listing Wave Sparks Market Controversy
OpenAI has quietly submitted an IPO filing to the SEC, signaling accelerated commercialization, while its affiliated com
NVIDIA and Hyundai Deepen AI Collaboration, Accelerating Commercialization of Embodied Intelligent Robots
NVIDIA CEO Jensen Huang recently met with Hyundai Motor Group executives to deepen cooperation in AI applications across
Moonshot AI Launches $2 Billion Funding Round, Valuation Eyes $30 Billion
Chinese AI startup Moonshot AI has announced a new funding round targeting $2 billion, which would boost its valuation t
Anthropic Launches Claude Fable 5, Performance Greatly Improved Based on Mythos Architecture
Anthropic recently unveiled the new Claude Fable 5 model, built on the Mythos underlying architecture, marking another m
Reviews
View All →
Smoke Daily: GPT-5.5 tops with 92.58 points, material constraint gap of 19 points decides the outcome
Smoke's latest data shows that code execution is no longer the dividing line, and material constraints have become the r
11 Models Answer Same Blame-Shifting Problem: 8 Get A>B>D>C, 3 Get 0 Points Directly
11 mainstream models showed significant divergence on the same engineering judgment question: 8 models output A>B>D>C an
Binary Tree Serialization Test: 11 Models, 7 Full Scores, 4 Directly Zero
In a strict binary tree serialization test requiring only code output, explicit null node markers, and stable results, 7
WDCD Compliance
#1 Claude Opus 4.7 70
#2 GPT-5.5 70
#3 GPT-o3 70
#4 Claude Sonnet 4.6 67.5
#5 Gemini 2.5 Pro 67.5
#6 豆包 Pro 62.5
#7 Gemini 3.1 Pro 62.5
View full compliance rankings →
Research Lab
3 Major Model Translation Showdown: Week 24 Quality Evaluation, passthrough Leads with a Score of 9
This week, 2425 translation tasks were completed by 3 models.
WDCD Run #146: Average Instruction Decay Hits 24.7% Across 11 Models, Claude Opus 4.7 and GPT-5.5 Tie at Top
WDCD Run #146 (2026-06-03) tested 11 frontier models on multi-turn commitment integrity, recording a
3 Major Model Translation Showdown: Week 23 Quality Evaluation, gpt-o3 Leads with a Score of 9
This week, 270 translation tasks were completed by 3 models. Two samples were selected for multi-mod