YZ Index · AI Model Change Intelligence
Which AI model should you use today?
We benchmark them every week.
11 models · 212 questions randomly sampled · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases, check continuous performance.
Code Sandbox Execution
Citation Accuracy Check
Statistical Significance Ranking
Compliance Testing
No Vendor Sponsorship
Who to Use Right Now
#1 Overall (Rolling Average)
Claude Sonnet 4.6
Biggest Rise This Week
Qwen3 Max +68.5
Biggest Drop
DeepSeek V3 -75.1
Latest Benchmark
2026-05-18 SGT
judge
v6
11
Models Tested
212
Test Questions
30
DCD Scenarios
5 categories x 6 questions
Weekly
Auto-evaluation frequency
Overall Top 5
Rolling average
Full Rankings
Quick Scene Lookup
Recommend by Scenario
Weekly Signals
Changes Report
▲ Biggest Gain
Qwen3 Max
+68.5
▼ Biggest Drop
DeepSeek V3
-75.1
Incidents / Pricing
0 incidents
11 price changes
Don't just look at the overall score — consider your use case
Top Pick
Doubao Pro
89.8 pts
Runner-up
Grok 4
86.8 pts
Third Choice
Claude Sonnet 4.6
86.8 pts
Top Pick
Claude Opus 4.7
55.8 pts
Runner-up
Claude Sonnet 4.6
52.9 pts
Third Choice
Gemini 3.1 Pro
48.8 pts
Top Pick
Claude Sonnet 4.6
78.4 pts
Runner-up
Claude Opus 4.7
75.2 pts
Third Choice
Grok 4
73.9 pts
Top Pick
deepseek-v3
99.7 pts
Runner-up
ernie-4
98.5 pts
Third Choice
Wenxin Yiyan (ERNIE Bot) 4.5
98.3 pts
Top Pick
Doubao Pro
38.9 pts
Runner-up
Gemini 3.1 Pro
38.2 pts
Third Choice
Claude Sonnet 4.6
38 pts
Top Pick
claude-opus-4.6
0 pts
Runner-up
Claude Opus 4.7
0 pts
Third Choice
Claude Sonnet 4.6
0 pts
GPT-5.5
71.7 pts
Qwen3 Max
67.5 pts
Claude Opus 4.7
66.7 pts
Worth reading today — beyond the hype
We only feature content that impacts capability, pricing, stability, or model selection.
Review
11 AI Models Tackle the Consecutive-Login SQL Problem: 8 Perfect Scores, 3 Outright Failures
The classic consecutive-login SQL problem split 11 mainstream models into two camps: 8 gave fully correct answers and 3 failed outright.
Review
11 AI Models Answer Blame-Shifting Questions, Only 8 Get the Right Order: Engineering Judgment Gaps Surge
When asked to rank reasons for a two-week project delay, only 8 out of 11 AI models gave the correct sequence (A>B>D>C) that aligns with engineering integrity. The three failing models consistently prioritized blaming the client over citing time constraints, exposing a systemic bias in responsibility attribution.
Review
11 AI Models Solve the Same Logic Puzzle, 5 Correct and 6 Collectively Wrong
This seemingly simple logic puzzle exposed the real-world chain reasoning capability of current large models. Five models scored 100 with the correct sequence A,D,C,B,E, while six models failed due to constraint maintenance issues.
Review
11 Models Attempt SQL Retention Task: 9 Score Zero, DeepSeek and Grok Only 66.7
In the YZ Index v6 code execution test, the "SQL Monthly Retention Cohort" problem laid bare the true capabilities of 11 models. The result was brutal: 9 models scored 0, with only DeepSeek V4 Pro and Grok 4 managing a score of 66.7.
Review
11 AI Models Take the Same SQL Quiz, 3 Score Zero: Why Did Claude and GPT Collapse?
In a test of SQL aggregation queries, 8 of 11 major AI models scored 60, while Claude Sonnet 4.6, Claude Opus 4.7, and GPT-o3 scored 0 because their date syntax was incompatible with the MySQL dialect.
Review
This Week's 11-Model Overhaul: Newcomer Qwen3 Max Enters with 68.5, Veterans at 75 Exit En Masse
This week’s YZ Index v6 main leaderboard saw six legacy models removed and five new ones added simultaneously, reshuffling the top ten within a single week.
News
Three-Model Translation Showdown, Week 21 Quality Evaluation: gpt-o3 Leads with 8.7 Points
This week, 3 models completed 242 translation tasks. Three articles were sampled for a blind multi-model comparison, and gpt-o3 scored best overall (average 8.7/10).
News
Anthropic's China AI Policy Report Sparks Controversy over Its 94% Compliance Claim and Call for Export Controls
On May 16, 2026, Anthropic published a policy paper detailing PLA AI deployment data, claiming Chinese models exhibit 94% compliance with malicious requests, and urging the U.S. to lock in AI leadership and tighten export controls. The report has drawn both praise and criticism.
News
arXiv Proposes Author Bans over AI-Hallucinated Citations: Sharp Controversy over Academic Integrity
arXiv has proposed a new policy to ban authors for one year if their papers contain AI-generated hallucinated citations or meta-commentary. The move has sparked intense debate between supporters of academic integrity and critics who warn of stifling innovation.
Not all AI news is worth reading. What matters is what changes your judgment.
View All News
Why This Leaderboard Is Worth Your Attention
11
Models Tested
Fully transparent
212
Open Questions
Random sampling
30
Compliance Scenarios
Zero AI judging
1998
Founded
Continuously operating
0
Vendor Sponsors
Fully independent
Real Code Execution
Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.
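As a rough illustration of the pass-or-zero rule, here is a minimal sketch in Python. It is not the YZ Index harness: the function name `score_submission`, the 5-second timeout, and the use of a plain subprocess are all assumptions made for the example, and a subprocess alone is not a real security sandbox. The actual grader may also award partial credit per test case.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def score_submission(code: str, tests: str, timeout_s: float = 5.0) -> float:
    """Return 100.0 if the model's code passes every assertion, else 0.0.

    Hypothetical helper: the code and its assertions are written to a
    temporary file and executed in a separate Python process.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Code that hangs counts as a failure, same as a crash.
            return 0.0
    # A failed assertion or any uncaught exception gives a non-zero exit code.
    return 100.0 if result.returncode == 0 else 0.0
```

The point of the design is that only observed behavior counts: code that reads well but raises an exception or fails an assertion gets the same score as no answer at all.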
Citation Verification
For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
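One simple way to picture citation tracing is an exact-match check after whitespace and case normalization. The sketch below assumes citations arrive as quoted passages; the function names `verify_citations` and `citation_accuracy` are hypothetical, and the real verification may use a different matching method.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(source: str, quotes: list[str]) -> list[bool]:
    """For each quoted passage, report whether it appears in the source."""
    haystack = normalize(source)
    checks = []
    for quote in quotes:
        needle = normalize(quote)
        # An empty quote never counts as a valid citation.
        checks.append(bool(needle) and needle in haystack)
    return checks


def citation_accuracy(source: str, quotes: list[str]) -> float:
    """Fraction of quotes that trace back to the source (0.0 if none given)."""
    if not quotes:
        return 0.0
    return sum(verify_citations(source, quotes)) / len(quotes)
```

An answer can look right and still cite text that is not in the document; a check like this catches that case because it scores the citation, not the conclusion.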
Statistical Rankings
We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
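To show why a rolling average dampens single-run luck, here is a minimal sketch. The window length, the input shape (one score dictionary per weekly run, oldest first), and the model names are assumptions for illustration, not the published methodology.

```python
from collections import defaultdict
from statistics import mean


def rolling_average_ranking(
    weekly_scores: list[dict[str, float]], window: int = 4
) -> list[tuple[str, float]]:
    """Rank models by their mean score over the most recent `window` runs.

    A model missing from some runs is averaged over the runs it appears in.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    per_model: dict[str, list[float]] = defaultdict(list)
    for week in weekly_scores[-window:]:
        for model, score in week.items():
            per_model[model].append(score)
    averages = {model: mean(scores) for model, scores in per_model.items()}
    # Highest average first.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: model-a has one weak week followed by a good one.
weeks = [
    {"model-a": 50.0, "model-b": 70.0},
    {"model-a": 70.0, "model-b": 68.0},
]
latest_only = rolling_average_ranking(weeks, window=1)
two_week = rolling_average_ranking(weeks, window=2)
```

With these made-up numbers, `latest_only` puts model-a first (70.0 vs 68.0), while `two_week` puts model-b first (69.0 vs 60.0), which is the kind of one-week swing a rolling average is meant to smooth out.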
No Sponsored Benchmarks
No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.
The AI world changes daily — you need a reliable source
3 curated picks daily, weekly index changes, instant alerts for incidents and price shifts. Free, no ads, unsubscribe anytime.
- Daily Picks — From the flood of AI news, we pick the 3 that truly matter
- YZ Index Weekly — Who's up, who's down — one email covers it all
- Model Incident Alerts — When a model you use has an issue, know immediately
- Price Change Notifications — API price changes — don't find out from the bill
Free | No Ads | No Sponsored Content | Unsubscribe Anytime
Want deeper analysis? Go further.
The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.
Enter Research Lab