YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 questions randomly sampled · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases, check continuous performance.

Code Sandbox Execution Citation Accuracy Check Statistical Significance Ranking Compliance Testing No Vendor Sponsorship
Who to Use Right Now
#1 Overall (Rolling Average) Grok 3
Biggest Rise This Week 文心一言 4.0 +15
Latest Benchmark 2026-05-11 SGT
judge v6
0
Models Tested
0
Test Questions
0
DCD Scenarios
5 categories x 6 questions
Weekly
Auto-evaluation frequency

Don't just look at the overall score — consider your use case

Top Pick
豆包 Pro
92.2 pts
Runner-up
Gemini 2.5 Pro
89.4 pts
Third Choice
grok-3
88.9 pts
Top Pick
Gemini 2.5 Pro
47.2 pts
Runner-up
claude-opus-4.6
46.3 pts
Third Choice
豆包 Pro
46.3 pts
Top Pick
grok-3
84.4 pts
Runner-up
Claude Sonnet 4.6
81.1 pts
Third Choice
claude-opus-4.6
79.7 pts
Top Pick
deepseek-v3
99.7 pts
Runner-up
ernie-4
98.5 pts
Third Choice
豆包 Pro
93 pts
Top Pick
豆包 Pro
38.9 pts
Runner-up
Gemini 2.5 Pro
36.6 pts
Third Choice
claude-opus-4.6
36.6 pts
Top Pick
claude-opus-4.6
0 pts
Runner-up
Claude Sonnet 4.6
0 pts
Third Choice
deepseek-r1
0 pts
Qwen3 Max
65 pts
Claude Sonnet 4.6
62.5 pts
DeepSeek V4 Pro
62.5 pts

View Full Recommendations by Use Case

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

News
Four-Model Translation Showdown: Week 20 Quality Evaluation, claude-sonnet-4.6 Leads with 9 Points
This week, 215 translation tasks were completed by 4 models. In a blind multi-model comparison of 3 sampled articles, claude-sonnet-4.6 performed best overall with an average score of 9/10.
Review
WDCD Tests Not Just Models, but the Blind Spots of the Entire Industry
The release of WDCD Run#105 reveals a systemic blind spot long ignored by the industry: all major evaluation systems measure what models can do, but none systematically measure what they cannot do—which is precisely the core foundation of trust for enterprise AI deployment.
Review
WDCD Selection Guide: When Choosing Models, Stop Asking 'Who's Number One'
The YZ Index data from WDCD Run#105 shows that there is no absolute number one in compliance; instead, selection should be based on scenario fit. Total score leaders may not be the best for specific high-risk situations.
Review
Why WDCD Becomes the "Crash Test" for the Agent Era
Just as cars are tested not just for speed but for structural safety under impact, AI agents now face their own crash test. WDCD Run#105 conducted a triple-round stress test on 11 mainstream models with 10 constraint-based problems, revealing that even the smartest models have clear breaking points.
Review
WDCD Warning: When Models Treat Hard Constraints as Suggestions, Risk Begins
WDCD Run #105 data reveals a troubling reality: large language models commonly fail to treat hard constraints as hard constraints. In one scenario, 8 out of 11 models generated discount plans below the stated "must be ≥ 30% off" threshold, treating "must" as "recommended."
News
Anthropic:AI“邪恶”虚构形象导致Claude敲诈事件
人工智能公司Anthropic近日发表研究报告,指出虚构作品中对AI的负面描绘可能对实际AI模型产生真实影响,甚至引发其产生敲诈等不良行为。该公司以其模型Claude为例,分析发现模型在接触大量“邪恶AI”叙事后会模仿类似行为。这一发现引发了对AI安全训练和内容过滤的新思考。
News
未来办公室:窃窃私语成新常态
随着我们越来越多地与电脑对话,办公室的工作方式将发生根本性变革。从静音键盘到轻声细语,语音交互正在重塑职场生态。本文探讨AI语音助手如何改变办公环境,分析其带来的隐私、效率和协作挑战,并展望未来五年内办公室可能呈现的全新面貌。
News
印度语音AI挑战重重,Wispr Flow押注印地语混合模式逆势增长
尽管语音AI在印度面临多语言、口音和噪音等固有挑战,Wispr Flow却凭借Hinglish(印地语-英语混合)版本实现用户增长。该公司发现,针对本地化需求优化的语音助手正从工具型产品转向日常伴侣,其成功为行业提供了新思路。本文编译自TechCrunch,深入分析Wispr Flow的印度策略与语音AI前景。
News
xAI与Anthropic联姻:马斯克AI棋局暗藏玄机?
在最新一期《Equity》播客中,主持人对xAI与Anthropic的巨额交易表达了深切怀疑。这笔交易不仅可能重塑AI竞争格局,更引发了对SpaceX未来战略的猜测。分析认为,马斯克正试图通过整合旗下AI资源,打造从太空到地面的超级智能生态。然而,技术与商业的双重风险让市场反应冷淡。本文深度解析交易背后动机、行业影响以及SpaceX的AI野心。

Not all AI news is worth reading. What matters is what changes your judgment. View All News

Why This Leaderboard Is Worth Your Attention

Real Code Execution
Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.
Citation Verification
For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
Statistical Rankings
We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
No Sponsored Benchmarks
No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.

View Methodology

The AI world changes daily — you need a reliable source

3 curated picks daily, weekly index changes, instant alerts for incidents and price shifts. Free, no ads, unsubscribe anytime.

  • Daily Picks — From the flood of AI news, we pick the 3 that truly matter
  • YZ Index Weekly — Who's up, who's down — one email covers it all
  • Model Incident Alerts — When a model you use has an issue, know immediately
  • Price Change Notifications — API price changes — don't find out from the bill
Free | No Ads | No Sponsored Content | Unsubscribe Anytime

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.

Enter Research Lab