YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 questions randomly sampled · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases; check sustained performance.

Code Sandbox Execution · Citation Accuracy Check · Statistical Significance Ranking · No Vendor Sponsorship
Who to Use Right Now
#1 Overall (Rolling Average): Grok 3
Biggest Rise This Week: 文心一言 4.0 (+15)
Latest Benchmark: 2026-04-27 SGT, judge v6

Start with the overall ranking, then drill into the dimension you care about.

The full leaderboard shows not just who's leading, but how stable that lead is.
View Full Leaderboard

Who's Up, Who's Down

One-time spikes don't count. We care about whether sustained performance has shifted.

Biggest change this week: 文心一言 4.0 rose 15 pts.
View Full Change Report
Biggest Gain: 文心一言 4.0 (+15)
Incident Reports: 2 incidents this week
Pricing Changes: 0 updates

Don't just look at the overall score — consider your use case

Top Pick: 豆包 Pro, 92.2 pts
Runner-up: Gemini 2.5 Pro, 89.4 pts
Third Choice: Grok 3, 88.9 pts

Top Pick: Gemini 2.5 Pro, 47.2 pts
Runner-up: Claude Opus 4.6, 46.3 pts
Third Choice: 豆包 Pro, 46.3 pts

Top Pick: Grok 3, 84.4 pts
Runner-up: Claude Sonnet 4.6, 81.1 pts
Third Choice: Claude Opus 4.6, 79.7 pts

Top Pick: DeepSeek V3, 99.7 pts
Runner-up: 文心一言 4.0, 98.5 pts
Third Choice: 豆包 Pro, 93 pts

Top Pick: 豆包 Pro, 38.9 pts
Runner-up: Gemini 2.5 Pro, 36.6 pts
Third Choice: Claude Opus 4.6, 36.6 pts

View Full Recommendations by Use Case

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

News
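Startup's new tool makes LLM debugging effortless
San Francisco startup Goodfire has released a new tool called Silico that lets researchers and engineers look inside an AI model during training and adjust its parameters, the settings that determine how the model behaves. This gives model makers unprecedentedly fine-grained control and changes earlier assumptions about how AI technology is built. Goodfire claims Silico can significantly improve model interpretability and reliability.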
News
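Salesforce lets customers steer its AI roadmap: crowdsourcing enterprise needs
Salesforce is adopting an innovative product development strategy: letting customers steer its AI roadmap. The company reasons that if one enterprise customer faces a problem, other customers probably have similar needs. By setting up customer advisory boards, collecting feedback, and prioritizing the most frequent requests, Salesforce is shifting AI feature development from internal decision-making to a crowdsourced model, so it can respond to market changes faster and keep its products relevant.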
News
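Stripe launches Link digital wallet; AI agents can pay autonomously too
Stripe's newly released Link digital wallet lets users link bank cards, bank accounts, and subscriptions, and it also allows AI agents to make payments on a user's behalf securely through an approval flow. The feature opens new possibilities for automated e-commerce and AI-driven services and is expected to accelerate the adoption of AI agents in financial transactions.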
News
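OpenAI launches advanced security mode to protect high-risk accounts
OpenAI has announced an advanced security mode for ChatGPT, Codex, and other accounts, aimed at defending high-risk users against phishing attacks. The feature strengthens account security through multi-factor authentication and behavioral analysis and is especially suited to frequently targeted groups such as journalists and activists. Industry analysts see the move as a sign of how AI services are evolving on privacy and security, though it may also prompt debate over the balance between user experience and security.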
News
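Musk admits under oath that xAI used OpenAI models for training
Testifying under oath in court, Elon Musk admitted that his AI company xAI used OpenAI's models for training. He argued that this is common practice among AI labs, which use competitors' models to improve their own technology. The remarks have sparked broad discussion about competition and the boundaries of intellectual property in the AI industry.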
News
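Google's Gemini AI assistant is coming to millions of vehicles
Google has announced that, starting in May, it will gradually roll out the Gemini AI assistant to cars with Google built-in, replacing the existing Google Assistant. The upgrade aims to bring more advanced, more natural conversational AI to the driving experience. Coming shortly after General Motors announced that it would integrate Gemini, the move signals intensifying competition among in-car AI assistants.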

Not all AI news is worth reading. What matters is what changes your judgment.
View All News

Why This Leaderboard Is Worth Your Attention

Not because we're loud, but because our methods are open, rules are fixed, and results are traceable.

Real Code Execution
Code that merely looks right isn't enough. We run it in a sandbox, and if it fails, it scores zero.
Citation Verification
For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
Statistical Rankings
We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
No Sponsored Benchmarks
No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.
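
For readers who want to see how pass-or-zero scoring and rolling-average ranking can fit together, here is a minimal Python sketch. It is an illustration under stated assumptions, not our production pipeline: the names `score_code_answer` and `RollingRanking`, the window size, the 100-point scale, and the models and scores in the example are all hypothetical.

```python
from collections import deque
from statistics import mean


def score_code_answer(tests_passed: bool, full_marks: float = 100.0) -> float:
    """All-or-nothing scoring: code that fails in the sandbox earns zero.

    The 100-point scale is an assumption made for this sketch.
    """
    return full_marks if tests_passed else 0.0


class RollingRanking:
    """Rank models by the mean of their most recent `window` weekly scores."""

    def __init__(self, window: int = 4) -> None:
        self.window = window  # hypothetical window size, in weeks
        self.history: dict[str, deque] = {}

    def add_week(self, scores: dict[str, float]) -> None:
        """Record one week of scores; the oldest week drops out automatically."""
        for model, score in scores.items():
            self.history.setdefault(model, deque(maxlen=self.window)).append(score)

    def ranking(self) -> list[tuple[str, float]]:
        """Return (model, rolling average) pairs, best first."""
        averages = {model: mean(weeks) for model, weeks in self.history.items()}
        return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up data: model_b spikes in the latest week but does not take the lead.
board = RollingRanking(window=3)
board.add_week({"model_a": 80.0, "model_b": 70.0})
board.add_week({"model_a": 82.0, "model_b": 72.0})
board.add_week({"model_a": 81.0, "model_b": 95.0})
print(board.ranking())  # [('model_a', 81.0), ('model_b', 79.0)]
```

In this example model_b has the best single week (95), yet model_a still ranks first on the three-week average, which is the kind of luck-driven swing a rolling average is meant to dampen.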

View Methodology

The AI world changes daily — you need a reliable source

3 curated picks daily, weekly index changes, instant alerts for incidents and price shifts. Free, no ads, unsubscribe anytime.

  • Daily Picks — From the flood of AI news, we pick the 3 that truly matter
  • YZ Index Weekly — Who's up, who's down — one email covers it all
  • Model Incident Alerts — When a model you use has an issue, know immediately
  • Price Change Notifications — API price changes — don't find out from the bill
Free | No Ads | No Sponsored Content | Unsubscribe Anytime

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.

Enter Research Lab