YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 questions randomly sampled · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases; check continuous performance.

Code Sandbox Execution · Citation Accuracy Check · Statistical Significance Ranking · Compliance Testing · No Vendor Sponsorship
Who to Use Right Now
#1 Overall (Rolling Average): Grok 3
Biggest Rise This Week: 文心一言 4.0 (ERNIE Bot 4.0), +15
Latest Benchmark: 2026-05-04 SGT (judge v6)
Models Tested: 11
Test Questions: 212
DCD Scenarios: 5 categories x 6 questions
Auto-evaluation frequency: Weekly

Don't just look at the overall score — consider your use case

Top Pick: 豆包 Pro (Doubao Pro), 92.2 pts
Runner-up: Gemini 2.5 Pro, 89.4 pts
Third Choice: grok-3, 88.9 pts

Top Pick: Gemini 2.5 Pro, 47.2 pts
Runner-up: claude-opus-4.6, 46.3 pts
Third Choice: 豆包 Pro (Doubao Pro), 46.3 pts

Top Pick: grok-3, 84.4 pts
Runner-up: Claude Sonnet 4.6, 81.1 pts
Third Choice: claude-opus-4.6, 79.7 pts

Top Pick: deepseek-v3, 99.7 pts
Runner-up: ernie-4, 98.5 pts
Third Choice: 豆包 Pro (Doubao Pro), 93 pts

Top Pick: 豆包 Pro (Doubao Pro), 38.9 pts
Runner-up: Gemini 2.5 Pro, 36.6 pts
Third Choice: claude-opus-4.6, 36.6 pts

Top Pick: claude-opus-4.6, 0 pts
Runner-up: Claude Sonnet 4.6, 0 pts
Third Choice: deepseek-r1, 0 pts

Qwen3 Max, 65 pts
Claude Sonnet 4.6, 62.5 pts
DeepSeek V4 Pro, 62.5 pts

View Full Recommendations by Use Case

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

News
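IVF Innovation and the Rise of Balcony Solar
This piece is compiled from MIT Technology Review's daily tech newsletter. On one hand, in vitro fertilization (IVF) has helped millions of babies be born over the past four decades, yet the process remains slow, painful, and expensive, and new technology is trying to change that. On the other hand, balcony solar, an emerging distributed energy solution, is spreading quickly among households worldwide thanks to its low barrier to entry and easy installation. Together, the two trends show how technology is reshaping daily life from both the medical and the energy side.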
News
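Spotify's AI DJ Adds Four Languages as Personalized Recommendations Evolve
Spotify recently announced that its AI DJ feature now officially supports French, German, Italian, and Brazilian Portuguese, extending the feature's global reach. The update builds on OpenAI's voice technology and can deliver music recommendations and commentary in a more natural tone. With multilingual support live, Spotify has taken a significant step in personalized listening, and the move has also prompted further discussion about how AI and the music industry interact. Compiled from TechCrunch.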
News
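Spotify Wants to Become the New Home for AI-Generated Personal Audio
Spotify is exploring a new direction: bringing AI-generated personal audio content onto its platform. Users can create podcasts with AI tools such as Codex or Claude Code and import them directly into Spotify, so that anyone can easily produce personalized audio. The move would enrich Spotify's content ecosystem and could also fundamentally change how audio content is created and distributed.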
News
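Musk Once Tried to Poach OpenAI's Founders to Build an AI Unit at Tesla
According to recent reports, Elon Musk once tried to poach OpenAI's founders to set up a standalone AI division inside Tesla, on the condition that he would have full control. The move reveals Musk's strong desire to dominate AI technology and his increasingly tense relationship with OpenAI. Analysts believe it may have been part of his plan to build his own AI empire.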
News
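Open-Source AI Demand Surges: Moonshot AI Raises $2 Billion at a $20 Billion Valuation
Chinese AI unicorn Moonshot AI (月之暗面) announced that it has closed a new $2 billion funding round, lifting its valuation to $20 billion. The round comes amid surging global demand for open-source AI. The company's annualized recurring revenue (ARR) passed $200 million in April, driven mainly by rapid growth in paid subscriptions and API usage. The round was led by investors including Sequoia China and Alibaba, and the funds will go toward scaling model training and building out the open-source ecosystem.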
News
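Balcony Solar Heats Up in the US: Plug In, Cut Emissions, Save on Power
Dozens of US states are considering legislation that would let residents install plug-in solar systems (so-called "balcony solar") without professional installation. These small photovoltaic arrays are already widespread in Europe and can significantly reduce electricity bills and carbon emissions. Supporters argue the systems could remove barriers to solar adoption in the US and let renters and apartment dwellers benefit from clean energy too. This piece covers the legislative developments, technical advantages, and potential challenges.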
News
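A Reggae Band's Battle With an AI Remix "Nightmare"
When a six-year-old Stick Figure song suddenly hit the charts, the band was thrilled at first. But the driver of the viral spread turned out to be unauthorized AI remixes. These low-quality, unlicensed tracks flooded streaming platforms and forced the band into a long fight to defend the integrity of its original music. Compiled from WIRED.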
News
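AI Coding Tools Backfire: Thousands of Apps Leak Corporate and Personal Data
AI-powered development platforms such as Lovable, Base44, Replit, and Netlify let anyone build a web app in seconds. Security researchers have found, however, that thousands of apps generated through this kind of "vibe coding" expose highly sensitive data, including database credentials, API keys, and user information, directly to the public internet, creating serious security risks.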
News
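The Future of IVF: Breakthroughs and Challenges
Forty-eight years ago, Louise Brown became the world's first IVF baby. Since then, technical advances have enabled the birth of millions of IVF babies, and the procedure has become safer and more effective. It is still far from perfect: success rates, ethical controversies, and high costs all remain to be addressed. This piece traces the evolution of IVF, looks ahead to frontier directions such as gene editing, AI assistance, and endometrial receptivity analysis, and discusses how to make the technology available to more families.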

Not all AI news is worth reading. What matters is what changes your judgment.

View All News

Why This Leaderboard Is Worth Your Attention

Real Code Execution
Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.
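As a rough illustration of the all-or-nothing rule above, here is a minimal sketch in Python. It is not the site's actual harness: the function name `score_submission`, the 100-point scale, the exact-stdout comparison, and the timeout are all assumptions, and a child interpreter started with `-I` is only a stand-in for a real sandbox (it provides no filesystem or network isolation).

```python
import subprocess
import sys


def score_submission(code: str, expected_stdout: str,
                     full_points: float = 100.0,
                     timeout_s: float = 5.0) -> float:
    """Run `code` in a child interpreter; award points only on a clean pass.

    Hypothetical scoring rule: any crash, timeout, or wrong output is zero.
    """
    try:
        result = subprocess.run(
            # -I runs Python in isolated mode (ignores env vars and user site).
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # hung or too slow counts as a failure
    if result.returncode != 0:
        return 0.0  # the program crashed or exited with an error
    passed = result.stdout.strip() == expected_stdout.strip()
    return full_points if passed else 0.0
```

Under these assumptions, code that merely looks plausible but prints the wrong answer scores the same as code that crashes.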
Citation Verification
For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
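One simple way to check that citations trace back to the source is to require every quoted passage to appear in the document. The sketch below assumes citations are verbatim quotes and that matching ignores case and whitespace; the names `normalize` and `citation_accuracy` are hypothetical, and the site's real checker may work differently (for example, by matching paraphrases).

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wraps do not break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def citation_accuracy(source_text: str, quotes: list[str]) -> float:
    """Return the fraction of quoted passages found verbatim in the source."""
    if not quotes:
        return 0.0  # an answer with no citations earns no citation credit
    haystack = normalize(source_text)
    hits = 0
    for quote in quotes:
        needle = normalize(quote)
        # Empty quotes never count as a match.
        if needle and needle in haystack:
            hits += 1
    return hits / len(quotes)
```

A quote that reads plausibly but does not occur in the source lowers the score, which is the property the check is meant to enforce.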
Statistical Rankings
We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
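To show why a rolling average dampens one-off spikes, here is a small sketch. The four-week window, the function name `rolling_rank`, and the model names are assumptions for illustration only; the page does not state the actual window length or weighting.

```python
from statistics import mean


def rolling_rank(weekly_scores: dict[str, list[float]],
                 window: int = 4) -> list[tuple[str, float]]:
    """Rank models by the mean of their most recent `window` weekly scores."""
    averages = {
        model: mean(scores[-window:])
        for model, scores in weekly_scores.items()
        if scores  # skip models with no results yet
    }
    # Highest rolling average first.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: model-a has one lucky week, model-b is steady.
history = {
    "model-a": [70.0, 72.0, 71.0, 95.0],
    "model-b": [80.0, 81.0, 79.0, 80.0],
}
ranking = rolling_rank(history)
```

In this made-up example, model-a posts the best single week (95) but averages 77 over the window, so the steadier model-b (average 80) still ranks first.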
No Sponsored Benchmarks
No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.

View Methodology

The AI world changes daily — you need a reliable source

3 curated picks daily, weekly index changes, instant alerts for incidents and price shifts. Free, no ads, unsubscribe anytime.

  • Daily Picks — From the flood of AI news, we pick the 3 that truly matter
  • YZ Index Weekly — Who's up, who's down — one email covers it all
  • Model Incident Alerts — When a model you use has an issue, know immediately
  • Price Change Notifications — API price changes — don't find out from the bill
Free | No Ads | No Sponsored Content | Unsubscribe Anytime

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.

Enter Research Lab