YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 questions randomly sampled · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases; check continuous performance.

Code Sandbox Execution · Citation Accuracy Check · Statistical Significance Ranking · Compliance Testing · No Vendor Sponsorship
Who to Use Right Now
#1 Overall (Rolling Average): Claude Sonnet 4.6
Biggest Rise This Week: Qwen3 Max +68.5
Biggest Drop: DeepSeek V3 -75.1
Latest Benchmark: 2026-05-18 SGT, judge v6
DCD Scenarios: 5 categories x 6 questions
Auto-evaluation frequency: Weekly

Don't just look at the overall score — consider your use case

Top Pick: 豆包 Pro, 89.8 pts
Runner-up: Grok 4, 86.8 pts
Third Choice: Claude Sonnet 4.6, 86.8 pts

Top Pick: Claude Opus 4.7, 55.8 pts
Runner-up: Claude Sonnet 4.6, 52.9 pts
Third Choice: Gemini 3.1 Pro, 48.8 pts

Top Pick: Claude Sonnet 4.6, 78.4 pts
Runner-up: Claude Opus 4.7, 75.2 pts
Third Choice: Grok 4, 73.9 pts

Top Pick: deepseek-v3, 99.7 pts
Runner-up: ernie-4, 98.5 pts
Third Choice: 文心一言 4.5, 98.3 pts

Top Pick: 豆包 Pro, 38.9 pts
Runner-up: Gemini 3.1 Pro, 38.2 pts
Third Choice: Claude Sonnet 4.6, 38 pts

Top Pick: claude-opus-4.6, 0 pts
Runner-up: Claude Opus 4.7, 0 pts
Third Choice: Claude Sonnet 4.6, 0 pts

Claude Opus 4.7, 65 pts
Claude Sonnet 4.6, 62.5 pts
豆包 Pro, 60 pts

View Full Recommendations by Use Case

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

News
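AI-Fabricated Quotations Made It Into a Book. Why Does the Author Keep Using AI?
Author Steven Rosenbaum's new book, The Future of Truth, contains multiple AI-generated "synthetic quotations" that look authentic but are fabricated. Despite discovering the errors, Rosenbaum says he will continue to use AI as a writing aid. The incident exposes the reliability crisis of generative AI in creative work and the conflicted mindset of human authors facing the technology's temptations. This article analyzes the industry dilemma and the ethical boundaries behind AI's fake citations.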
News
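Even If You Hate AI, You Can't Escape Google's AI Search
Google is building AI into search to deliver tailored answers, which greatly improves convenience but also steers users away from original content sources. Veteran WIRED writer Steven Levy argues that this seemingly seamless experience is hollowing out the web's content ecosystem and hurting creators. Users may dislike AI, but they cannot resist its efficiency, and they end up captives of AI search.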
News
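Hands-On with Google's AI Glasses: One Step Short of Perfect
A TechCrunch reporter tried Google's latest prototype Android XR glasses. Powered by Gemini, the device overlays real-time translation, navigation, and information prompts directly onto the wearer's field of view. It is light and natural to wear, with fluid interaction, and shows the considerable potential of augmented reality in everyday settings. Shortcomings remain in battery life, field-of-view width, and the content ecosystem. Google appears to have found the right direction, but a mature consumer product will take more time to refine.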
News
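The Future of Programming Is Here: Anthropic Uses Claude to Showcase a New Paradigm for AI Coding
At "Code with Claude," Anthropic's developer event in London, the company showcased its latest results in AI-assisted programming. Attendees were asked whether they had ever used AI to generate code, and the answer revealed an irreversible trend: whether we like it or not, AI is reshaping the foundations of software development. This article analyzes Claude's coding capabilities, the industry impact, and the technical challenges behind them.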
News
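China Uses AI to Map Its Nationwide Renewable Energy Grid, Drawing Global Attention
As AI's power consumption surges worldwide and grids come under strain, China has used AI to map its nationwide renewable energy grid, enabling intelligent dispatch and forecasting of clean energy. The breakthrough not only eases the impact of AI compute on the grid but also offers a Chinese approach to the global energy transition. Capacity prices on the US PJM grid have already surged tenfold, while China's AI-driven energy management is becoming key to resolving the tension.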
News
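OpenAI's Singapore AI Lab Launches as IMDA Updates Its AI Framework
OpenAI announced that it will set up its first applied AI lab outside the United States in Singapore, as part of a partnership with Singapore's Ministry of Digital Development and Information (MDDI). The initiative, named "OpenAI for Singapore," was announced at the ATx Summit with a commitment of more than S$300 million. The lab will focus on applied AI research. Meanwhile, Singapore's Infocomm Media Development Authority (IMDA) has updated the national AI governance framework to accelerate safe AI deployment, setting a new benchmark for global AI governance.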
News
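Google I/O: The Path of AI-Driven Science Is Undergoing a Shift
In the 2026 Google I/O keynote, DeepMind CEO Demis Hassabis declared that we are "standing in the foothills of the singularity," a claim that sparked heated debate. This article examines Google's latest moves in AI for Science, from the newest AlphaFold progress to breakthroughs in materials science and drug discovery, explores how AI is reshaping the research paradigm, and analyzes the opportunities and challenges involved.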
News
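Musk and Zuckerberg Join Forces to Persuade Trump to Scrap AI Executive Order
An AI executive order that had been scheduled for signing was canceled at the last minute by US President Trump, on the grounds of avoiding any weakening of America's competitive edge over China. According to people familiar with the matter, tech giants Musk and Zuckerberg lobbied actively behind the scenes, arguing that excessive regulation would hinder innovation. The episode highlights the deep influence of tech giants on US AI policy and has set off a new round of discussion about US-China AI competition.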
News
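The Gulf's AI Boom Hits a Subsea Cable Bottleneck
As supercomputing centers in the Middle East come online at an accelerating pace, AI's hunger for bandwidth is sharply amplifying the risk posed by subsea cable outages. Gulf states face a contradiction: soaring data demand on one side, an aging and fragile global subsea fiber network on the other. From ships dragging anchors to geopolitical friction, any single break could interrupt AI training for weeks. This article examines how the cable crisis is forcing the Gulf to redesign its internet infrastructure.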

Not all AI news is worth reading. What matters is what changes your judgment.

View All News

Why This Leaderboard Is Worth Your Attention

Real Code Execution
Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it scores zero.
Citation Verification
For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
Statistical Rankings
We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
No Sponsored Benchmarks
No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.
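
For readers who want to see the idea concretely, here is a minimal sketch of all-or-nothing code scoring combined with a rolling-average ranking. It is an illustration under assumptions, not our production pipeline: the four-week window, the 0 to 100 scale, and the names `code_score` and `rolling_rankings` are all hypothetical, and the sample data is made up.

```python
from collections import defaultdict, deque

WINDOW = 4  # assumed number of weekly runs kept in the rolling window


def code_score(tests_passed: bool) -> float:
    """All-or-nothing scoring: code that fails in the sandbox earns zero."""
    return 100.0 if tests_passed else 0.0


def rolling_rankings(weekly_scores, window=WINDOW):
    """Rank models by the mean of their most recent `window` weekly scores.

    weekly_scores: list of dicts, oldest week first, mapping model -> score.
    Returns a list of (model, average) pairs, best first.
    """
    recent = defaultdict(lambda: deque(maxlen=window))
    for week in weekly_scores:
        for model, score in week.items():
            recent[model].append(score)  # deque drops the oldest score itself
    averages = {m: sum(s) / len(s) for m, s in recent.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative data only, not real benchmark results.
weeks = [
    {"model-a": 80.0, "model-b": 70.0},
    {"model-a": 60.0, "model-b": 75.0},
    {"model-a": 85.0, "model-b": 72.0},
]
for model, avg in rolling_rankings(weeks):
    print(f"{model}: {avg:.1f}")
```

In this toy data, model-a has one weak week, yet the rolling average still reflects its overall level rather than a single run. Shrinking the window makes the ranking react faster but also makes it noisier.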

View Methodology

The AI world changes daily — you need a reliable source

3 curated picks daily, weekly index changes, instant alerts for incidents and price shifts. Free, no ads, unsubscribe anytime.

  • Daily Picks — From the flood of AI news, we pick the 3 that truly matter
  • YZ Index Weekly — Who's up, who's down — one email covers it all
  • Model Incident Alerts — When a model you use has an issue, know immediately
  • Price Change Notifications — API price changes — don't find out from the bill
Free | No Ads | No Sponsored Content | Unsubscribe Anytime

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.

Enter Research Lab