YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 questions randomly sampled · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases, check continuous performance.

Code Sandbox Execution · Citation Accuracy Check · Statistical Significance Ranking · No Vendor Sponsorship
Who to Use Right Now
#1 Overall (Rolling Average): DeepSeek V3
Biggest Rise This Week: Claude Sonnet 4.6 (+5)
Biggest Drop: GPT-4o (-18.7)
Latest Benchmark: 2026-03-30 SGT (judge v6)

Who to Use Right Now

Start with the overall ranking, then drill into the dimension you care about.

The full leaderboard shows not just who's leading, but how stable that lead is.
View Full Leaderboard

Who's Up, Who's Down

One-time spikes don't count. We care about whether sustained performance has shifted.

The biggest movers this week are Claude Sonnet 4.6 (+5) and GPT-4o (-18.7). Changes exceeding 1 standard deviation are flagged as statistically significant.
View Full Change Report
Biggest Gain: Claude Sonnet 4.6 (+5)
Biggest Loss: GPT-4o (-18.7)
Incident Reports: 2 incidents this week
Pricing Changes: 0 updates
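
The one-standard-deviation flag mentioned in this section could be sketched roughly as follows. This is only an illustration, not the index's actual method: the page does not say which distribution the standard deviation is taken over, so the sketch assumes it is the model's own past week-over-week changes. The function name `is_significant` and the sample scores are hypothetical.

```python
from statistics import stdev


def is_significant(history, latest, threshold=1.0):
    """Flag the latest score if its change from the previous week exceeds
    `threshold` standard deviations of past week-over-week changes.

    Assumption (not stated by the source): the reference spread is the
    sample standard deviation of the model's own historical changes.
    """
    # Week-over-week changes across the historical scores.
    changes = [b - a for a, b in zip(history, history[1:])]
    if len(changes) < 2:
        # Too little history to estimate a spread; do not flag.
        return False
    sigma = stdev(changes)
    latest_change = latest - history[-1]
    if sigma == 0:
        # Perfectly flat history: any movement at all is unusual.
        return latest_change != 0
    return abs(latest_change) > threshold * sigma


# Hypothetical score history for one model (illustrative numbers only).
past_scores = [80.0, 81.0, 80.5, 81.5, 80.0]
print(is_significant(past_scores, 61.3))  # a drop of 18.7 points
print(is_significant(past_scores, 80.5))  # a rise of 0.5 points
```

Comparing against a model's own history means a normally volatile model needs a larger swing to be flagged than a normally steady one.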

Don't just look at the overall score — consider your use case

Top Pick: 豆包 Pro (90.6 pts)
Runner-up: DeepSeek V3 (90.5 pts)
Third Choice: Claude Sonnet 4.6 (88 pts)

Top Pick: Claude Opus 4.6 (48.8 pts)
Runner-up: Grok 3 (48.8 pts)
Third Choice: Claude Sonnet 4.6 (46 pts)

Top Pick: Grok 3 (79.2 pts)
Runner-up: DeepSeek R1 (79 pts)
Third Choice: DeepSeek V3 (78.3 pts)

Top Pick: DeepSeek V3 (91.1 pts)
Runner-up: 文心一言 4.0 (90.9 pts)
Third Choice: 豆包 Pro (87 pts)

Top Pick: Claude Sonnet 4.6 (54.3 pts)
Runner-up: 豆包 Pro (53.9 pts)
Third Choice: Claude Opus 4.6 (53.9 pts)

View Full Recommendations by Use Case

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

News
A Massive Power Outage Affects 5 Million in Tokyo! Fujitsu's AI Grid System Failure Sparks Global Debate on AI Safety
On a certain day in 2024, Tokyo faced an unprecedented power crisis after a cascading failure in Fujitsu's AI grid management system left more than 5 million people in darkness. The incident not only sent Fujitsu's stock price plummeting but also prompted a global reevaluation of AI system safety in critical infrastructure.
News
U.S. Senate Passes First AI Regulation Bill: High-Risk AI Systems Require Mandatory Audits; Compliance Costs for Tech Giants May Surge Tenfold
On December 19, 2024, the U.S. Senate officially passed the landmark National AI Safety Act, marking the world’s first comprehensive regulatory bill targeting AI systems. The act mandates rigorous audits and transparency requirements for high-risk AI systems, potentially reshaping the AI industry landscape globally.
News
OpenAI's Claim of GPT-7 Approaching AGI Sparks Intense Debate: Technological Breakthrough or Dangerous Hype?
OpenAI CEO Sam Altman's announcement about GPT-7 nearing AGI has sparked a heated debate in the tech world. Despite the buzz, the true capability of GPT-7 remains shrouded in mystery due to the lack of disclosed technical details.
News
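Anthropic: Claude Code Users Must Pay Extra to Use OpenClaw
Anthropic has announced that Claude Code subscribers will have to pay extra when using OpenClaw and other third-party tools. The change raises the overall cost of using the coding assistant and has drawn attention from the developer community. As competition among AI coding tools intensifies, the move reflects Anthropic's strategic adjustment in balancing innovation against profitability. Claude Code, a core Anthropic product, faces strong rivals such as GitHub Copilot; the extra charge may be intended to cover high compute costs while also driving ecosystem development.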
News
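Hackers Spread Leaked Claude Code Bundled with Malware
Hackers are circulating leaked source-code files for Anthropic's Claude AI model online and maliciously bundling them with trojans that infect the devices of anyone who downloads them. Separately, the FBI has warned that a hack of its wiretapping tools poses a national security threat, and attackers have stolen Cisco source code as part of an ongoing supply-chain attack. The incidents highlight the twin crises of AI security and supply-chain vulnerabilities, and experts are calling for stronger code protection and intelligence sharing.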
News
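Anthropic Is Hot on the Private Market; SpaceX IPO Could Shake Things Up
Glen Anderson, president of Rainmaker Securities, says the secondary market for private shares has never been more active: Anthropic has become the most sought-after name, OpenAI is gradually losing ground, and SpaceX's upcoming IPO could reshape the entire landscape. The trend reflects a shift in AI investment enthusiasm toward companies that emphasize safety and stability. Private-market valuations are soaring, but SpaceX's public listing may divert capital, affecting the liquidity and pricing of AI unicorns.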

Not all AI news is worth reading. What matters is what changes your judgment.
View All News

Why This Leaderboard Is Worth Your Attention

Not because we're loud, but because our methods are open, rules are fixed, and results are traceable.

Real Code Execution
Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.
Citation Verification
For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
Statistical Rankings
We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
No Sponsored Benchmarks
No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.
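
To illustrate the rolling-average idea behind Statistical Rankings, here is a minimal sketch. It is not the index's published method: the window length of 4 weeks, the function names `rolling_average` and `rank_models`, and the sample scores are all assumptions made for the example.

```python
def rolling_average(scores, window=4):
    """Mean of the most recent `window` weekly scores (fewer if history is short).

    Assumption: a simple unweighted mean; the source does not specify
    the window length or any weighting.
    """
    recent = scores[-window:]
    return sum(recent) / len(recent)


def rank_models(weekly_scores, window=4):
    """Rank models by rolling average, highest first.

    `weekly_scores` maps a model name to its list of weekly scores,
    oldest first. Models with no scores are skipped.
    """
    averages = {
        model: rolling_average(scores, window)
        for model, scores in weekly_scores.items()
        if scores
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: model_a has one lucky spike in the latest week.
weekly = {
    "model_a": [70, 72, 71, 95],
    "model_b": [80, 81, 79, 80],
}
for name, avg in rank_models(weekly):
    print(f"{name}: {avg:.1f}")
```

In this example model_a posts the highest single-week score, yet model_b ranks first on the rolling average, which is the kind of luck-driven fluctuation the approach is meant to dampen.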

View Methodology

The AI world changes daily — you need a reliable source

3 curated picks daily, weekly index changes, instant alerts for incidents and price shifts. Free, no ads, unsubscribe anytime.

  • Daily Picks — From the flood of AI news, we pick the 3 that truly matter
  • YZ Index Weekly — Who's up, who's down — one email covers it all
  • Model Incident Alerts — When a model you use has an issue, know immediately
  • Price Change Notifications — API price changes — don't find out from the bill
Free | No Ads | No Sponsored Content | Unsubscribe Anytime

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.

Enter Research Lab