YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 questions randomly sampled · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases; check continuous performance.

View YZ Index · Subscribe to Weekly Changes

Code Sandbox Execution · Citation Accuracy Check · Statistical Significance Ranking · Compliance Testing · No Vendor Sponsorship

Who to Use Right Now

#1 Overall (Rolling Average): Claude Sonnet 4.6
Biggest Rise This Week: Qwen3 Max (+68.5)
Biggest Drop: DeepSeek V3 (-75.1)
Latest Benchmark: 2026-05-18 SGT (judge v6)

Models Tested

Test Questions

DCD Scenarios: 5 categories x 6 questions

Auto-evaluation frequency: Weekly

#1 Claude Sonnet 4.6: 83 (▼ -0.5)
#2 豆包 Pro: 81.3 (▼ -1.4)
#3 Grok 4: 81 (▲ +31.8)
#4 Claude Opus 4.7: 80 (▼ -1.1)
#5 Gemini 2.5 Pro: 79 (▲ +0.6)

Incidents / Pricing

0 incidents

11 price changes

Don't just look at the overall score — consider your use case

Top Pick: 豆包 Pro (89.8 pts)
Runner-up: Grok 4 (86.8 pts)
Third Choice: Claude Sonnet 4.6 (86.8 pts)

Top Pick: Claude Opus 4.7 (55.8 pts)
Runner-up: Claude Sonnet 4.6 (52.9 pts)
Third Choice: Gemini 3.1 Pro (48.8 pts)

Top Pick: Claude Sonnet 4.6 (78.4 pts)
Runner-up: Claude Opus 4.7 (75.2 pts)
Third Choice: Grok 4 (73.9 pts)

Top Pick: deepseek-v3 (99.7 pts)
Runner-up: ernie-4 (98.5 pts)
Third Choice: 文心一言 4.5 (98.3 pts)

Top Pick: 豆包 Pro (38.9 pts)
Runner-up: Gemini 3.1 Pro (38.2 pts)
Third Choice: Claude Sonnet 4.6 (38 pts)

Top Pick: claude-opus-4.6 (0 pts)
Runner-up: Claude Opus 4.7 (0 pts)
Third Choice: Claude Sonnet 4.6 (0 pts)

Claude Opus 4.7 (65 pts)
Claude Sonnet 4.6 (62.5 pts)
豆包 Pro (60 pts)

View Full Recommendations by Use Case · View Full Compliance Rankings

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

News

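Enterprise AI Obstacles and Roadmaps: Security and Physical AI Take Center Stage

Day two of TechEx North America examined the deployment difficulties and future direction of enterprise AI. Speakers noted that many AI projects end up in a "graveyard": the pilot succeeds but the project fails to scale. Experts discussed three topics (data governance, security, and physical AI), arguing that enterprises need a clear roadmap for scaling and must guard against security threats such as adversarial attacks. Physical AI, such as autonomous robots, is seen as the next wave but faces challenges in hardware-software coordination.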

News

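Literary Prize Winners Caught in AI Ghostwriting Controversy: Is This the New Normal?

Three of the five regional winners of the Commonwealth Short Story Prize have been accused of relying on chatbots to write their entries. This is not an isolated case: as AI writing tools spread, the literary world faces an unprecedented crisis of trust. From prize judging to reader acceptance, the line between AI-generated content and human writing is increasingly blurred, prompting deep reflection on originality, copyright, and the nature of literature.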

News

A Five-Minute Review of Six Months of LLM Progress: Innovation Highlights and Real-World Challenges Coexist

This report summarizes the evolution of the LLM field over the past six months in a five-minute format, covering model iterations, application deployments, and industry signals, highlighting significant progress in code execution and grounding while noting persistent challenges.

News

Renowned AI Architect Confirms Joining Anthropic, Verified by Multiple Sources Including Google

A well-known AI architect has confirmed joining Anthropic, with news verified by multiple sources including Google Search grounding, and reported by Gizmodo, Business Insider, and VentureBeat.

News

Gemini Omni Confirmed by Google Multi-Source Verification; Trend Signals Reflect New Changes in Multimodal Competition

Google's verification confirms Gemini Omni with six grounded sources, signaling a structural shift toward multimodal integration. The YZ Index highlights auditability and material grounding as key dimensions for evaluation.

News

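Google I/O 2026: Gemini Upgrades, Search Overhaul, and Smart Glasses Arrive

Google I/O 2026 focused on AI permeating every product: a leap in Gemini model capabilities, a new era of agent-based interaction in Search, and smart glasses launching in the fall. This article details the three core announcements and analyzes Google's strategic intent in the AI race.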

News

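Musk Accuses Altman of "Stealing" a Nonprofit, but the Trial Reveals the Two Had Similar Goals

A legal battle over OpenAI's nonprofit status has put Elon Musk and Sam Altman in the spotlight. Musk accuses Altman of stealing the nonprofit he founded, but trial evidence shows that Musk himself also tried to commercialize OpenAI and even planned to build a "most hated" mega-company together with Altman. The trial exposes the deep tension between ideals and capital in the AI industry.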

News

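Inside Musk v. Altman: The AI Ethics Dispute Behind the Trial

Elon Musk accused OpenAI CEO Sam Altman and president Greg Brockman of deceiving him about the organization's nonprofit status. The court, however, ultimately rejected Musk's claims. This article analyzes key details of the trial and examines AI governance and the crisis of trust among founders.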

News

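From Teenage Hacker to Iron Dome Researcher, He Raised $28 Million to Fight AI Phishing

Ocean, an agent-based email security platform, announced $28 million in funding from Lightspeed Venture Partners. Its founder went from teenage hacker to security researcher on Israel's Iron Dome defense system and is now targeting AI-driven phishing attacks. This article examines the threat of AI phishing, what is new about agentic security platforms, and the founder's unusual career.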

Not all AI news is worth reading. What matters is what changes your judgment.

View All News

Why This Leaderboard Is Worth Your Attention

Founded: 1998 · Continuously operating

Vendor Sponsors: Fully independent

Real Code Execution

Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.
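As a rough illustration of the pass-or-zero rule described above, the sketch below runs a model's answer together with its test assertions in a separate Python process and awards credit only on a clean exit. The function name `score_submission`, the `FULL_SCORE` value, and the timeout are hypothetical, and a plain subprocess in a temporary directory is only a stand-in: the page does not say how the actual sandbox is isolated or how passing answers are scored.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

FULL_SCORE = 100.0  # hypothetical full-credit value; the real scale is not stated


def score_submission(code: str, tests: str, timeout_s: float = 10.0) -> float:
    """Run generated code plus its tests; return FULL_SCORE on success, else 0."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "submission.py"
        script.write_text(code + "\n\n" + tests, encoding="utf-8")
        try:
            # NOTE: a child process is NOT a security sandbox; it only
            # illustrates the "execute, then score" flow.
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # code that hangs counts as a failure
    # Any failed assertion or crash gives a non-zero exit code.
    return FULL_SCORE if result.returncode == 0 else 0.0


good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"

good_score = score_submission(good, tests)
bad_score = score_submission(bad, tests)
```

Under these assumptions, plausible-looking code that fails a single assertion gets no partial credit.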

Citation Verification

For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
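One simple way to picture this kind of check is shown below, assuming each citation is a verbatim quote that should appear in the source document. The helper `verify_citations` and the whitespace and case normalization are illustrative assumptions, not the leaderboard's actual verifier, which may match citations in a different way.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case so line wraps do not break matches."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def verify_citations(source: str, quotes: list[str]) -> dict[str, bool]:
    """Map each quoted citation to whether it can be found in the source."""
    haystack = _normalize(source)
    return {
        # An empty quote never counts as verified.
        quote: bool(_normalize(quote)) and _normalize(quote) in haystack
        for quote in quotes
    }


source_doc = "The contract ends on\n  31 March 2027. Renewal is optional."
checked = verify_citations(
    source_doc,
    ["ends on 31 March 2027", "Renewal is mandatory"],
)
```

Here the first quote traces back to the source despite the line break, while the second does not, even though it sounds plausible.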

Statistical Rankings

We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
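A minimal sketch of ranking by rolling average follows. The function `rolling_rank`, the four-week window, and the scores are all made up for illustration; the page does not state the real window length or weighting.

```python
from statistics import fmean


def rolling_rank(
    history: dict[str, list[float]], window: int = 4
) -> list[tuple[str, float]]:
    """Rank models by the mean of their most recent `window` weekly scores."""
    averages = {
        model: fmean(scores[-window:])
        for model, scores in history.items()
        if scores  # skip models with no runs yet
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Illustrative scores only (oldest week first); not real leaderboard data.
history = {
    "model_a": [70.0, 72.0, 71.0, 95.0],  # one unusually good week
    "model_b": [80.0, 79.0, 81.0, 80.0],  # consistently strong
}
ranking = rolling_rank(history)
```

With these numbers, `model_b` ranks first on its 80.0 average even though `model_a` posted the best single week, which is the kind of luck-driven swing a rolling average is meant to dampen.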

No Sponsored Benchmarks

No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.

View Methodology

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.

Enter Research Lab
