YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 questions randomly sampled · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases; check continuous performance.

View YZ Index · Subscribe to Weekly Changes

Code Sandbox Execution · Citation Accuracy Check · Statistical Significance Ranking · Compliance Testing · No Vendor Sponsorship

Who to Use Right Now

#1 Overall (Rolling Average) Grok 3

Biggest Rise This Week 文心一言 4.0 +15

Latest Benchmark 2026-05-04 SGT

judge v6

Models Tested

Test Questions

DCD Scenarios

5 categories x 6 questions

Weekly

Auto-evaluation frequency

#1 Grok 3 86.9 ─
#2 豆包 Pro 86.4 ▲ +1.3
#3 Gemini 2.5 Pro 84.3 ▲ +3.5
#4 Claude Sonnet 4.6 84.1 ▲ +7.3
#5 Claude Opus 4.6 83.4 ▲ +3.9

Incidents / Pricing

2 incidents

0 price changes

Don't just look at the overall score — consider your use case

Top Pick: 豆包 Pro, 92.2 pts
Runner-up: Gemini 2.5 Pro, 89.4 pts
Third Choice: grok-3, 88.9 pts

Top Pick: Gemini 2.5 Pro, 47.2 pts
Runner-up: claude-opus-4.6, 46.3 pts
Third Choice: 豆包 Pro, 46.3 pts

Top Pick: grok-3, 84.4 pts
Runner-up: Claude Sonnet 4.6, 81.1 pts
Third Choice: claude-opus-4.6, 79.7 pts

Top Pick: deepseek-v3, 99.7 pts
Runner-up: ernie-4, 98.5 pts
Third Choice: 豆包 Pro, 93 pts

Top Pick: 豆包 Pro, 38.9 pts
Runner-up: Gemini 2.5 Pro, 36.6 pts
Third Choice: claude-opus-4.6, 36.6 pts

Top Pick: claude-opus-4.6, 0 pts
Runner-up: Claude Sonnet 4.6, 0 pts
Third Choice: deepseek-r1, 0 pts

Claude Opus 4.7: 67.5 pts
GPT-o3: 66.7 pts
Claude Sonnet 4.6: 63.3 pts

View Full Recommendations by Use Case · View Full Compliance Rankings

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

News

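Jensen Huang: AI Is Creating Jobs at Scale, Not Destroying Them

Responding to public worries that AI will replace human workers, Nvidia CEO Jensen Huang said in a recent interview that these anxieties are greatly exaggerated. He argues that AI is in fact creating a "massive" number of jobs, especially in AI development, deployment, and optimization. Drawing on TechCrunch's reporting, this article examines Huang's argument and explores the real picture of how AI relates to employment.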

News

WDCD Run #100: Average Instruction Decay Hits 39.1% Across 11 Models, Claude Opus 4.7 Leads

WDCD Run #100 (2026-05-03) tested 11 frontier models on multi-turn commitment integrity, recording an average instruction decay of 39.1% from Round 1 to Round 3. Claude Opus 4.7 took the top spot at 67.5 points with only 23% decay.

News

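OpenAI Ally Cerebras Pushes Toward a $26.6 Billion IPO

AI chipmaker Cerebras is preparing a high-profile IPO at a valuation that could reach $26.6 billion or more. A close partner of OpenAI, Cerebras holds a key position in AI compute infrastructure thanks to its distinctive large wafer-scale chip technology. The listing will test its business model, and it also reflects the intense competition and investor frenzy in the AI chip sector.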

News

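Greg Brockman Defends His $30 Billion OpenAI Stake: "Blood, Sweat, and Tears"

OpenAI co-founder and president Greg Brockman appeared in federal court on Monday and disclosed that he is one of the AI lab's largest individual shareholders. In his testimony he insisted that his stake, worth about $30 billion, was earned through "blood, sweat, and tears," responding to outside criticism that his compensation is excessive. The case has prompted broad discussion of equity allocation at AI companies and of founder commitments.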

News

AI Chip Startups Wayve and Rebellions Secure Massive Funding: AMD, Qualcomm, and Arm Back Wayve, Samsung-Backed Rebellions Raises $400 Million

AI chip startups Wayve and Rebellions have secured significant funding from major tech companies, reflecting the growing demand for advanced AI chips. This article analyzes the technical principles, impacts, and future trends from Winzheng's perspective.

News

FlexRule Releases AI Agent Governance Update: Enabling End-to-End Governance to Enhance AI Decision Reliability and Compliance

FlexRule has announced a new update to its decision platform that delivers end-to-end governance for AI Agents, aiming to make AI governance practical and address challenges in decision-making. The update emphasizes reliability and compliance in agentic systems.

News

Gary Marcus's Critique of Generative AI Sparks Debate: X Post Receives Thousands of Likes, Opinions Polarized

On May 3, 2026, prominent AI critic Gary Marcus posted a detailed thread on X outlining the reasons for the growing backlash against generative AI, citing negative impacts on education, deepfakes, misinformation, and environmental damage from data centers. The post quickly went viral, garnering thousands of likes and hundreds of replies and sharply dividing supporters and detractors.

News

Klaimee AI Officially Launches on Y Combinator: First Algerian Female Founder Introduces AI Agent Insurance, Highlighting Diversity in AI Entrepreneurship

Klaimee AI, founded by Ines Boutemadja, has launched on Y Combinator's Launch YC platform, offering insurance specifically for AI agents. Boutemadja is the first Algerian female founder in YC, underscoring the growing diversity in AI entrepreneurship.

News

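Unofficial "Notepad++ for Mac" Draws Protest from the Original Author

An unofficial macOS version of Notepad++, "vibe-coded" by an enthusiast, has stirred controversy in the developer community. The original author stated plainly: "Notepad++ has never released a macOS version." The software is accused of borrowing the name and branding of a well-known open-source project, and it has also been questioned over code quality and security. This article recounts the incident and analyzes the conflict between third-party modified versions and the rights of original authors in the open-source ecosystem.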

Not all AI news is worth reading. What matters is what changes your judgment.

View All News

Why This Leaderboard Is Worth Your Attention

1998

Founded

Continuously operating

Vendor Sponsors

Fully independent

Real Code Execution

Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.
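
The all-or-nothing rule described above can be sketched roughly as follows. This is a minimal illustration, not the leaderboard's actual harness: the function name `score_submission`, the 10-second timeout, and the use of a plain subprocess are all assumptions, and a bare subprocess is not a real security sandbox (a production setup would add container or VM isolation).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def score_submission(code: str, test_code: str, timeout_s: float = 10.0) -> int:
    """Return 1 if the code plus its tests exit cleanly, otherwise 0.

    Hypothetical pass/fail rule: any exception, failed assertion,
    or timeout counts as zero.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        # Model-generated code first, then the hidden tests.
        script.write_text(code + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return 0
        return 1 if result.returncode == 0 else 0


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5"
    print(score_submission("def add(a, b):\n    return a + b", tests))
    print(score_submission("def add(a, b):\n    return a - b", tests))
```

The second call looks like plausible code but fails its assertion, so under this rule it earns nothing.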

Citation Verification

For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
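
One simple way to check that citations trace back to the source is to test whether each quoted passage actually occurs in the document. The sketch below assumes exact-quote matching after whitespace and case normalization; the names `verify_citations` and `citation_accuracy` are hypothetical, and the real check may use fuzzier matching.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not cause misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(source: str, quotes: list[str]) -> dict[str, bool]:
    """Map each quoted citation to whether it appears in the source text."""
    haystack = normalize(source)
    results = {}
    for quote in quotes:
        needle = normalize(quote)
        # An empty quote is never counted as verified.
        results[quote] = bool(needle) and needle in haystack
    return results


def citation_accuracy(source: str, quotes: list[str]) -> float:
    """Fraction of citations found in the source (0.0 when none are given)."""
    if not quotes:
        return 0.0
    checks = verify_citations(source, quotes)
    return sum(checks.values()) / len(checks)


if __name__ == "__main__":
    doc = "The cache is flushed   every 30 seconds.\nWrites are batched."
    cited = ["flushed every 30 seconds", "writes are never batched"]
    print(verify_citations(doc, cited))
    print(citation_accuracy(doc, cited))
```

An answer can read as correct while citing a passage that is not in the document; a check like this catches that case regardless of how convincing the answer sounds.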

Statistical Rankings

We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
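
To illustrate why a rolling average damps single-run luck, here is a small sketch. The 4-run window, the model names, and the scores are made up for illustration; the leaderboard's actual window size and weighting are not stated here.

```python
from statistics import mean


def rolling_average(scores: list[float], window: int = 4) -> float:
    """Average of the most recent `window` runs (assumed window size)."""
    return mean(scores[-window:])


def rank_models(history: dict[str, list[float]], window: int = 4) -> list[tuple[str, float]]:
    """Rank models by rolling average, highest first; models with no runs are skipped."""
    averaged = {
        name: rolling_average(scores, window)
        for name, scores in history.items()
        if scores
    }
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical weekly scores, oldest first.
    history = {
        "model-a": [70.0, 72.0, 71.0, 95.0],  # one unusually good latest run
        "model-b": [80.0, 81.0, 79.0, 80.0],  # steady performer
    }
    for name, score in rank_models(history):
        print(name, score)
```

In this made-up data, model-a wins the latest run outright, yet model-b still ranks first on the rolling average because its results are consistently higher.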

No Sponsored Benchmarks

No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.

View Methodology

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.

Enter Research Lab
