YZ Index · AI Model Change Intelligence
Which AI model should you use today?
We benchmark them every week.
11 models · 212 randomly sampled questions · Real code execution · Citation verification · Rolling-average rankings · Don't trust press releases; check continuous performance.
Code Sandbox Execution
Citation Accuracy Check
Statistical Significance Ranking
Compliance Testing
No Vendor Sponsorship
Who to Use Right Now
#1 Overall (Rolling Average)
Grok 3
Biggest Rise This Week
文心一言 4.0 +15
Latest Benchmark
2026-05-04 SGT · judge v6
11
Models Tested
212
Test Questions
30
WDCD Scenarios
5 categories × 6 questions
Weekly
Auto-evaluation frequency
Overall Top 5 (Rolling Average) · Full Rankings · Quick Scene Lookup · Recommend by Scenario · Weekly Signals · Changes Report
Don't just look at the overall score; consider your use case.
Top Pick: 豆包 Pro (92.2 pts) · Runner-up: Gemini 2.5 Pro (89.4 pts) · Third Choice: grok-3 (88.9 pts)
Top Pick: Gemini 2.5 Pro (47.2 pts) · Runner-up: claude-opus-4.6 (46.3 pts) · Third Choice: 豆包 Pro (46.3 pts)
Top Pick: grok-3 (84.4 pts) · Runner-up: Claude Sonnet 4.6 (81.1 pts) · Third Choice: claude-opus-4.6 (79.7 pts)
Top Pick: deepseek-v3 (99.7 pts) · Runner-up: ernie-4 (98.5 pts) · Third Choice: 豆包 Pro (93 pts)
Top Pick: 豆包 Pro (38.9 pts) · Runner-up: Gemini 2.5 Pro (36.6 pts) · Third Choice: claude-opus-4.6 (36.6 pts)
Top Pick: claude-opus-4.6 (0 pts) · Runner-up: Claude Sonnet 4.6 (0 pts) · Third Choice: deepseek-r1 (0 pts)
Qwen3 Max (65 pts) · Claude Sonnet 4.6 (62.5 pts) · DeepSeek V4 Pro (62.5 pts)
Worth reading today — beyond the hype
We only feature content that impacts capability, pricing, stability, or model selection.
News
Nvidia Has Committed $40 Billion to AI Equity Deals This Year
Nvidia continues to play the role of super-investor in the AI ecosystem in 2026, having committed $40 billion year-to-date to AI-related equity deals. The figure not only far exceeds last year's full-year total but also signals the chip giant's strategic shift from hardware supplier to capital enabler. Compiled from TechCrunch, this piece digs into the industry logic behind the massive investments.
News
AI Toys for Kids: The New Wild West
From storytelling smart dolls to conversational robot companions, AI toys for children are pouring into homes at unprecedented speed. They promise to spark creativity and accompany a child's growth, yet they quietly collect children's voice and behavioral data and may even affect social development and imagination. Several US states have proposed bans, and tech companies and parents are locked in fierce debate. Is this Wild West of AI toys a revolution transforming childhood, or a Pandora's box to be wary of?
News
Hackers Attack Robot Lawn Mowers: A New Nightmare Begins
Robot lawn mowers have security vulnerabilities that allow them to be remotely hijacked or weaponized. Also in this roundup: Meta officially shuts down encrypted Instagram DMs, the Trump administration cracks down on "violent left-wing extremists," and leaked documents reveal a Russian school that trains elite hackers. Another new threat lands in tech security.
News
Musk v. OpenAI, Week Two: OpenAI Fires Back, Former Executive Reveals Poaching Attempt
The Musk v. OpenAI case enters its second week, with courtroom focus shifting to Musk's motives for suing. Musk claims he was deceived into donating $38 million; OpenAI counters that his allegations are absurd. Former board member Shivon Zilis revealed that Musk once tried to poach Sam Altman in an attempt to weaken OpenAI's leadership. The case touches on AI-industry competition and the nonprofit conversion, prompting broad debate over AI governance and business ethics.
News
Oracle Layoff Controversy: Remote Workers Denied WARN Protection
In Oracle's recent mass layoffs, some employees tried to negotiate better severance packages, but the company flatly refused. More shocking to those laid off: because they were classified as remote workers, the company claims they do not qualify for the WARN Act's 60-day advance-notice requirement. The move has drawn widespread criticism over remote workers' rights and highlights a legal gray zone in how tech giants conduct layoffs.
Review
WDCD Engineering Scenarios: Conventions Are Not Obsessive-Compulsive Disorder, They Are the Seatbelt of Production Systems
Based on WDCD Run #105 data, engineering convention scenarios have the highest failure rate among all constraint categories, with Q239 being the only problem where all 11 models failed. The root cause is that such constraints lack negative feedback support from security training, making them a structural blind spot for all models.
Review
WDCD Scoring Insight: Violations with Warnings Are the Most Dangerous Violations
In the evaluation data of WDCD Run #105, a recurring violation pattern is more subtle and dangerous than outright reckless errors: the model first writes a risk warning, then immediately outputs the violating code. This "violation with warnings" is currently the most deceptive output mode for large models in rule-compliance scenarios.
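The "violation with warnings" pattern described above can be caught mechanically: flag outputs where a risk warning appears and the violating code still follows it. A minimal sketch, assuming illustrative cue phrases and a caller-supplied violation regex (neither is the actual WDCD rubric):

```python
import re

# Illustrative warning cues, not the real WDCD detection list
WARNING_CUES = ("not recommended", "at your own risk", "security risk", "warning")

def warned_then_violated(output: str, violation_pattern: str) -> bool:
    """True when a risk warning appears and violating code follows it anyway."""
    lower = output.lower()
    # Position of the earliest warning cue, or -1 if none is present
    warn_pos = min(
        (lower.find(c) for c in WARNING_CUES if c in lower), default=-1
    )
    match = re.search(violation_pattern, output)
    return warn_pos != -1 and match is not None and match.start() > warn_pos

out = "Warning: disabling TLS verification is a security risk.\nverify=False"
print(warned_then_violated(out, r"verify=False"))  # → True
```

Outputs that violate without any warning, or warn without violating, both return False; only the warn-then-violate sequence is flagged.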
Review
WDCD Scenario Analysis: Why Business Rules Are Harder to Uphold Than Security Rules
Based on the measured data from WDCD Run #105, this analysis examines the differences in model compliance between business rules and security rules, highlighting the structural reasons why business rules are more prone to failure.
Review
WDCD Cross-Review: Why Resource Constraints Have Become the Achilles' Heel of All Models
Resource constraints, especially numerical limits like maximum retries or concurrency, are poorly followed by AI models under pressure. Data from WDCD Run #105 reveals that models often disregard these boundaries when users push for performance, with failure rates exceeding even those in security compliance scenarios.
Not all AI news is worth reading. What matters is what changes your judgment.
View All News
Why This Leaderboard Is Worth Your Attention
11
Models Tested
Fully transparent
212
Open Questions
Random sampling
30
Compliance Scenarios
Zero AI judging
1998
Founded
Continuously operating
0
Vendor Sponsors
Fully independent
Real Code Execution
Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.
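The pass-or-zero sandbox step can be sketched as follows. This is a minimal illustration, not the real harness: the function name and 100-point scale are assumptions, and a production sandbox would also isolate filesystem and network access rather than just subprocess the code.

```python
import subprocess
import sys
import tempfile

def score_code(code: str, timeout_s: int = 10) -> int:
    """Run a model-generated Python submission and score it pass-or-zero."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        # Non-zero exit (exception, failed assertion) scores zero
        return 100 if result.returncode == 0 else 0
    except subprocess.TimeoutExpired:
        return 0  # hanging code is a failure, not partial credit

print(score_code("print(1 + 1)"))  # → 100
```

The timeout matters as much as the exit code: code that looks plausible but never terminates gets the same zero as code that crashes.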
Citation Verification
For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
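Tracing citations back to the source reduces, at its simplest, to checking that each cited passage actually occurs in the source document. A minimal sketch under that assumption (function name is illustrative; a production checker would likely also allow fuzzy matching):

```python
def verify_citations(citations: list[str], source: str) -> float:
    """Fraction of cited passages found verbatim in the source text."""
    if not citations:
        return 0.0  # an answer with no citations cannot be verified
    # Normalize whitespace so line wrapping doesn't break matches
    norm_source = " ".join(source.split())
    hits = sum(
        1 for c in citations if " ".join(c.split()) in norm_source
    )
    return hits / len(citations)

doc = "The WARN Act requires 60 days of advance notice for mass layoffs."
print(verify_citations(["60 days of advance notice"], doc))  # → 1.0
```

A fabricated quote scores 0 on this check even if the surrounding answer "looks right", which is exactly the failure mode the verification step targets.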
Statistical Rankings
We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
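The rolling-average idea can be sketched in a few lines. The 4-week window, tie handling, and data shapes below are assumptions for illustration, not the index's actual parameters:

```python
def rolling_scores(weekly: dict[str, list[float]], window: int = 4):
    """Rank models by the mean of their most recent `window` weekly scores."""
    avg = {
        model: sum(scores[-window:]) / len(scores[-window:])
        for model, scores in weekly.items()
        if scores  # skip models with no runs yet
    }
    return sorted(avg.items(), key=lambda kv: kv[1], reverse=True)

history = {
    "grok-3": [88.0, 90.0, 89.0, 89.5],
    "deepseek-v3": [99.0, 60.0, 70.0, 65.0],  # one lucky week doesn't stick
}
print(rolling_scores(history))
```

Averaging over the window is what damps luck-driven fluctuation: a single outlier week moves the mean by at most its weight, 1/window.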
No Sponsored Benchmarks
No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.
The AI world changes daily — you need a reliable source
3 curated picks daily, weekly index changes, instant alerts for incidents and price shifts. Free, no ads, unsubscribe anytime.
- Daily Picks — From the flood of AI news, we pick the 3 that truly matter
- YZ Index Weekly — Who's up, who's down — one email covers it all
- Model Incident Alerts — When a model you use has an issue, know immediately
- Price Change Notifications — API price changes — don't find out from the bill
Free | No Ads | No Sponsored Content | Unsubscribe Anytime
Want deeper analysis? Go further.
The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.
Enter Research Lab