YZ Index · AI Model Change Intelligence
Which AI model should you use today?
We benchmark them every week.
11 models · 212 randomly sampled questions · Real code execution · Citation verification · Rolling-average rankings · Don't trust press releases; check continuous performance.
Code Sandbox Execution
Citation Accuracy Check
Statistical Significance Ranking
Compliance Testing
No Vendor Sponsorship
Who to Use Right Now
#1 Overall (Rolling Average)
Grok 3
Biggest Rise This Week
文心一言 4.0 +15
Latest Benchmark
2026-05-04 SGT · judge v6
11
Models Tested
212
Test Questions
30
WDCD Scenarios
5 categories × 6 questions
Weekly
Auto-evaluation frequency
Overall Top 5 (Rolling Average) · Full Rankings · Quick Scene Lookup · Recommend by Scenario · Weekly Signals · Changes Report
Don't just look at the overall score; consider your use case.
Top Pick: 豆包 Pro (92.2 pts) · Runner-up: Gemini 2.5 Pro (89.4 pts) · Third Choice: grok-3 (88.9 pts)
Top Pick: Gemini 2.5 Pro (47.2 pts) · Runner-up: claude-opus-4.6 (46.3 pts) · Third Choice: 豆包 Pro (46.3 pts)
Top Pick: grok-3 (84.4 pts) · Runner-up: Claude Sonnet 4.6 (81.1 pts) · Third Choice: claude-opus-4.6 (79.7 pts)
Top Pick: deepseek-v3 (99.7 pts) · Runner-up: ernie-4 (98.5 pts) · Third Choice: 豆包 Pro (93 pts)
Top Pick: 豆包 Pro (38.9 pts) · Runner-up: Gemini 2.5 Pro (36.6 pts) · Third Choice: claude-opus-4.6 (36.6 pts)
Top Pick: claude-opus-4.6 (0 pts) · Runner-up: Claude Sonnet 4.6 (0 pts) · Third Choice: deepseek-r1 (0 pts)
Qwen3 Max (65 pts) · Claude Sonnet 4.6 (62.5 pts) · DeepSeek V4 Pro (62.5 pts)
Worth reading today — beyond the hype
We only feature content that impacts capability, pricing, stability, or model selection.
News
Nvidia Has Committed $40 Billion to AI Equity Deals This Year
Nvidia continues to play the role of super-investor in the AI ecosystem in 2026, having committed $40 billion year-to-date to AI-related equity deals. The figure not only far exceeds last year's full-year total but also signals the chip giant's strategic shift from hardware supplier to capital enabler. Compiled from TechCrunch, this piece digs into the industry logic behind the massive investments.
News
AI Toys for Kids: The New Wild West
From storytelling smart dolls to conversational robot companions, AI toys for children are pouring into homes at unprecedented speed. They promise to spark creativity and accompany a child's growth, yet they quietly collect children's voice and behavioral data and may even affect social development and imagination. Several US states have proposed bans, and tech companies and parents are locked in fierce debate. Is this Wild West of AI toys a revolution transforming childhood, or a Pandora's box to be wary of?
News
Hackers Attack Robot Lawn Mowers: A New Nightmare Begins
Robot lawn mowers have security vulnerabilities that allow them to be remotely hijacked or weaponized. Also in this roundup: Meta officially shuts down encrypted Instagram DMs, the Trump administration cracks down on "violent left-wing extremists," and leaked documents reveal a Russian school that trains elite hackers. Another new threat lands in tech security.
News
Musk v. OpenAI, Week Two: OpenAI Fires Back, Former Executive Reveals Poaching Attempt
The Musk v. OpenAI case enters its second week, with courtroom focus shifting to Musk's motives for suing. Musk claims he was deceived into donating $38 million; OpenAI counters that his allegations are absurd. Former board member Shivon Zilis revealed that Musk once tried to poach Sam Altman in an attempt to weaken OpenAI's leadership. The case touches on AI-industry competition and the nonprofit conversion, prompting broad debate over AI governance and business ethics.
News
Oracle Layoff Controversy: Remote Workers Denied WARN Protection
In Oracle's recent mass layoffs, some employees tried to negotiate better severance packages, but the company flatly refused. More shocking to those laid off: because they were classified as remote workers, the company claims they do not qualify for the WARN Act's 60-day advance-notice requirement. The move has drawn widespread criticism over remote workers' rights and highlights a legal gray zone in how tech giants conduct layoffs.
Review
WDCD Engineering Scenarios: Conventions Are Not Obsessive-Compulsive Disorder, They Are the Seatbelt of Production Systems
Based on WDCD Run #105 data, engineering convention scenarios have the highest failure rate among all constraint categories, with Q239 being the only problem where all 11 models failed. The root cause is that such constraints lack negative feedback support from security training, making them a structural blind spot for all models.
Review
WDCD Scoring Insight: Violations with Warnings Are the Most Dangerous Violations
In the evaluation data of WDCD Run #105, a recurring violation pattern is more subtle and dangerous than outright reckless errors: the model first writes a risk warning, then immediately outputs the violating code. This "violation with warnings" is currently the most deceptive output mode for large models in rule-compliance scenarios.
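The "violation with warnings" pattern described above can be caught mechanically: flag outputs where a risk warning appears and the violating code still follows it. A minimal sketch, assuming illustrative cue phrases and a caller-supplied violation regex (neither is the actual WDCD rubric):

```python
import re

# Illustrative warning cues, not the real WDCD detection list
WARNING_CUES = ("not recommended", "at your own risk", "security risk", "warning")

def warned_then_violated(output: str, violation_pattern: str) -> bool:
    """True when a risk warning appears and violating code follows it anyway."""
    lower = output.lower()
    # Position of the earliest warning cue, or -1 if none is present
    warn_pos = min(
        (lower.find(c) for c in WARNING_CUES if c in lower), default=-1
    )
    match = re.search(violation_pattern, output)
    return warn_pos != -1 and match is not None and match.start() > warn_pos

out = "Warning: disabling TLS verification is a security risk.\nverify=False"
print(warned_then_violated(out, r"verify=False"))  # → True
```

Outputs that violate without any warning, or warn without violating, both return False; only the warn-then-violate sequence is flagged.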
Review
WDCD Scenario Analysis: Why Business Rules Are Harder to Uphold Than Security Rules
Based on the measured data from WDCD Run #105, this analysis examines the differences in model compliance between business rules and security rules, highlighting the structural reasons why business rules are more prone to failure.
Review
WDCD Cross-Review: Why Resource Constraints Have Become the Achilles' Heel of All Models
Resource constraints, especially numerical limits like maximum retries or concurrency, are poorly followed by AI models under pressure. Data from WDCD Run #105 reveals that models often disregard these boundaries when users push for performance, with failure rates exceeding even those in security compliance scenarios.
Not all AI news is worth reading. What matters is what changes your judgment.
View All News
Why This Leaderboard Is Worth Your Attention
11
Models Tested
Fully transparent
212
Open Questions
Random sampling
30
Compliance Scenarios
Zero AI judging
1998
Founded
Continuously operating
0
Vendor Sponsors
Fully independent
Real Code Execution
Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.
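The pass-or-zero sandbox step can be sketched as follows. This is a minimal illustration, not the real harness: the function name and 100-point scale are assumptions, and a production sandbox would also isolate filesystem and network access rather than just subprocess the code.

```python
import subprocess
import sys
import tempfile

def score_code(code: str, timeout_s: int = 10) -> int:
    """Run a model-generated Python submission and score it pass-or-zero."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        # Non-zero exit (exception, failed assertion) scores zero
        return 100 if result.returncode == 0 else 0
    except subprocess.TimeoutExpired:
        return 0  # hanging code is a failure, not partial credit

print(score_code("print(1 + 1)"))  # → 100
```

The timeout matters as much as the exit code: code that looks plausible but never terminates gets the same zero as code that crashes.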
Citation Verification
For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
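Tracing citations back to the source reduces, at its simplest, to checking that each cited passage actually occurs in the source document. A minimal sketch under that assumption (function name is illustrative; a production checker would likely also allow fuzzy matching):

```python
def verify_citations(citations: list[str], source: str) -> float:
    """Fraction of cited passages found verbatim in the source text."""
    if not citations:
        return 0.0  # an answer with no citations cannot be verified
    # Normalize whitespace so line wrapping doesn't break matches
    norm_source = " ".join(source.split())
    hits = sum(
        1 for c in citations if " ".join(c.split()) in norm_source
    )
    return hits / len(citations)

doc = "The WARN Act requires 60 days of advance notice for mass layoffs."
print(verify_citations(["60 days of advance notice"], doc))  # → 1.0
```

A fabricated quote scores 0 on this check even if the surrounding answer "looks right", which is exactly the failure mode the verification step targets.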
Statistical Rankings
We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
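The rolling-average idea can be sketched in a few lines. The 4-week window, tie handling, and data shapes below are assumptions for illustration, not the index's actual parameters:

```python
def rolling_scores(weekly: dict[str, list[float]], window: int = 4):
    """Rank models by the mean of their most recent `window` weekly scores."""
    avg = {
        model: sum(scores[-window:]) / len(scores[-window:])
        for model, scores in weekly.items()
        if scores  # skip models with no runs yet
    }
    return sorted(avg.items(), key=lambda kv: kv[1], reverse=True)

history = {
    "grok-3": [88.0, 90.0, 89.0, 89.5],
    "deepseek-v3": [99.0, 60.0, 70.0, 65.0],  # one lucky week doesn't stick
}
print(rolling_scores(history))
```

Averaging over the window is what damps luck-driven fluctuation: a single outlier week moves the mean by at most its weight, 1/window.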
No Sponsored Benchmarks
No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.
The AI world changes daily — you need a reliable source
3 curated picks daily, weekly index changes, instant alerts for incidents and price shifts. Free, no ads, unsubscribe anytime.
- Daily Picks — From the flood of AI news, we pick the 3 that truly matter
- YZ Index Weekly — Who's up, who's down — one email covers it all
- Model Incident Alerts — When a model you use has an issue, know immediately
- Price Change Notifications — API price changes — don't find out from the bill
Free | No Ads | No Sponsored Content | Unsubscribe Anytime
Want deeper analysis? Go further.
The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.
Enter Research Lab