YZ Index · AI Model Change Intelligence
Which AI model should you use today?
We benchmark them every week.
11 models · 212 randomly sampled questions · Real code execution · Citation verification · Rolling-average rankings · Don't trust press releases; check sustained performance.
Code Sandbox Execution
Citation Accuracy Check
Statistical Significance Ranking
Compliance Testing
No Vendor Sponsorship
Who to Use Right Now
#1 Overall (Rolling Average)
Grok 3
Biggest Rise This Week
文心一言 4.0 +15
Latest Benchmark: 2026-05-04 (SGT)
Judge: v6
11
Models Tested
212
Test Questions
30
DCD Scenarios
5 categories × 6 questions
Weekly
Auto-evaluation frequency
Overall Top 5 (rolling average)
Full Rankings · Quick Scene Lookup
Recommend by Scenario · Weekly Signals
Changes Report
Don't just look at the overall score; consider your use case.
Top Pick: 豆包 Pro (92.2 pts)
Runner-up: Gemini 2.5 Pro (89.4 pts)
Third Choice: grok-3 (88.9 pts)
Top Pick: Gemini 2.5 Pro (47.2 pts)
Runner-up: claude-opus-4.6 (46.3 pts)
Third Choice: 豆包 Pro (46.3 pts)
Top Pick: grok-3 (84.4 pts)
Runner-up: Claude Sonnet 4.6 (81.1 pts)
Third Choice: claude-opus-4.6 (79.7 pts)
Top Pick: deepseek-v3 (99.7 pts)
Runner-up: ernie-4 (98.5 pts)
Third Choice: 豆包 Pro (93 pts)
Top Pick: 豆包 Pro (38.9 pts)
Runner-up: Gemini 2.5 Pro (36.6 pts)
Third Choice: claude-opus-4.6 (36.6 pts)
Top Pick: claude-opus-4.6 (0 pts)
Runner-up: Claude Sonnet 4.6 (0 pts)
Third Choice: deepseek-r1 (0 pts)
Claude Opus 4.7 (67.5 pts)
GPT-o3 (66.7 pts)
Claude Sonnet 4.6 (63.3 pts)
Worth reading today: beyond the hype
We only feature content that impacts capability, pricing, stability, or model selection.
News
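Jensen Huang: AI Is Creating Large Numbers of Jobs, Not Destroying Them
Responding to public fears that AI will replace human workers, NVIDIA CEO Jensen Huang said in a recent interview that these anxieties are greatly exaggerated. He argues that AI is in fact creating a "massive" number of jobs, especially in AI development, deployment, and optimization. Drawing on TechCrunch's reporting, this article examines Huang's argument in depth and looks at the real relationship between AI and employment.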
News
WDCD Run #100: Average Instruction Decay Hits 39.1% Across 11 Models, Claude Opus 4.7 Leads
WDCD Run #100 (2026-05-03) tested 11 frontier models on multi-turn commitment integrity, recording an average instruction decay of 39.1% from Round 1 to Round 3. Claude Opus 4.7 took the top spot at 67.5 points with only 23% decay.
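The blurb reports decay "from Round 1 to Round 3" without defining the metric. The following Python sketch assumes decay is the relative drop of a model's score from Round 1 to Round 3, and that the headline average is an unweighted mean across models; both the definition and the model names and scores below are hypothetical, not the published WDCD method or data.

```python
from statistics import mean


def instruction_decay(round1: float, round3: float) -> float:
    """Relative score drop from Round 1 to Round 3 (0.23 means 23% decay).

    ASSUMPTION: decay is measured relative to the Round 1 score; the actual
    WDCD methodology may define it differently.
    """
    if round1 <= 0:
        raise ValueError("Round 1 score must be positive")
    return (round1 - round3) / round1


def average_decay(scores: dict[str, tuple[float, float]]) -> float:
    """Unweighted mean decay across models (a hypothetical aggregation)."""
    return mean(instruction_decay(r1, r3) for r1, r3 in scores.values())


# Hypothetical (Round 1, Round 3) scores per model, for illustration only.
scores = {
    "model-a": (80.0, 61.6),
    "model-b": (70.0, 42.0),
}

for name, (r1, r3) in scores.items():
    print(f"{name}: {instruction_decay(r1, r3):.0%} decay")
print(f"average: {average_decay(scores):.1%}")
```

Under this definition, a model that keeps a high absolute score in Round 3 can still show heavy decay if it started even higher, which is why decay and the overall score are reported separately.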
News
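OpenAI Partner Cerebras Races Toward a $26.6 Billion IPO
AI chipmaker Cerebras is preparing a blockbuster IPO that could value the company at $26.6 billion or more. As a close partner of OpenAI, Cerebras holds a key position in AI compute infrastructure thanks to its distinctive wafer-scale chip technology. The listing will test its business model and also reflects the fierce competition and investor frenzy in the AI chip sector.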
News
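Greg Brockman Defends His $30 Billion OpenAI Stake: "Blood, Sweat, and Tears"
OpenAI co-founder and president Greg Brockman appeared in federal court on Monday and revealed that he is one of the AI lab's largest individual shareholders. In his testimony he insisted that his stake, worth roughly $30 billion, was earned through "blood, sweat, and tears," responding to criticism that he is overcompensated. The case has sparked broad discussion of equity distribution at AI companies and of founders' commitments.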
News
AI Chip Startups Wayve and Rebellions Secure Massive Funding: AMD, Qualcomm, and Arm Back Wayve, Samsung-Backed Rebellions Raises $400 Million
AI chip startups Wayve and Rebellions have secured significant funding from major tech companies, reflecting the growing demand for advanced AI chips. This article analyzes the technical principles, impacts, and future trends from Winzheng's perspective.
News
FlexRule Releases AI Agent Governance Update: Enabling End-to-End Governance to Enhance AI Decision Reliability and Compliance
FlexRule has announced a new update to its decision platform that delivers end-to-end governance for AI Agents, aiming to make AI governance practical and address challenges in decision-making. The update emphasizes reliability and compliance in agentic systems.
News
Gary Marcus's Critique of Generative AI Sparks Debate: X Post Receives Thousands of Likes, Opinions Polarized
On May 3, 2026, prominent AI critic Gary Marcus posted a detailed thread on X outlining the reasons for the growing backlash against generative AI, citing negative impacts on education, deepfakes, misinformation, and environmental damage from data centers. The post quickly went viral, garnering thousands of likes and hundreds of replies and sharply dividing supporters and detractors.
News
Klaimee AI Officially Launches on Y Combinator: First Algerian Female Founder Introduces AI Agent Insurance, Highlighting Diversity in AI Entrepreneurship
Klaimee AI, founded by Ines Boutemadja, has launched on Y Combinator's Launch YC platform, offering insurance specifically for AI agents. Boutemadja is the first Algerian female founder in YC, underscoring the growing diversity in AI entrepreneurship.
News
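Unofficial "Notepad++ for Mac" Draws Protest from the Original Author
An unofficial macOS version of Notepad++, "vibe-coded" by a hobbyist, has sparked controversy in the developer community. The original author stated plainly: "Notepad++ has never released a macOS version." The software is accused of borrowing the name and branding of a well-known open-source project, and its code quality and security have also been questioned. This article reviews how the incident unfolded and analyzes the conflict in the open-source ecosystem between unauthorized third-party modifications and the rights of original authors.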
Not all AI news is worth reading. What matters is news that changes your judgment.
Why This Leaderboard Is Worth Your Attention
11
Models Tested
Fully transparent
212
Open Questions
Random sampling
30
Compliance Scenarios
Zero AI judging
1998
Founded
Continuously operating
0
Vendor Sponsors
Fully independent
Real Code Execution
Code that merely looks correct isn't enough. We run it in a sandbox; if it doesn't pass, it scores zero.
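A minimal Python sketch of all-or-nothing execution scoring. It assumes the model's answer is Python source and that hidden tests are bare `assert` statements appended to it; `run_candidate` is a hypothetical helper. A plain subprocess with a timeout stands in for the real sandbox here: it bounds run time but does not isolate the filesystem or network, so it is not safe for untrusted code on its own.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(code: str, tests: str, timeout_s: float = 10.0) -> int:
    """Return 1 if the candidate code passes every test, else 0 (no partial credit).

    ASSUMPTION: a subprocess stands in for a real sandbox; it limits wall-clock
    time but does NOT isolate the filesystem or network.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        # Candidate code first, then the hidden tests as bare asserts.
        script.write_text(code + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode, ignores user site/env
                cwd=tmp,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0  # code that hangs counts as a failure
    # Any exception or failed assert gives a non-zero exit code.
    return 1 if result.returncode == 0 else 0


# Illustrative usage with a hypothetical task.
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
print(run_candidate(good, tests))
print(run_candidate(bad, tests))
```

Scoring on the process exit code means a syntax error, a crash, a failed assertion, and a timeout all land on the same zero, which matches the "if it doesn't pass, it's zero" rule.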
Citation Verification
For long-document questions, we don't just check whether the answer looks right; we verify that its citations trace back to the source.
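A minimal Python sketch of one way to trace citations back to a source, assuming the answer cites by quoting the source text verbatim. Matching ignores case and line wrapping only. The helper names and the sample contract text are hypothetical; a production check might also use offsets, page numbers, or fuzzy matching.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace so line wraps don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(source: str, quotes: list[str]) -> list[bool]:
    """For each quoted citation, report whether it occurs in the source.

    ASSUMPTION: citations are verbatim quotes; blank quotes never count as valid.
    """
    haystack = normalize(source)
    results = []
    for quote in quotes:
        needle = normalize(quote)
        results.append(bool(needle) and needle in haystack)
    return results


def citation_accuracy(source: str, quotes: list[str]) -> float:
    """Fraction of citations that trace back to the source (0.0 if none given)."""
    if not quotes:
        return 0.0
    return sum(verify_citations(source, quotes)) / len(quotes)


# Hypothetical document and answer citations, for illustration only.
source = "The contract term is 24 months.\nEither party may terminate with notice."
quotes = [
    "24 months. Either party may terminate",  # spans a line break: still found
    "The contract term is 36 months.",        # looks plausible, but not in the source
]
print(verify_citations(source, quotes))
print(citation_accuracy(source, quotes))
```

The second quote is the case this check exists for: it reads like a real citation, and only comparing it against the source exposes it.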
Statistical Rankings
We don't judge on a single run. Rankings use rolling averages to smooth out luck-driven fluctuations.
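A minimal Python sketch of rolling-average ranking. It assumes each model has a list of weekly scores (oldest first) and uses a 4-week window purely for illustration, since the page does not state the real window. Model names and scores are hypothetical. The sketch only averages and sorts; it does not perform any significance test.

```python
from statistics import mean


def rolling_average(scores: list[float], window: int = 4) -> float:
    """Mean of the most recent `window` weekly scores (fewer if history is short).

    ASSUMPTION: the 4-week default window is illustrative, not the real setting.
    """
    if not scores:
        raise ValueError("need at least one weekly score")
    return mean(scores[-window:])


def rank_models(history: dict[str, list[float]], window: int = 4) -> list[tuple[str, float]]:
    """Sort models by rolling average, highest first."""
    averaged = {name: rolling_average(s, window) for name, s in history.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weekly scores (oldest first), for illustration only.
history = {
    "model-a": [70.0, 72.0, 95.0, 71.0],  # one lucky week
    "model-b": [80.0, 81.0, 79.0, 80.0],  # consistently strong
}
for name, avg in rank_models(history):
    print(f"{name}: {avg:.1f}")
```

Judged on week 3 alone, model-a would win with 95. Over the window, its average of 77.0 falls below model-b's 80.0, which is the kind of luck-driven swing a rolling average is meant to absorb.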
No Sponsored Benchmarks
No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.
The AI world changes daily; you need a reliable source
3 curated picks daily, weekly index changes, instant alerts for incidents and price shifts. Free, no ads, unsubscribe anytime.
- Daily Picks: from the flood of AI news, we pick the 3 stories that truly matter
- YZ Index Weekly: who's up, who's down, all in one email
- Model Incident Alerts: know immediately when a model you use has an issue
- Price Change Notifications: learn about API price changes before they show up on your bill
Free | No Ads | No Sponsored Content | Unsubscribe Anytime
Want deeper analysis? Go further.
The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns: not rehashed papers, but conclusions from our own testing.
Enter Research Lab