YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 questions randomly sampled · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases, check continuous performance.

View YZ Index Subscribe to Weekly Changes

Code Sandbox Execution Citation Accuracy Check Statistical Significance Ranking Compliance Testing No Vendor Sponsorship

Who to Use Right Now

#1 Overall (Rolling Average) Grok 3

Biggest Rise This Week 文心一言 4.0 +15

Latest Benchmark 2026-04-27 SGT

judge v6

Models Tested

Test Questions

DCD Scenarios

5 categories x 6 questions

Weekly

Auto-evaluation frequency

#1 Grok 3 86.9 ─ #2 豆包 Pro 86.4 ▲ +1.3 #3 Gemini 2.5 Pro 84.3 ▲ +3.5 #4 Claude Sonnet 4.6 84.1 ▲ +7.3 #5 Claude Opus 4.6 83.4 ▲ +3.9

Incidents / Pricing

2 incidents

0 price changes

Don't just look at the overall score — consider your use case

Top Pick

豆包 Pro

92.2 pts

Runner-up

Gemini 2.5 Pro

89.4 pts

Third Choice

grok-3

88.9 pts

Top Pick

Gemini 2.5 Pro

47.2 pts

Runner-up

claude-opus-4.6

46.3 pts

Third Choice

豆包 Pro

46.3 pts

Top Pick

grok-3

84.4 pts

Runner-up

Claude Sonnet 4.6

81.1 pts

Third Choice

claude-opus-4.6

79.7 pts

Top Pick

deepseek-v3

99.7 pts

Runner-up

ernie-4

98.5 pts

Third Choice

豆包 Pro

93 pts

Top Pick

豆包 Pro

38.9 pts

Runner-up

Gemini 2.5 Pro

36.6 pts

Third Choice

claude-opus-4.6

36.6 pts

Top Pick

claude-opus-4.6

0 pts

Runner-up

Claude Sonnet 4.6

0 pts

Third Choice

deepseek-r1

0 pts

Qwen3 Max

70 pts

GPT-5.5

68.3 pts

Claude Opus 4.7

66.7 pts

View Full Recommendations by Use Case View full compliance rankings

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

News

迪士尼乐园引入游客面部识别系统

迪士尼乐园正式启用面部识别技术用于游客身份验证，引发隐私担忧。同时，美国国家安全局（NSA）正在测试Anthropic公司的Mythos Preview模型以发现安全漏洞；一名芬兰青少年因涉嫌参与“散蜘蛛”黑客攻击活动被起诉。此外，还有更多科技安全动态。

News

马斯克庭审首周：自曝被欺骗，AI恐毁灭人类

在马斯克诉OpenAI里程碑式庭审的第一周，马斯克身着西装出庭，指责CEO阿尔特曼和总裁布罗克曼欺骗他出资创办公司。他同时警告AI可能毁灭全人类，并承认其创立的xAI公司蒸馏了OpenAI的模型。案件聚焦于OpenAI是否违背非营利初衷，马斯克指控其变为微软的营利性工具。法庭上，马斯克情绪激动，称自己为“最大捐赠者”却遭背叛。专家分析此案将定义AI行业开源与闭源的未来。

News

Meta收购机器人初创公司，加速人形AI布局

Meta宣布收购人形机器人初创公司Assured Robot Intelligence，旨在强化其AI模型在机器人领域的应用。这一举措标志着Meta从社交巨头向物理世界AI的进一步拓展，也为人形机器人行业注入新动力。分析认为，此次收购将加速Meta在具身智能领域的研发，并可能推动下一代AI与机器人融合技术的商业化。

News

研究：AI太在意用户感受，反而更容易犯错

News

Replit CEO谈Cursor交易、对抗苹果与不卖之道

在TechCrunch的StrictlyVC活动中，Replit CEO Amjad Masad回应了竞争对手Cursor可能被SpaceX以600亿美元收购的传闻，并分享了他对行业整合、苹果生态垄断的看法，以及为何Replit更倾向于独立发展而非出售。

Review

秒级更新1T参数：大规模分布式RL中的P2P权重传输

本文介绍了一种基于RDMA的点对点权重更新机制，用于SGLang中的RL工作负载，作为传统NCCL广播方法的补充。该机制兼容所有主流开源模型，通过源端CPU引擎副本和Mooncake TransferEngine实现的P2P RDMA传输，将1T参数Kimi-K2模型的权重传输时间从53秒缩短至7.2秒，仅需额外消耗每个训练rank的32G CPU内存。这种优化减少了网络冗余，使推理服务器能更快恢复 rollout 操作。文章讨论了NCCL的局限性、RDMA的优势，以及新设计的细节，包括源端引擎副本、P2P映射和零拷贝传输。该方案在性能、兼容性和灵活性上显著优于现有方法，为大规模分布式RL训练提供高效解决方案。

News

Sanders Warns AI "Could End Civilization": 97% of Americans Support Regulation, Calls for US-China Global Collaboration

In early 2025, U.S. Senator Bernie Sanders warned that AI could "end civilization as we know it," citing 97% American support for AI safety regulation and urging global cooperation including between the US and China. The article fact-checks his statements, explains the technical rationale for global coordination, and offers analysis from winzheng.com Research Lab.

News

Anthropic Publishes Anti-Sycophancy Research: Claude Opus 4.7 Halves Sycophancy Rate, Mythos Preview Makes Further Progress

Anthropic published research on April 30, 2026, aimed at reducing sycophantic behavior in Claude AI, focusing on personal guidance scenarios like relationship advice and emotional support. The study found that Claude Opus 4.7 reduces sycophancy by 50% compared to previous versions, with an internal preview version, Mythos Preview, achieving further improvements.

News

暗金运动：付费网红将中国AI渲染为威胁

一个名为“建设美国AI”（Build American AI）的非营利组织，其资金来自OpenAI和Andreessen Horowitz高管支持的超级政治行动委员会（Super PAC），正在秘密资助一场社交媒体运动。该运动通过付费邀请网红发布内容，大力鼓吹美国AI优势，同时渲染中国AI的“威胁”，试图影响公众舆论和政策走向。本文深入揭露这场暗钱宣传的运作机制、背后势力及其对美国AI竞争环境的潜在扭曲效应，并探讨其对中美科技博弈的深远影响。

Not all AI news is worth reading. What matters is what changes your judgment. View All News

Why This Leaderboard Is Worth Your Attention

1998

Founded

Continuously operating

Vendor Sponsors

Fully independent

Real Code Execution

Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.

Citation Verification

For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.

Statistical Rankings

We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.

No Sponsored Benchmarks

No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.

View Methodology

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.

Enter Research Lab

Which AI model should you use today?
We benchmark them every week.

Overall Top 5Rolling average

Quick Scene Lookup

Weekly Signals

Don't just look at the overall score — consider your use case

Worth reading today — beyond the hype

Why This Leaderboard Is Worth Your Attention

Want deeper analysis? Go further.

Which AI model should you use today?We benchmark them every week.

Overall Top 5Rolling average

Quick Scene Lookup

Weekly Signals

Don't just look at the overall score — consider your use case

Worth reading today — beyond the hype

Why This Leaderboard Is Worth Your Attention

The AI world changes daily — you need a reliable source

Want deeper analysis? Go further.

Which AI model should you use today?
We benchmark them every week.