YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 questions randomly sampled · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases; check continuous performance.

View YZ Index · Subscribe to Weekly Changes

Code Sandbox Execution · Citation Accuracy Check · Statistical Significance Ranking · Compliance Testing · No Vendor Sponsorship

Who to Use Right Now

#1 Overall (Rolling Average) Grok 3

Biggest Rise This Week 文心一言 4.0 +15

Latest Benchmark 2026-04-27 SGT

judge v6

Models evaluated

Questions evaluated

DCD compliance scenarios

5 constraint types × 6 questions

Weekly

Automated evaluation frequency

#1 Grok 3 86.9 ─
#2 豆包 Pro 86.4 ▲ +1.3
#3 Gemini 2.5 Pro 84.3 ▲ +3.5
#4 Claude Sonnet 4.6 84.1 ▲ +7.3
#5 Claude Opus 4.6 83.4 ▲ +3.9

Incidents / Pricing

2 incidents

0 pricing changes

Don't just look at the overall score — consider your use case

Top Pick

豆包 Pro

92.2 pts

Runner-up

Gemini 2.5 Pro

89.4 pts

Third Choice

grok-3

88.9 pts

Top Pick

Gemini 2.5 Pro

47.2 pts

Runner-up

claude-opus-4.6

46.3 pts

Third Choice

豆包 Pro

46.3 pts

Top Pick

grok-3

84.4 pts

Runner-up

Claude Sonnet 4.6

81.1 pts

Third Choice

claude-opus-4.6

79.7 pts

Top Pick

deepseek-v3

99.7 pts

Runner-up

ernie-4

98.5 pts

Third Choice

豆包 Pro

93 pts

Top Pick

豆包 Pro

38.9 pts

Runner-up

Gemini 2.5 Pro

36.6 pts

Third Choice

claude-opus-4.6

36.6 pts

Top Pick

claude-opus-4.6

0 pts

Runner-up

Claude Sonnet 4.6

0 pts

Third Choice

deepseek-r1

0 pts

Qwen3 Max

66.7 pts

Claude Sonnet 4.6

65.8 pts

Claude Opus 4.7

65 pts

View Full Recommendations by Use Case · View Full Compliance Rankings

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

News

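Low-Cost Hopping Submersibles: A Boon for Deep-Sea Science, a Catalyst for Mining?

The NOAA research ship Rainier is searching the Pacific for critical minerals, and its secret weapon is a new low-cost seafloor-hopping submersible. The device, which can "leapfrog" repeatedly across the seabed, could sharply cut the cost of deep-sea exploration, but it could also accelerate the controversial push toward deep-sea mining. Compiled from MIT Technology Review, this article examines the opportunities and concerns behind the breakthrough.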

News

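GitHub Copilot Moves to Per-Token Billing, Shifting the Pricing Model for AI Coding Assistants

GitHub Copilot has announced that from June 1, 2026, it will drop its fixed subscription fee and bill by AI token usage instead. Developers will give up the simple "unlimited requests" subscription and pay for what they actually consume. The new pricing covers all AI interactions, including code generation, explanation, and debugging, at roughly 0.01 cents per token. The move could trigger a broad shake-up of pricing models across AI coding tools.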

News

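A Christian-Only Mobile Network in the US: Blocking Pornography and Gender Content

The first nationwide US mobile network aimed at Christians is set to launch next week. It blocks pornography at the carrier level, and adult users cannot turn the filter off, a first in the United States. The network will also deploy filters that restrict access to gender-related content. Cybersecurity experts say this kind of network-level content blocking will spark intense debate over free speech and religious values.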

News

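Trump's Mass Dismissals Deal Another Blow to US Science

Last Friday, all 22 prominent scientists serving on a National Science Foundation (NSF) board were dismissed. The foundation funds roughly $9 billion in research each year, and the dismissals are the Trump administration's latest heavy blow against scientific institutions. Analysts say the move will seriously damage the independence of US research, the stability of long-term projects, and the country's international competitiveness, and the academic community is deeply worried.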

News

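ChatGPT Images 2.0 Goes Viral in India, Gets a Muted Response Elsewhere

ChatGPT Images 2.0 has set off a creative craze in India, where users are generating personal avatars and cinematic portraits in large numbers. In major markets such as Europe and the US, however, the feature has not drawn the same attention. This article looks at the Indian market's distinctive demand, the technical background, and the global competitive landscape for AI image generators, and explores why the feature has met such contrasting receptions in East and West.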

News

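The Hidden Bridge Between Musk and OpenAI: The Intermediary Role of the Mother of Four of His Children

New evidence disclosed in court reveals how Shivon Zilis acted as a key go-between for Elon Musk and OpenAI. The mother of four of Musk's children and also a Neuralink executive, Zilis played a subtle and complicated role in the fierce contest between Musk and OpenAI. The information comes from internal messages made public at the recent trial, and it shows the little-known power dynamics between a tech titan and the startup.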

News

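Apple Surprised by AI-Driven Surge in Mac Demand as Supply Stays Tight

Apple acknowledges that the explosive growth in demand for AI computing has far exceeded its expectations, leaving the Mac mini, Mac Studio, and Mac Neo supply-constrained through next quarter. The situation reflects the appetite of AI workloads for high-performance hardware and also exposes a lag in Apple's supply-chain planning. Compiled from TechCrunch.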

News

Mac Mini May Be Hard to Buy in the Coming Months

News

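Funding Countdown: Anthropic May Reach a $900 Billion Valuation Within Two Weeks

According to people familiar with the matter, AI company Anthropic is asking investors to submit their allocations for its latest funding round within 48 hours, at a valuation that may exceed $900 billion. The figure would set a new record for AI fundraising and reflects extreme market optimism about the commercial prospects of foundation-model companies. This article analyzes the funding background, industry competition, and whether the valuation is justified.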

Not all AI news is worth reading. What matters is what changes your judgment. View All News

Why This Leaderboard Is Worth Your Attention

Not because we're loud, but because our methods are open, rules are fixed, and results are traceable.

Real Code Execution

Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it's zero.
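As a rough illustration of the all-or-nothing rule above, here is a minimal Python sketch. It is not the YZ Index harness: the function name `score_submission`, the 100-point scale, and the 10-second timeout are assumptions made for the example, and a plain subprocess with a timeout is only a stand-in for a real sandbox, which would also need filesystem, network, and resource isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def score_submission(code: str, test_code: str,
                     full_score: float = 100.0,
                     timeout_s: float = 10.0) -> float:
    """Run model-written code plus its tests in a child process.

    Returns full_score only if the process exits with status 0.
    Any failed assertion, crash, or timeout scores 0.0.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "submission.py"
        # The tests are appended to the submission so one run checks both.
        script.write_text(code + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
    return full_score if result.returncode == 0 else 0.0
```

A submission that merely looks plausible, such as an `add` function that subtracts, fails its assertion and receives 0.0 under this rule, with no partial credit.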

Citation Verification

For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
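One simple way to check that a citation traces back to its source is sketched below. It assumes citations are verbatim quotes and treats a citation as verified when the quote appears in the source after whitespace and case normalization. That matching rule and the function names are assumptions for illustration; the actual check may be stricter or use a different method.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase, so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def citation_accuracy(citations: list[str], source: str) -> float:
    """Fraction of quoted citations that can be found in the source document."""
    if not citations:
        return 0.0
    haystack = normalize(source)
    hits = 0
    for quote in citations:
        needle = normalize(quote)
        # An empty quote is never counted as verified.
        if needle and needle in haystack:
            hits += 1
    return hits / len(citations)
```

With this rule, an answer that quotes one passage present in the document and one passage that is not receives a citation accuracy of 0.5, even if the answer itself reads well.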

Statistical Rankings

We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
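The effect of ranking on a rolling average instead of a single run can be sketched as follows. The window size, the function names, and the model names and scores in the usage note are hypothetical; the page does not state the window length or weighting it actually uses.

```python
from collections import deque
from statistics import mean


def rolling_average(scores, window=4):
    """Average of the most recent `window` scores, computed after each weekly run."""
    recent = deque(maxlen=window)  # older scores fall out automatically
    averages = []
    for score in scores:
        recent.append(score)
        averages.append(mean(recent))
    return averages


def rank_models(weekly_scores, window=4):
    """Sort models by their latest rolling average, highest first."""
    latest = {
        model: rolling_average(scores, window)[-1]
        for model, scores in weekly_scores.items()
        if scores  # skip models with no runs yet
    }
    return sorted(latest.items(), key=lambda item: item[1], reverse=True)
```

For example, with a window of 3, a hypothetical `model_a` scoring 80, 82, 99 averages 87, while `model_b` scoring 88, 87, 89 averages 88. `model_b` ranks first even though `model_a` had the best single week, which is the kind of luck-driven swing the rolling average is meant to damp.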

No Sponsored Benchmarks

No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.

View Methodology

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.

Enter Research Lab

The AI world changes daily — you need a reliable source

Want deeper analysis? Go further.

Which AI model should you use today?
We benchmark them every week.