YZ Index · AI Model Change Intelligence

Which AI model should you use today?
We benchmark them every week.

11 models · 212 questions randomly sampled · Real code execution · Citation verification · Rolling average rankings · Don't trust press releases; check continuous performance.

Code Sandbox Execution · Citation Accuracy Check · Statistical Significance Ranking · Constraint Compliance Testing · No Vendor Sponsorship
Who to Use Right Now
#1 Overall (Rolling Average): Grok 3
Biggest Rise This Week: 文心一言 4.0 (+15)
Latest Benchmark: 2026-04-27 SGT, judge v6
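Models evaluated: 11
Questions evaluated: 212
DCD constraint-compliance scenarios: 5 constraint types × 6 questions
Automated evaluation frequency: weekly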

Don't just look at the overall score — consider your use case

Top Pick: 豆包 Pro, 92.2 pts
Runner-up: Gemini 2.5 Pro, 89.4 pts
Third Choice: grok-3, 88.9 pts

Top Pick: Gemini 2.5 Pro, 47.2 pts
Runner-up: claude-opus-4.6, 46.3 pts
Third Choice: 豆包 Pro, 46.3 pts

Top Pick: grok-3, 84.4 pts
Runner-up: Claude Sonnet 4.6, 81.1 pts
Third Choice: claude-opus-4.6, 79.7 pts

Top Pick: deepseek-v3, 99.7 pts
Runner-up: ernie-4, 98.5 pts
Third Choice: 豆包 Pro, 93 pts

Top Pick: 豆包 Pro, 38.9 pts
Runner-up: Gemini 2.5 Pro, 36.6 pts
Third Choice: claude-opus-4.6, 36.6 pts

Top Pick: claude-opus-4.6, 0 pts
Runner-up: Claude Sonnet 4.6, 0 pts
Third Choice: deepseek-r1, 0 pts

Qwen3 Max, 66.7 pts
Claude Sonnet 4.6, 65.8 pts
Claude Opus 4.7, 65 pts

View Full Recommendations by Use Case

Worth reading today — beyond the hype

We only feature content that impacts capability, pricing, stability, or model selection.

News
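Data sovereignty as AI scales: how enterprises can take control of their own data
In pursuing AI customization, enterprises are actively taking control of their own data. The key challenge is balancing data ownership with the secure flow of high-quality data. MIT Technology Review's EmTech AI conference explored how AI factories can unlock new levels of scale, sustainability, and governance, paving the way for data-driven insights.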
News
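GPT-5.5 matches Mythos Preview in cybersecurity tests
The latest cybersecurity test results show GPT-5.5 on par with the much-discussed Mythos Preview across several key metrics. Experts say this undercuts the earlier claim that Mythos's cybersecurity capabilities were a "disruptive breakthrough by a single model," and suggests that competition in AI threat defense is evening out. The tests covered core scenarios including penetration testing, vulnerability identification, and attack simulation.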
News
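A Christian-only phone network that blocks pornography and gender content, plus a new approach to debugging LLMs
A new US phone network targets Christian users and automatically blocks pornography and gender-related content, sparking a free-speech debate. Meanwhile, debugging techniques for large language models have seen a breakthrough; the two share common ground in content filtering and model calibration. Compiled from MIT Technology Review, this piece takes an in-depth look at how technology is reshaping the boundary between faith and AI.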
News
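SAP: how enterprise AI governance protects profit margins
SAP says consumer-grade AI models often show a 10% error on business-critical tasks, eroding profits. Enterprise AI governance redefines margin protection by turning statistical guesses into deterministic controls. Manos Raptopoulos, SAP's Global President of Customer Success, stresses that only a rigorous governance framework lets enterprises upgrade AI from a "probabilistic toy" to a "profit engine." This article examines the core logic, implementation path, and business value of enterprise AI governance.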
News
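Low-cost hopping submersibles: a boon for deep-sea science, a catalyst for mining?
The US National Oceanic and Atmospheric Administration (NOAA) research ship Rainier is searching the Pacific for critical minerals, and the secret weapon it carries is a new low-cost seabed-hopping submersible. The device, which can "leapfrog" across the seafloor repeatedly, could sharply cut the cost of deep-sea exploration, but it may also accelerate the controversial push toward deep-sea mining. Compiled from MIT Technology Review, this piece explores the opportunities and concerns behind the technical breakthrough.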
News
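GitHub Copilot moves to per-token billing, shaking up AI coding assistant pricing
GitHub Copilot announced that from June 1, 2026 it will drop its flat subscription fee and bill by AI token usage instead. Developers will leave behind the simple "unlimited requests" subscription and pay for what they actually consume. The new pricing covers all AI interactions, including code generation, explanation, and debugging, at roughly 0.01 US cents per token. The move may trigger an industry-wide reshuffle of pricing models for AI coding tools.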
News
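US Christian-only phone network blocks pornography and gender content
The first nationwide US mobile network aimed at Christians is set to launch next week. It blocks pornography at the carrier level, and even adult users cannot turn the filter off, a first in the United States. The network will also deploy filters that restrict access to gender-related content. Cybersecurity experts note that this kind of network-level content blocking will spark intense debate over free speech and religious values.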
News
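Trump's mass dismissals deal another blow to US science
Last Friday, all 22 members of a committee of prominent scientists at the US National Science Foundation (NSF) were dismissed. The foundation funds roughly $9 billion in research each year, and the dismissals are the Trump administration's latest heavy blow to research institutions. Analysts say the move will seriously damage the independence of US scientific research, the stability of long-term projects, and international competitiveness, and the academic community is deeply worried.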
News
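ChatGPT Images 2.0 takes off in India while the rest of the world shrugs
ChatGPT Images 2.0 has set off a creative boom in India, where users are generating personal avatars and cinematic-style portraits in large numbers. Yet the feature has not drawn comparable attention in major markets such as Europe and the US. This article analyzes the Indian market's distinct demand, the technical background, and the global competitive landscape for AI image generation tools, and asks why ChatGPT's image feature is getting such a hot-and-cold reception across Eastern and Western markets.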
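
For readers weighing the per-token billing reported in the GitHub Copilot item above, the arithmetic can be sketched as follows. This is a rough illustration only: the rate is the approximate "0.01 US cents per token" figure quoted in that news item, and the function name `estimate_cost_usd` and the sample token count are hypothetical, not part of any GitHub API or price list.

```python
from decimal import Decimal

# Assumption: roughly 0.01 US cents per token, as quoted in the news item,
# which is 0.0001 USD per token. Actual billing rates may differ.
USD_PER_TOKEN = Decimal("0.0001")


def estimate_cost_usd(tokens_used: int, usd_per_token: Decimal = USD_PER_TOKEN) -> Decimal:
    """Estimate a usage bill in USD for a given number of tokens.

    Decimal is used instead of float so that small per-token rates
    do not accumulate binary rounding error.
    """
    if tokens_used < 0:
        raise ValueError("tokens_used must be non-negative")
    return tokens_used * usd_per_token


# Hypothetical month: one million tokens at the quoted approximate rate.
monthly_tokens = 1_000_000
print(f"Estimated cost: ${estimate_cost_usd(monthly_tokens):.2f}")  # Estimated cost: $100.00
```

At that rate the bill scales linearly with usage, so the comparison against a flat subscription depends entirely on how many tokens a developer actually consumes.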

Not all AI news is worth reading. What matters is what changes your judgment.

View All News

Why This Leaderboard Is Worth Your Attention

Not because we're loud, but because our methods are open, rules are fixed, and results are traceable.

Real Code Execution
Looking like it can code isn't enough. We run the code in a sandbox. If it doesn't pass, it scores zero.
Citation Verification
For long-document questions, we don't just check if the answer looks right — we verify citations trace back to the source.
Statistical Rankings
We don't judge on a single run. Rankings are based on rolling averages, avoiding luck-driven fluctuations.
No Sponsored Benchmarks
No co-evaluations, no pre-test consultations, no favoritism. Whatever the results are, that's what we publish.
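
The all-or-nothing code scoring and the rolling-average ranking described above can be illustrated with a minimal sketch. It is not the YZ Index implementation: the 4-week window, the 100-point scale, the function names, and the model names `model-a` and `model-b` are all assumptions chosen for illustration.

```python
def score_code_answer(passed_sandbox: bool, full_marks: float = 100.0) -> float:
    """All-or-nothing scoring: code that fails in the sandbox earns zero."""
    return full_marks if passed_sandbox else 0.0


def rolling_average(scores: list[float], window: int = 4) -> float:
    """Mean of the most recent `window` scores (fewer if history is shorter)."""
    if not scores:
        raise ValueError("scores must not be empty")
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = scores[-window:]
    return sum(recent) / len(recent)


def rank_models(weekly_scores: dict[str, list[float]], window: int = 4) -> list[tuple[str, float]]:
    """Rank models by rolling average, highest first; ties break by name."""
    averages = {name: rolling_average(s, window) for name, s in weekly_scores.items()}
    return sorted(averages.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical weekly scores, oldest first. model-a has one lucky week;
# model-b is steadier and ranks first on the rolling average.
weekly = {
    "model-a": [60.0, 60.0, 60.0, 100.0],
    "model-b": [72.0, 72.0, 72.0, 72.0],
}
print(rank_models(weekly))  # [('model-b', 72.0), ('model-a', 70.0)]
```

Ranking on the window mean rather than the latest run is what keeps a single strong week from pushing a model to the top.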

View Methodology

The AI world changes daily — you need a reliable source

3 curated picks daily, weekly index changes, instant alerts for incidents and price shifts. Free, no ads, unsubscribe anytime.

  • Daily Picks — From the flood of AI news, we pick the 3 that truly matter
  • YZ Index Weekly — Who's up, who's down — one email covers it all
  • Model Incident Alerts — When a model you use has an issue, know immediately
  • Price Change Notifications — API price changes — don't find out from the bill
Free | No Ads | No Sponsored Content | Unsubscribe Anytime

Want deeper analysis? Go further.

The leaderboard answers "who's stronger." Research Lab answers "why." Model safety, edge deployment, performance teardowns — not rehashing papers, but conclusions from our own testing.

Enter Research Lab