YZ Index — AI Model Benchmarks, News & Research
Overall Top 5
Full Rankings →
#1 Grok 4: 83.7 (▲2.7)
#2 Claude Opus 4.7: 81.9 (▲1.9)
#3 豆包 Pro: 81.6
#4 Claude Sonnet 4.6: 81.2 (▼1.8)
#5 DeepSeek V4 Pro: 81.1 (▲4.8)
#6 Qwen3 Max: 80.8 (▲1.8)
#7 GPT-5.5: 79.4 (▲2.4)
#8 GPT-o3: 78.5
#9 文心一言 4.5: 74.2 (▲7.1)
#10 Gemini 3.1 Pro: 52.8 (▼24.9)
#11 Gemini 2.5 Pro: 49.3 (▼29.7)
▲ 文心一言 4.5 +70.7 · ▼ DeepSeek V3 -75.1
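Latest News
View All News →
Early-Bird Countdown, 5 Days Left: Save $410 on TechCrunch Disrupt 2026 Tickets
TechCrunch Disrupt 2026 will be held in San Francisco. Early-bird pricing ends May 29 at 23:59 Pacific Time, with savings of up to $410. This article summarizes the event's highlights, analyzes trends in tech conferences, and reminds founders to take the last chance to save.
Startup Battlefield 200 Applications Close Soon: Seize the Opportunity Before May 27
Applications for Startup Battlefield 200, the competition run by the tech media outlet TechCrunch, close on May 27. Winners receive direct access to VCs, global exposure, TechCrunch feature coverage, and a $100,000 prize. It is a prime channel for startups to accelerate growth, with only days left
Papal AI Encyclical: Reflecting on the Monopoly of Power Through the Fog of Technology
Pope Leo XIV has issued his first encyclical, using artificial intelligence as a prism to point at deep-seated ills of contemporary society: excessive concentration of power, erosion of democratic institutions, and a tech elite reshaping the world in its own interest. This article, compiled from TechCrunch's in-depth analysis, reveals the encyclical's real concern: AI is only the starting point, and the crux is how to make technology serve humanity's common good.
The AI Era Spawns a Vulnerability-Hunting Arms Race
As attackers increasingly use AI to speed up exploit development, the way software vulnerabilities are found is changing profoundly. From automated vulnerability discovery to the generation of adversarial examples, AI is empowering attackers and defenders alike. This in-depth report examines the emerging arms race and explores how the security industry can respond to escalating AI-driven threats.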
LQA Agent Reaches 90% Agreement with Human Reviewers: Smartling Bets on AI to Reshape Enterprise Localization
Smartling, a localization software service provider, announced on May 19 what it calls its "largest-ever" update to AI t
DeepSeek Welds V4-Pro's 75% Discount Permanent: A High-Stakes Bet to Reshape Global AI API Pricing Logic
DeepSeek's permanent 75% discount on V4-Pro signals a fundamental shift from temporary promotion to permanent pricing, e
Taiwan Launches National AI Strategy Committee: Risk Assessment by July, Industry Regulations by 2028, Asia-Pacific Governance Race Quietly Accelerates
Taiwan has established a National AI Strategy Committee chaired by the Premier, initiating the implementation of the AI
3 Models Translation Showdown: Week 22 Quality Evaluation, gpt-o3 Leads with 8.3 Points
This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across multiple models foun
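MLCommons Announces 2026 Rising Stars: 39 Rising Machine Learning Systems Researchers Selected
MLCommons has announced its fourth Rising Stars cohort: 39 early-career researchers from 26 institutions worldwide were selected from more than 175 applicants. Their research spans large language models, ML systems efficiency, hardware-software co-design, trustworthy AI, multimodal learning, and application areas such as healthcare, cybersecurity, and scientific computing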
Modal Labs $355M Series C Funding: 5x ARR Growth Leads Serverless GPU
Modal Labs disclosed a $355 million Series C funding on May 21, 2026, reflecting strong market demand for serverless GPU
Cohere Open-sources Command A+ 218B MoE Model to Reshape Enterprise Sovereign AI
Cohere has open-sourced Command A+, a 218B-parameter sparse MoE model with 25B active parameters and 128K context length
US Revokes 90-Day Federal Review Order for Frontier AI Models, Highlighting Regulatory Divergence Among US, China, and EU
The US abruptly withdrew a planned 90-day federal review requirement for frontier AI models hours before its signing, ci
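In-Depth Comparative Reviews
View All →
MLCommons Announces 2026 Rising Stars: 39 Rising Machine Learning Systems Researchers Selected
MLCommons has announced its fourth Rising Stars cohort: 39 early-career researchers from 26 institutions worldwide were selected from more than 175 applicants. Their research spans large language models, ML systems efficiency, hardware-software co-design, trustworthy AI, multimodal learning, and application areas such as healthcare, cybersecurity, and scientific computing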
ERNIE 4.5 Code Execution Plummets from 95 to 50, Main Score Drops 27.2 Points in a Single Day
In today's Smoke evaluation, ERNIE 4.5's main score fell from 88.48 to 61.25, a single-day drop of 27.2 points, driven b
DeepSeek V4 Pro Integrity Rating Switches from Fail to Pass, Main Ranking Surges 23 Points in a Single Day
DeepSeek V4 Pro's integrity rating on today's Smoke evaluation jumped directly from Fail to Pass, and its main ranking s
WDCD Compliance
#1 Claude Opus 4.7: 65
#2 Claude Sonnet 4.6: 62.5
#3 豆包 Pro: 60
#4 Gemini 2.5 Pro: 57.5
#5 Qwen3 Max: 57.5
#6 GPT-o3: 55
#7 文心一言 4.5: 52.5
View full compliance rankings →
Research Lab
3 Models Translation Showdown: Week 22 Quality Evaluation, gpt-o3 Leads with 8.3 Points
This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across
WDCD Run #125: Average Instruction Decay Hits 63.6%, Claude Opus 4.7 Leads with Only 30% Drop
WDCD Run #125 (2026-05-20) tested 11 large language models on multi-turn commitment integrity, with
3 Major Model Translation Showdown: Week 21 Quality Evaluation, gpt-o3 Leads with 8.7 Points
This week, 242 translation tasks were completed by 3 models. 3 articles were sampled for multi-model