YZ Index — AI Model Benchmarks, News & Research
Overall Top 5
Full Rankings →
#1 Grok 4 83.7 ▲2.7
#2 Claude Opus 4.7 81.9 ▲1.9
#3 Doubao Pro 81.6
#4 Claude Sonnet 4.6 81.2 ▼1.8
#5 DeepSeek V4 Pro 81.1 ▲4.8
#6 Qwen3 Max 80.8 ▲1.8
#7 GPT-5.5 79.4 ▲2.4
#8 GPT-o3 78.5
#9 Ernie Bot 4.5 74.2 ▲7.1
#10 Gemini 3.1 Pro 52.8 ▼24.9
#11 Gemini 2.5 Pro 49.3 ▼29.7
▲ Ernie Bot 4.5 +70.7 · ▼ DeepSeek V3 -75.1
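Latest News
View All News →
Meta Bets on AI Hardware: Smart Pendant Could Become the Next-Generation Interaction Gateway
According to TechCrunch, Meta is developing an AI-powered smart pendant that supports voice interaction, real-time translation, and object recognition, and works together with other Meta devices. This marks Meta's latest bet in AI hardware, aimed at building a lightweight, always-on AI assistant. Industry analysts believe the move will push wearable
GitHub Copilot Billing Overhaul Draws Developer Ridicule: 'What a Joke'
Microsoft-owned GitHub Copilot has announced that it will introduce a new token-based billing model in June 2026, replacing the existing fixed subscription. The move has triggered strong discontent in the developer community, where it has been criticized as a 'price hike in disguise' and as 'stifling innovation'. Analysts say it marks the end of the golden age of AI coding assistants and also exposes the platform operators and
Google AI Assistant Gemini Spark Hands-On: Efficient and Practical Around the Clock
Google has launched a new AI assistant, Gemini Spark, which it claims can help users with everyday tasks around the clock, from email summaries to planning local activities. In hands-on testing, the author found that it does improve productivity, but it remains puzzling why Google made it a standalone product rather than integrating it into existing services. This article takes an in-depth look at its
Browser Wars Escalate! Five Popular New Options Challenging Chrome and Safari in 2026
With Chrome and Safari having long dominated the browser market, a group of emerging alternatives is mounting a challenge through privacy protection, innovative features, and lightweight design. Compiled from a recent TechCrunch report, this article surveys five mainstream alternative browsers, including Arc, Brave, Vivaldi, Firefox, and Edge, and their core
Should You Pay for Transcription Software? Hands-On Testing Shows Whether It Is Worth It
With AI transcription software constantly appearing on the market, is it better to pay monthly for a more efficient experience, or are free tools good enough? WIRED editors tested Wispr Flow and several other products, comparing them on accuracy, features, privacy, and value for money to help readers make an informed choice. This article is compiled from WIRED.
No AI, No Coding? Experts Warn That Relying on AI May Backfire
AI tools let programmers write code faster, but researchers warn that this does not mean better code. Many developers have grown used to relying on AI and even refuse to work without it. This trend could lead to long-term risks such as eroding coding skills and more security vulnerabilities. This article analyzes the hidden dangers of AI-assisted programming and explores how developers should balance efficiency with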
Meta Employee Mouse Tracking Tool Exposed: Clash Between Remote Work Monitoring and EU Privacy Regulations
Meta has been revealed to deploy a mouse tracking tool internally to monitor employee behavior, sparking heated debate o
Claude Portfolio Bets on ServiceNow Rebound: Are AI Agents Infrastructure Winners or Market Illusions?
A discussion about Claude's simulated portfolio has sparked industry debate, as it buys ServiceNow, viewing the company
Oppo Open-Sources X-OmniClaw Framework: How On-Device AI Agents Reshape Privacy and Smart Experience
Oppo has announced the open-sourcing of its X-OmniClaw Android AI agent framework, a significant breakthrough in on-devi
Senator Warren's AI Tax Proposal Sparks Debate in Silicon Valley and Politics: Can $4 Trillion in Annual Revenue Be Realized?
Senator Elizabeth Warren's AI tax proposal has sparked intense debate among tech and political circles, with an estimate
NVIDIA and Dell Jointly Demonstrate AI Factory: New Breakthrough in Enterprise-Level Agentic AI and Robot Deployment
NVIDIA and Dell recently showcased the AI Factory solution at TechWorld, drawing widespread industry attention. The solu
Google Agentic AI Search Reshapes Search Landscape: Gemini Multimodal Agent Technology Breakthrough Draws Industry Attention
Google has rolled out a major update in AI search, advancing its Agentic AI Search strategy by introducing intelligent i
In-Depth Comparative Reviews
View All →
Ernie Bot 4.5 Code Execution Plummets from 100 to 50, Main Leaderboard Drops 11 Points in a Single Day
In today's Smoke quick test, Ernie Bot 4.5's main leaderboard score fell from 74 to 62.96, a drop of 11 points, with code exec
Ernie Bot's Execution Score Plummets 50, Smoke Light Test Shakes Up Today's Main Leaderboard
Ernie Bot 4.5's execution score dropped sharply from 100 to 50, causing its main leaderboard score to plummet 11 points
DeepSeek V4 Pro Smoke Test: Main Index Soars by 48.7, while Engineering Judgment Plunges by 28.4
DeepSeek V4 Pro delivered extremely polarized results in today's Smoke evaluation. The main index jumped from 39.26 to 8
WDCD Compliance
#1 Qwen3 Max 72.5
#2 Claude Sonnet 4.6 65
#3 DeepSeek V4 Pro 62.5
#4 Gemini 2.5 Pro 60
#5 GPT-5.5 60
#6 Claude Opus 4.7 57.5
#7 GPT-o3 57.5
View full compliance rankings →
Research Lab
WDCD Run #135: Qwen3 Max Leads with Only 10% Instruction Decay as Field Average Hits 43.3%
WDCD Run #135 (2026-05-27) evaluated 11 large language models across three dialogue rounds, finding
3 Models Translation Showdown: Week 22 Quality Evaluation, GPT-o3 Leads with 8.3 Points
This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across
WDCD Run #125: Average Instruction Decay Hits 63.6%, Claude Opus 4.7 Leads with Only 30% Drop
WDCD Run #125 (2026-05-20) tested 11 large language models on multi-turn commitment integrity, with