YZ Index — AI Model Benchmarks, News & Research
Editor's Pick
Ernie Bot's Execution Score Plummets 50, Smoke Light Test Shakes Up Today's Main Leaderboard
Ernie Bot 4.5's execution score dropped sharply from 100 to 50, causing its main leaderboard score to plummet 11 points to 62.96. This is not a minor fluctuatio
2026-05-30 03:10
Meta Employee Mouse Tracking Tool Exposed: Clash Between Remote Work Monitoring and EU Privacy Regulations
Meta has been revealed to deploy a mouse tracking tool internally to monitor emp
Claude Portfolio Bets on ServiceNow Rebound: Are AI Agents Infrastructure Winners or Market Illusions?
A discussion about Claude's simulated portfolio has sparked industry debate, as
Overall Top 5
Full Rankings →
#1 Grok 4 83.7 ▲2.7
#2 Claude Opus 4.7 81.9 ▲1.9
#3 Doubao Pro 81.6
#4 Claude Sonnet 4.6 81.2 ▼1.8
#5 DeepSeek V4 Pro 81.1 ▲4.8
#6 Qwen3 Max 80.8 ▲1.8
#7 GPT-5.5 79.4 ▲2.4
#8 GPT-o3 78.5
#9 Ernie Bot 4.5 74.2 ▲7.1
#10 Gemini 3.1 Pro 52.8 ▼24.9
#11 Gemini 2.5 Pro 49.3 ▼29.7
▲ Ernie Bot 4.5 +70.7 · ▼ DeepSeek V3 -75.1
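Latest News
View All News →
No Coding Without AI? Experts Warn That Reliance on AI May Backfire
AI tools let programmers write code faster, but researchers warn that faster does not mean better. Many developers have grown used to relying on AI and even refuse to work without it. This trend may lead to long-term risks such as eroding coding skills and more security vulnerabilities. This article takes an in-depth look at the hidden dangers of AI-assisted programming and explores how developers should balance efficiency with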
Meta Employee Mouse Tracking Tool Exposed: Clash Between Remote Work Monitoring and EU Privacy Regulations
Meta has been revealed to deploy a mouse tracking tool internally to monitor employee behavior, sparking heated debate o
Claude Portfolio Bets on ServiceNow Rebound: Are AI Agents Infrastructure Winners or Market Illusions?
A discussion about Claude's simulated portfolio has sparked industry debate, as it buys ServiceNow, viewing the company
Oppo Open-Sources X-OmniClaw Framework: How On-Device AI Agents Reshape Privacy and Smart Experience
Oppo has announced the open-sourcing of its X-OmniClaw Android AI agent framework, a significant breakthrough in on-devi
Senator Warren's AI Tax Proposal Sparks Debate in Silicon Valley and Politics: Can $4 Trillion in Annual Revenue Be Realized?
Senator Elizabeth Warren's AI tax proposal has sparked intense debate among tech and political circles, with an estimate
NVIDIA and Dell Jointly Demonstrate AI Factory: New Breakthrough in Enterprise-Level Agentic AI and Robot Deployment
NVIDIA and Dell recently showcased the AI Factory solution at TechWorld, drawing widespread industry attention. The solu
Google Agentic AI Search Reshapes Search Landscape: Gemini Multimodal Agent Technology Breakthrough Draws Industry Attention
Google has rolled out a major update in AI search, advancing its Agentic AI Search strategy by introducing intelligent i
Microsoft Copilot Super App Emerges: AI Unified Workspace May Reshape Enterprise Automation Landscape
Microsoft is accelerating the transformation of Copilot into a super app, integrating scattered AI tools into a unified
Anthropic Releases Claude Opus 4.8, Enterprise-Grade Agentic AI Applications Usher in a New Breakthrough
Anthropic has announced a major update to Claude Opus 4.8, focusing on enterprise applications by introducing dynamic sy
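After the NVIDIA $20 Billion Acquisition Turmoil, AI Chip Rising Star Groq Secures Another $650 Million in Funding
According to Axios, AI chip company Groq is seeking to raise $650 million through internal financing as it shifts from hardware manufacturing to a focus on AI inference, the process of optimizing how AI models respond to prompt requests. The move follows industry turbulence triggered by rumors of a massive NVIDIA acquisition and marks a further divergence in the AI chip competitive landscape. This
Amazon Uses AI to Revive "Good Advice Cupcake," Angering Its Original Creator
Years ago, creator Loryn Brantz made the webcomic Good Advice Cupcake for BuzzFeed. Now BuzzFeed has licensed the character to Amazon for an AI-animated series without informing the original creator. Brantz angrily accused the company of stealing her work for use in
Behind Free Housekeeping: Training Robots on Your Household Chore Data
A startup has launched a free home cleaning service but requires users to wear a head-mounted camera that records the entire process, in order to collect robot training data. The model has sparked privacy controversy while also illustrating a new trend in AI data collection: exchanging paid or free services for real-world scene data to accelerate robot learning. This article analyzes its business model, technical principles, and potential risks.
In-Depth Comparative Reviews
View All →
Ernie Bot 4.5 Code Execution Plummets from 100 to 50, Main Leaderboard Drops 11 Points in a Single Day
In today's Smoke quick test, Ernie Bot 4.5's main leaderboard score fell from 74 to 62.96, a drop of 11 points, with code exec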
Ernie Bot's Execution Score Plummets 50, Smoke Light Test Shakes Up Today's Main Leaderboard
Ernie Bot 4.5's execution score dropped sharply from 100 to 50, causing its main leaderboard score to plummet 11 points
DeepSeek V4 Pro Smoke Test: Main Index Soars by 48.7, while Engineering Judgment Plunges by 28.4
DeepSeek V4 Pro delivered extremely polarized results in today's Smoke evaluation. The main index jumped from 39.26 to 8
WDCD Compliance
#1 Qwen3 Max 72.5
#2 Claude Sonnet 4.6 65
#3 DeepSeek V4 Pro 62.5
#4 Gemini 2.5 Pro 60
#5 GPT-5.5 60
#6 Claude Opus 4.7 57.5
#7 GPT-o3 57.5
View full compliance rankings →
Research Lab
WDCD Run #135: Qwen3 Max Leads with Only 10% Instruction Decay as Field Average Hits 43.3%
WDCD Run #135 (2026-05-27) evaluated 11 large language models across three dialogue rounds, finding
Three-Model Translation Showdown: Week 22 Quality Evaluation, GPT-o3 Leads with 8.3 Points
This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across
WDCD Run #125: Average Instruction Decay Hits 63.6%, Claude Opus 4.7 Leads with Only 30% Drop
WDCD Run #125 (2026-05-20) tested 11 large language models on multi-turn commitment integrity, with