Winzheng — AI Model Benchmarking · Change Intelligence

Ernie Bot 4.5's execution score dropped sharply from 100 to 50, causing its main leaderboard score to plummet 11 points to 62.96. This is not a minor fluctuatio

2026-05-30 03:10

Overall Top 5

#1 Grok 4 83.7 ▲2.7 · #2 Claude Opus 4.7 81.9 ▲1.9 · #3 Doubao Pro 81.6 · #4 Claude Sonnet 4.6 81.2 ▼1.8 · #5 DeepSeek V4 Pro 81.1 ▲4.8 · #6 Qwen3 Max 80.8 ▲1.8 · #7 GPT-5.5 79.4 ▲2.4 · #8 GPT-o3 78.5 · #9 Ernie Bot 4.5 74.2 ▲7.1 · #10 Gemini 3.1 Pro 52.8 ▼24.9 · #11 Gemini 2.5 Pro 49.3 ▼29.7 · ▲ Ernie Bot 4.5 +70.7 · ▼ DeepSeek V3 -75.1

Full Rankings →

Latest News

View All News →

News 05-30 08:00 TC

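No Coding Without AI? Experts Warn That Reliance on AI May Backfire

AI tools help programmers write code faster, but researchers warn that faster does not mean better code. Many developers have grown used to relying on AI and even refuse to work without it. This trend could lead to long-term risks such as eroding coding skills and more security vulnerabilities. This article examines the hidden dangers of AI-assisted programming and explores how developers should balance efficiency with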

News 05-30 06:30 X

Meta Employee Mouse Tracking Tool Exposed: Clash Between Remote Work Monitoring and EU Privacy Regulations

Meta has been revealed to deploy a mouse tracking tool internally to monitor employee behavior, sparking heated debate o

News 05-30 06:30 X

Claude Portfolio Bets on ServiceNow Rebound: Are AI Agents Infrastructure Winners or Market Illusions?

A discussion about Claude's simulated portfolio has sparked industry debate, as it buys ServiceNow, viewing the company

News 05-30 06:30 X

Oppo Open-Sources X-OmniClaw Framework: How On-Device AI Agents Reshape Privacy and Smart Experience

Oppo has announced the open-sourcing of its X-OmniClaw Android AI agent framework, a significant breakthrough in on-devi

News 05-30 06:29 X

Senator Warren's AI Tax Proposal Sparks Debate in Silicon Valley and Politics: Can $4 Trillion in Annual Revenue Be Realized?

Senator Elizabeth Warren's AI tax proposal has sparked intense debate among tech and political circles, with an estimate

News 05-30 06:29 X

NVIDIA and Dell Jointly Demonstrate AI Factory: New Breakthrough in Enterprise-Level Agentic AI and Robot Deployment

NVIDIA and Dell recently showcased the AI Factory solution at TechWorld, drawing widespread industry attention. The solu

News 05-30 06:29 X

Google Agentic AI Search Reshapes Search Landscape: Gemini Multimodal Agent Technology Breakthrough Draws Industry Attention

Google has rolled out a major update in AI search, advancing its Agentic AI Search strategy by introducing intelligent i

News 05-30 06:28 X

Microsoft Copilot Super App Emerges: AI Unified Workspace May Reshape Enterprise Automation Landscape

Microsoft is accelerating the transformation of Copilot into a super app, integrating scattered AI tools into a unified

News 05-30 06:28 X

Anthropic Releases Claude Opus 4.8, Enterprise-Grade Agentic AI Applications Usher in a New Breakthrough

Anthropic has announced a major update to Claude Opus 4.8, focusing on enterprise applications by introducing dynamic sy

News 05-30 06:01 TC

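After the NVIDIA $20 Billion Acquisition Saga, AI Chip Rising Star Groq Secures Another $650 Million in Funding

According to Axios, AI chip company Groq is seeking to raise $650 million through an internal funding round as it shifts from hardware manufacturing to a focus on AI inference, the process that aims to optimize how AI models respond to prompt requests. The move follows the industry turbulence triggered by rumors of a massive NVIDIA acquisition and marks a further divergence in the AI chip competitive landscape.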

News 05-30 06:00 WD

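Amazon Revives "Good Advice Cupcake" with AI, Angering the Original Creator

Years ago, creator Loryn Brantz made the webcomic Good Advice Cupcake for BuzzFeed. Now BuzzFeed, without informing her, has licensed the character to Amazon to produce an AI-animated series. Brantz angrily accuses the company of stealing her work for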

News 05-30 04:02 ARS

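Behind the Free Housekeeping: Training Robots on Your Household Chore Data

A startup is offering free home cleaning, but requires users to wear a head-mounted camera that records the entire session in order to collect robot training data. The model has sparked a privacy controversy, and it also illustrates a new trend in AI data collection: trading paid or free services for real-world data to accelerate robot learning. This article analyzes the business model, the underlying technology, and the potential risks.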

In-Depth Reviews

View All →

Review 05-30

Ernie Bot 4.5 Code Execution Plummets from 100 to 50, Main Leaderboard Drops 11 Points in a Single Day

In today's Smoke quick test, Ernie Bot 4.5's main leaderboard score fell from 74 to 62.96, a drop of 11 points, with code exec

Review 05-30

Ernie Bot's Execution Score Plummets 50, Smoke Light Test Shakes Up Today's Main Leaderboard

Ernie Bot 4.5's execution score dropped sharply from 100 to 50, causing its main leaderboard score to plummet 11 points

Review 05-29

DeepSeek V4 Pro Smoke Test: Main Index Soars by 48.7, while Engineering Judgment Plunges by 28.4

DeepSeek V4 Pro delivered extremely polarized results in today's Smoke evaluation. The main index jumped from 39.26 to 8

WDCD Compliance

#1 Qwen3 Max 72.5 #2 Claude Sonnet 4.6 65 #3 DeepSeek V4 Pro 62.5 #4 Gemini 2.5 Pro 60 #5 GPT-5.5 60 #6 Claude Opus 4.7 57.5 #7 GPT-o3 57.5

View full compliance rankings →

Research Lab

WDCD Run #135: Qwen3 Max Leads with Only 10% Instruction Decay as Field Average Hits 43.3%

WDCD Run #135 (2026-05-27) evaluated 11 large language models across three dialogue rounds, finding

3 Models Translation Showdown: Week 22 Quality Evaluation, GPT-o3 Leads with 8.3 Points

This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across

WDCD Run #125: Average Instruction Decay Hits 63.6%, Claude Opus 4.7 Leads with Only 30% Drop

WDCD Run #125 (2026-05-20) tested 11 large language models on multi-turn commitment integrity, with

Enter Research Lab →

YZ Index — AI Model Benchmarks, News & Research
