Winzheng — AI Model Benchmarking · Change Intelligence

Ernie Bot 4.5's execution score dropped sharply from 100 to 50, causing its main leaderboard score to plummet 11 points to 62.96. This is not a minor fluctuatio

2026-05-30 03:10

Overall Top 5

#1 Grok 4 83.7 ▲2.7 · #2 Claude Opus 4.7 81.9 ▲1.9 · #3 Doubao Pro 81.6 · #4 Claude Sonnet 4.6 81.2 ▼1.8 · #5 DeepSeek V4 Pro 81.1 ▲4.8 · #6 Qwen3 Max 80.8 ▲1.8 · #7 GPT-5.5 79.4 ▲2.4 · #8 GPT-o3 78.5 · #9 Ernie Bot 4.5 74.2 ▲7.1 · #10 Gemini 3.1 Pro 52.8 ▼24.9 · #11 Gemini 2.5 Pro 49.3 ▼29.7 · ▲ Ernie Bot 4.5 +70.7 · ▼ DeepSeek V3 -75.1

Full Rankings →

Latest News

View All News →

News 05-31 02:01 TC

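Meta Moves into AI Hardware: A Smart Pendant Could Become the Next-Generation Interaction Gateway

According to TechCrunch, Meta is developing an AI-powered smart pendant that supports voice interaction, real-time translation, and object recognition, and that works in concert with other Meta devices. It marks Meta's latest bet on AI hardware, aimed at building a lightweight, always-on AI assistant. Industry analysts believe the move will drive wearable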

News 05-31 02:00 TC

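GitHub Copilot Billing Overhaul Draws Developer Ridicule: 'What a Joke'

Microsoft-owned GitHub Copilot has announced that it will switch to token-based billing in June 2026, replacing its fixed subscription plans. The move has angered the developer community, where it has been criticized as a 'price hike in disguise' and as 'stifling innovation'. Analysts say it marks the end of the golden age of AI coding assistants and also exposes the platform operator and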

News 05-31 00:00 TC

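Hands-On with Google's AI Assistant Gemini Spark: Efficient and Practical Around the Clock

Google has launched a new AI assistant, Gemini Spark, which it says can help users with everyday tasks around the clock, from email summaries to planning local activities. In hands-on testing, the author found that it does improve productivity, but it is puzzling why Google made it a standalone product rather than integrating it into existing services. This article takes an in-depth look at its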

News 05-30 22:00 TC

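Browser Wars Heat Up: Five Popular New Alternatives Challenging Chrome and Safari in 2026

With Chrome and Safari having long dominated the browser market, a wave of emerging alternatives is mounting a challenge with privacy protection, innovative features, and lightweight design. Compiled from a recent TechCrunch report, this article walks through five leading alternative browsers, namely Arc, Brave, Vivaldi, Firefox, and Edge, and their core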

News 05-30 18:00 WD

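Is Paid Transcription Software Worth It? Hands-On Tests Give the Answer

With AI transcription software multiplying on the market, should you pay a monthly fee for a more efficient experience, or are free tools good enough? WIRED editors tested several products, including Wispr Flow, comparing them in depth on accuracy, features, privacy, and value for money to help readers make an informed choice. This article is compiled from WIRED.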

News 05-30 08:00 TC

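No AI, No Coding? Experts Warn That Reliance on AI Could Backfire

AI tools help programmers write code faster, but researchers warn that faster does not mean better code. Many developers have grown used to relying on AI and even refuse to work without it. This trend could lead to long-term risks such as degraded coding skills and more security vulnerabilities. This article analyzes the hidden dangers of AI-assisted programming and explores how developers should balance efficiency with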

News 05-30 06:30 X

Meta Employee Mouse Tracking Tool Exposed: Clash Between Remote Work Monitoring and EU Privacy Regulations

Meta has been revealed to deploy a mouse tracking tool internally to monitor employee behavior, sparking heated debate o

News 05-30 06:30 X

Claude Portfolio Bets on ServiceNow Rebound: Are AI Agents Infrastructure Winners or Market Illusions?

A discussion about Claude's simulated portfolio has sparked industry debate, as it buys ServiceNow, viewing the company

News 05-30 06:30 X

Oppo Open-Sources X-OmniClaw Framework: How On-Device AI Agents Reshape Privacy and Smart Experience

Oppo has announced the open-sourcing of its X-OmniClaw Android AI agent framework, a significant breakthrough in on-devi

News 05-30 06:29 X

Senator Warren's AI Tax Proposal Sparks Debate in Silicon Valley and Politics: Can $4 Trillion in Annual Revenue Be Realized?

Senator Elizabeth Warren's AI tax proposal has sparked intense debate among tech and political circles, with an estimate

News 05-30 06:29 X

NVIDIA and Dell Jointly Demonstrate AI Factory: New Breakthrough in Enterprise-Level Agentic AI and Robot Deployment

NVIDIA and Dell recently showcased the AI Factory solution at TechWorld, drawing widespread industry attention. The solu

News 05-30 06:29 X

Google Agentic AI Search Reshapes Search Landscape: Gemini Multimodal Agent Technology Breakthrough Draws Industry Attention

Google has rolled out a major update in AI search, advancing its Agentic AI Search strategy by introducing intelligent i

In-Depth Reviews

View All →

Review 05-30

Ernie Bot 4.5 Code Execution Score Plummets from 100 to 50, Main Leaderboard Score Drops 11 Points in a Single Day

In today's Smoke quick test, Ernie Bot 4.5's main leaderboard score fell from 74 to 62.96, a drop of 11 points, with code exec

Review 05-30

Ernie Bot's Execution Score Plummets 50, Smoke Light Test Shakes Up Today's Main Leaderboard

Ernie Bot 4.5's execution score dropped sharply from 100 to 50, causing its main leaderboard score to plummet 11 points

Review 05-29

DeepSeek V4 Pro Smoke Test: Main Index Soars by 48.7, while Engineering Judgment Plunges by 28.4

DeepSeek V4 Pro delivered extremely polarized results in today's Smoke evaluation. The main index jumped from 39.26 to 8

WDCD Compliance

#1 Qwen3 Max 72.5 #2 Claude Sonnet 4.6 65 #3 DeepSeek V4 Pro 62.5 #4 Gemini 2.5 Pro 60 #5 GPT-5.5 60 #6 Claude Opus 4.7 57.5 #7 GPT-o3 57.5

View full compliance rankings →

Research Lab

WDCD Run #135: Qwen3 Max Leads with Only 10% Instruction Decay as Field Average Hits 43.3%

WDCD Run #135 (2026-05-27) evaluated 11 large language models across three dialogue rounds, finding

3-Model Translation Showdown: Week 22 Quality Evaluation, GPT-o3 Leads with 8.3 Points

This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across

WDCD Run #125: Average Instruction Decay Hits 63.6%, Claude Opus 4.7 Leads with Only 30% Drop

WDCD Run #125 (2026-05-20) tested 11 large language models on multi-turn commitment integrity, with

Enter Research Lab →

YZ Index — AI Model Benchmarks, News & Research