Winzheng — AI Model Benchmarking · Change Intelligence

This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across multiple models found gpt-o3 to be the best overall, with an

2026-05-25 07:02

Overall Top 5

#1 Grok 4 83.7 ▲2.7
#2 Claude Opus 4.7 81.9 ▲1.9
#3 Doubao Pro 81.6
#4 Claude Sonnet 4.6 81.2 ▼1.8
#5 DeepSeek V4 Pro 81.1 ▲4.8
#6 Qwen3 Max 80.8 ▲1.8
#7 GPT-5.5 79.4 ▲2.4
#8 GPT-o3 78.5
#9 ERNIE 4.5 74.2 ▲7.1
#10 Gemini 3.1 Pro 52.8 ▼24.9
#11 Gemini 2.5 Pro 49.3 ▼29.7

▲ ERNIE 4.5 +70.7 · ▼ DeepSeek V3 -75.1

Latest News

News 05-25 11:10 NF

LQA Agent Reaches 90% Agreement with Human Reviewers: Smartling Bets on AI to Reshape Enterprise Localization

Smartling, a localization software service provider, announced on May 19 what it calls its "largest-ever" update to AI t

News 05-25 11:08 NF

Claude Suddenly Displays Hypnotic Instruction: Multiple Users Advised to Go to Sleep, Alignment Concerns Lurk Behind Anthropic’s Silence

On May 24, numerous users on X reported that Anthropic’s Claude model exhibited anomalous behavior jokingly

News 05-25 11:05 NF

DeepSeek Makes V4-Pro's 75% Discount Permanent: A High-Stakes Bet to Reshape Global AI API Pricing Logic

DeepSeek's permanent 75% discount on V4-Pro signals a fundamental shift from temporary promotion to permanent pricing, e

News 05-25 11:00 NF

Taiwan Launches National AI Strategy Committee: Risk Assessment by July, Industry Regulations by 2028, Asia-Pacific Governance Race Quietly Accelerates

Taiwan has established a National AI Strategy Committee chaired by the Premier, initiating the implementation of the AI

Review 05-25 06:46

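MLCommons Announces 2026 Rising Stars: 39 Emerging Machine Learning Systems Researchers Selected

MLCommons has announced its fourth Rising Stars cohort: 39 early-career researchers from 26 institutions worldwide were selected from more than 175 applicants. Their research spans large language models, ML systems efficiency, hardware-software co-design, trustworthy AI, multimodal learning, and application areas such as healthcare, cybersecurity, and scientific computing.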

News 05-25 06:03 NF

Modal Labs $355M Series C Funding: 5x ARR Growth Leads Serverless GPU

Modal Labs disclosed a $355 million Series C funding round on May 21, 2026, reflecting strong market demand for serverless GPU

News 05-25 06:03 NF

Cohere Open-sources Command A+ 218B MoE Model to Reshape Enterprise Sovereign AI

Cohere has open-sourced Command A+, a 218B-parameter sparse MoE model with 25B active parameters and 128K context length

News 05-25 06:02 NF

US Revokes 90-Day Federal Review Order for Frontier AI Models, Highlighting Regulatory Divergence Among US, China, and EU

The US abruptly withdrew a planned 90-day federal review requirement for frontier AI models hours before its signing, ci

News 05-25 06:00 TC

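AI Security as a Real-Time Contest: Even Google Is Feeling Its Way Forward

We are in a transitional period for AI security, and that applies to everyone. Giants and startups alike are responding in real time to unprecedented challenges. Google's security measures expose systemic problems: traditional security frameworks are failing, the attack-defense contest is accelerating, and regulation is lagging. This article takes an in-depth look at the current state of AI security and explores how the industry can move from "reactive response" to "proactive defense."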

Review 05-25 03:10

ERNIE 4.5 Code Execution Plummets from 95 to 50, Main Score Drops 27.2 Points in a Single Day

In today's Smoke evaluation, ERNIE 4.5's main score fell from 88.48 to 61.25, a single-day drop of 27.2 points, driven b

Review 05-25 03:10

DeepSeek V4 Pro Integrity Rating Switches from Fail to Pass, Main Ranking Surges 23 Points in a Single Day

DeepSeek V4 Pro's integrity rating on today's Smoke evaluation jumped directly from Fail to Pass, and its main ranking s

Review 05-25 03:10

DeepSeek V4 Pro Tops with 97.08 Points, ERNIE Execution Score Plunges 27.2 Points

In the latest Smoke Lightweight Benchmark, DeepSeek V4 Pro scored 97.08 to become the only model breaking 97, while ERNIE

WDCD Compliance

#1 Claude Opus 4.7 65
#2 Claude Sonnet 4.6 62.5
#3 Doubao Pro 60
#4 Gemini 2.5 Pro 57.5
#5 Qwen3 Max 57.5
#6 GPT-o3 55
#7 ERNIE 4.5 52.5

Research Lab

3-Model Translation Showdown: Week 22 Quality Evaluation, gpt-o3 Leads with 8.3 Points

This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across

WDCD Run #125: Average Instruction Decay Hits 63.6%, Claude Opus 4.7 Leads with Only 30% Drop

WDCD Run #125 (2026-05-20) tested 11 large language models on multi-turn commitment integrity, with

3 Major Models Translation Showdown: Week 21 Quality Evaluation, gpt-o3 Leads with 8.7 Points

This week, 242 translation tasks were completed by 3 models. 3 articles were sampled for multi-model

YZ Index — AI Model Benchmarks, News & Research
