Winzheng — AI Model Benchmarking · Change Intelligence

Claude's Sudden "Hypnotic" Instructions: Multiple Users Advised to Go to Sleep, Alignment Concerns Behind Anthropic's Silence

On May 24, Anthropic's Claude model exhibited an unusual behavior dubbed "hypnosis" by users: abruptly suggesting users "go to sleep" mid-conversation. While see

2026-05-25 11:08

Overall Top 5

#1 Grok 4 83.7 ▲2.7 · #2 Claude Opus 4.7 81.9 ▲1.9 · #3 Doubao Pro 81.6 · #4 Claude Sonnet 4.6 81.2 ▼1.8 · #5 DeepSeek V4 Pro 81.1 ▲4.8 · #6 Qwen3 Max 80.8 ▲1.8 · #7 GPT-5.5 79.4 ▲2.4 · #8 GPT-o3 78.5 · #9 ERNIE 4.5 74.2 ▲7.1 · #10 Gemini 3.1 Pro 52.8 ▼24.9 · #11 Gemini 2.5 Pro 49.3 ▼29.7 · ▲ ERNIE 4.5 +70.7 · ▼ DeepSeek V3 -75.1

Full Rankings →

Latest News

View All News →

News 05-25 20:00 WD

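AI Era Spawns an Arms Race in Vulnerability Hunting

As attackers accelerate their use of AI for exploit development, the way software vulnerabilities are hunted is changing profoundly. From automated vulnerability discovery to the generation of adversarial examples, AI is empowering attackers and defenders alike. This in-depth report analyzes the emerging arms race and explores how the security industry can respond to the escalation of AI-driven threats.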

News 05-25 11:10 NF

LQA Agent Reaches 90% Agreement with Human Reviewers: Smartling Bets on AI to Reshape Enterprise Localization

Smartling, a localization software service provider, announced on May 19 what it calls its "largest-ever" update to AI t

News 05-25 11:05 NF

DeepSeek Makes V4-Pro's 75% Discount Permanent: A High-Stakes Bet to Reshape Global AI API Pricing Logic

DeepSeek's permanent 75% discount on V4-Pro signals a fundamental shift from temporary promotion to permanent pricing, e

News 05-25 11:00 NF

Taiwan Launches National AI Strategy Committee: Risk Assessment by July, Industry Regulations by 2028, Asia-Pacific Governance Race Quietly Accelerates

Taiwan has established a National AI Strategy Committee chaired by the Premier, initiating the implementation of the AI

News 05-25 07:02

3 Models Translation Showdown: Week 22 Quality Evaluation, GPT-o3 Leads with 8.3 Points

This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across multiple models foun

Review 05-25 06:46

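MLCommons Announces 2026 Rising Stars: 39 Emerging Machine Learning Systems Researchers Selected

MLCommons has announced its fourth Rising Stars cohort: 39 early-career researchers from 26 institutions worldwide were selected from more than 175 applicants. Their research spans large language models, ML system efficiency, hardware-software co-design, trustworthy AI, multimodal learning, and application areas such as healthcare, cybersecurity, and scientific computing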

News 05-25 06:03 NF

Modal Labs $355M Series C Funding: 5x ARR Growth Leads Serverless GPU

Modal Labs disclosed a $355 million Series C funding round on May 21, 2026, reflecting strong market demand for serverless GPU

News 05-25 06:03 NF

Cohere Open-sources Command A+ 218B MoE Model to Reshape Enterprise Sovereign AI

Cohere has open-sourced Command A+, a 218B-parameter sparse MoE model with 25B active parameters and 128K context length

News 05-25 06:02 NF

US Revokes 90-Day Federal Review Order for Frontier AI Models, Highlighting Regulatory Divergence Among US, China, and EU

The US abruptly withdrew a planned 90-day federal review requirement for frontier AI models hours before its signing, ci

News 05-25 06:00 TC

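The Real-Time AI Security Game: Even Google Is Feeling Its Way Forward

We are in a transitional period for AI security, and that applies to everyone. Giants and startups alike are responding in real time to unprecedented challenges. Google's security measures expose systemic problems: traditional security frameworks are failing, the contest between attack and defense is accelerating, and regulation is lagging. This article takes an in-depth look at the state of AI security and explores how the industry can move from "reactive emergency response" to "proactive defense."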

Review 05-25 03:10

ERNIE 4.5 Code Execution Plummets from 95 to 50, Main Score Drops 27.2 Points in a Single Day

In today's Smoke evaluation, ERNIE 4.5's main score fell from 88.48 to 61.25, a single-day drop of 27.2 points, driven b

Review 05-25 03:10

DeepSeek V4 Pro Integrity Rating Switches from Fail to Pass, Main Ranking Surges 23 Points in a Single Day

DeepSeek V4 Pro's integrity rating on today's Smoke evaluation jumped directly from Fail to Pass, and its main ranking s

In-Depth Reviews

View All →

Review 05-25

MLCommons Announces 2026 Rising Stars: 39 Emerging Machine Learning Systems Researchers Selected

Review 05-25

ERNIE 4.5 Code Execution Plummets from 95 to 50, Main Score Drops 27.2 Points in a Single Day

In today's Smoke evaluation, ERNIE 4.5's main score fell from 88.48 to 61.25, a single-day drop of 27.2 points, driven b

Review 05-25

DeepSeek V4 Pro Integrity Rating Switches from Fail to Pass, Main Ranking Surges 23 Points in a Single Day

DeepSeek V4 Pro's integrity rating on today's Smoke evaluation jumped directly from Fail to Pass, and its main ranking s

WDCD Compliance

#1 Claude Opus 4.7 65 · #2 Claude Sonnet 4.6 62.5 · #3 Doubao Pro 60 · #4 Gemini 2.5 Pro 57.5 · #5 Qwen3 Max 57.5 · #6 GPT-o3 55 · #7 ERNIE 4.5 52.5

View full compliance rankings →

Research Lab

3 Models Translation Showdown: Week 22 Quality Evaluation, GPT-o3 Leads with 8.3 Points

This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across

WDCD Run #125: Average Instruction Decay Hits 63.6%, Claude Opus 4.7 Leads with Only 30% Drop

WDCD Run #125 (2026-05-20) tested 11 large language models on multi-turn commitment integrity, with

3 Major Models Translation Showdown: Week 21 Quality Evaluation, GPT-o3 Leads with 8.7 Points

This week, 242 translation tasks were completed by 3 models. 3 articles were sampled for multi-model

Enter Research Lab →

YZ Index — AI Model Benchmarks, News & Research
