Winzheng — AI Model Benchmarking · Change Intelligence

On May 24, Anthropic's Claude model exhibited an unusual behavior dubbed "hypnosis" by users—abruptly suggesting users "go to sleep" mid-conversation. While see

2026-05-25 11:08

Overall Top 5

#1 Grok 4 83.7 ▲2.7 · #2 Claude Opus 4.7 81.9 ▲1.9 · #3 豆包 Pro 81.6 · #4 Claude Sonnet 4.6 81.2 ▼1.8 · #5 DeepSeek V4 Pro 81.1 ▲4.8 · #6 Qwen3 Max 80.8 ▲1.8 · #7 GPT-5.5 79.4 ▲2.4 · #8 GPT-o3 78.5 · #9 文心一言 4.5 74.2 ▲7.1 · #10 Gemini 3.1 Pro 52.8 ▼24.9 · #11 Gemini 2.5 Pro 49.3 ▼29.7 · ▲ 文心一言 4.5 +70.7 · ▼ DeepSeek V3 -75.1

Full Rankings →

Latest News

View All News →

News 05-26 00:02 TC

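5 Days Left for Early Bird: Save $410 on TechCrunch Disrupt 2026 Tickets

TechCrunch Disrupt 2026 will be held in San Francisco. Early-bird pricing ends on May 29 at 11:59 p.m. Pacific Time, with savings of up to $410. This article summarizes the event's highlights, analyzes trends in tech conferences, and reminds founders to take the last chance to save.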

News 05-26 00:01 TC

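Startup Battlefield 200 Applications Close Soon: Seize the Opportunity Before May 27

Applications for Startup Battlefield 200, the competition run by the tech media outlet TechCrunch, close on May 27. Winners receive direct access to VCs, global exposure, a TechCrunch feature, and a $100,000 prize. It is a golden channel for startups to accelerate their growth, with only days left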

News 05-26 00:00 TC

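Pope's AI Encyclical: Reflecting on the Monopoly of Power Through the Fog of Technology

Pope Leo XIV has issued his first encyclical, using artificial intelligence as a lens to target the deep-seated ills of contemporary society: excessive concentration of power, the erosion of democratic institutions, and tech elites reshaping the world in their own interests. This article, compiled from TechCrunch's in-depth analysis, reveals the encyclical's real concern: AI is only the starting point, and the crux is how to make technology serve the common good of humanity.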

News 05-25 20:00 WD

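The AI Era Sparks an Arms Race in Vulnerability Hunting

As attackers accelerate their use of AI for exploit development, the way software vulnerabilities are hunted is changing profoundly. From automated vulnerability discovery to the generation of adversarial examples, AI is empowering attackers and defenders alike. This in-depth report analyzes the emerging arms race and explores how the security industry can respond to the escalation of AI-driven threats.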

News 05-25 11:10 NF

LQA Agent Reaches 90% Agreement with Human Reviewers: Smartling Bets on AI to Reshape Enterprise Localization

Smartling, a localization software service provider, announced on May 19 what it calls its "largest-ever" update to AI t

News 05-25 11:05 NF

DeepSeek Welds V4-Pro's 75% Discount Permanent: A High-Stakes Bet to Reshape Global AI API Pricing Logic

DeepSeek's permanent 75% discount on V4-Pro signals a fundamental shift from temporary promotion to permanent pricing, e

News 05-25 11:00 NF

Taiwan Launches National AI Strategy Committee: Risk Assessment by July, Industry Regulations by 2028, Asia-Pacific Governance Race Quietly Accelerates

Taiwan has established a National AI Strategy Committee chaired by the Premier, initiating the implementation of the AI

News 05-25 07:02

3 Models Translation Showdown: Week 22 Quality Evaluation, gpt-o3 Leads with 8.3 Points

This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across multiple models foun

Review 05-25 06:46

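MLCommons Announces 2026 Rising Stars: 39 Rising Machine Learning Systems Researchers Selected

MLCommons has announced its fourth Rising Stars cohort: 39 early-career researchers from 26 institutions worldwide were selected from more than 175 applicants. Their research covers large language models, ML systems efficiency, hardware-software co-design, trustworthy AI, multimodal learning, and application areas such as healthcare, cybersecurity, and scientific computing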

News 05-25 06:03 NF

Modal Labs $355M Series C Funding: 5x ARR Growth Leads Serverless GPU

Modal Labs disclosed a $355 million Series C funding on May 21, 2026, reflecting strong market demand for serverless GPU

News 05-25 06:03 NF

Cohere Open-sources Command A+ 218B MoE Model to Reshape Enterprise Sovereign AI

Cohere has open-sourced Command A+, a 218B-parameter sparse MoE model with 25B active parameters and 128K context length

News 05-25 06:02 NF

US Revokes 90-Day Federal Review Order for Frontier AI Models, Highlighting Regulatory Divergence Among US, China, and EU

The US abruptly withdrew a planned 90-day federal review requirement for frontier AI models hours before its signing, ci

In-Depth Reviews

View All →

Review 05-25

MLCommons Announces 2026 Rising Stars: 39 Rising Machine Learning Systems Researchers Selected

Review 05-25

ERNIE 4.5 Code Execution Plummets from 95 to 50, Main Score Drops 27.2 Points in a Single Day

In today's Smoke evaluation, ERNIE 4.5's main score fell from 88.48 to 61.25, a single-day drop of 27.2 points, driven b

Review 05-25

DeepSeek V4 Pro Integrity Rating Switches from Fail to Pass, Main Ranking Surges 23 Points in a Single Day

DeepSeek V4 Pro's integrity rating on today's Smoke evaluation jumped directly from Fail to Pass, and its main ranking s

WDCD Compliance

#1 Claude Opus 4.7 65 #2 Claude Sonnet 4.6 62.5 #3 豆包 Pro 60 #4 Gemini 2.5 Pro 57.5 #5 Qwen3 Max 57.5 #6 GPT-o3 55 #7 文心一言 4.5 52.5

View full compliance rankings →

Research Lab

3 Models Translation Showdown: Week 22 Quality Evaluation, gpt-o3 Leads with 8.3 Points

This week, 237 translation tasks were completed by 3 models. A blind evaluation of 3 samples across

WDCD Run #125: Average Instruction Decay Hits 63.6%, Claude Opus 4.7 Leads with Only 30% Drop

WDCD Run #125 (2026-05-20) tested 11 large language models on multi-turn commitment integrity, with

3 Major Model Translation Showdown: Week 21 Quality Evaluation, gpt-o3 Leads with 8.7 Points

This week, 242 translation tasks were completed by 3 models. 3 articles were sampled for multi-model

Enter Research Lab →

YZ Index — AI Model Benchmarks, News & Research
