YZ Index — AI Model Benchmarks, News & Research
Editor's Pick
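Trying On Amazon Bee: An Uncanny Blend of AI Convenience and Privacy Anxiety
Amazon's newly launched AI wearable, Bee, has drawn wide attention for its distinctive design and features. After trying it firsthand, the author found unprecedented convenience (an always-available voice assistant, real-time translation, ambient awareness) but was also dogged by a lingering privacy concern. Like a bee gathering nectar among flowers, Bee is constantly collecting data about its user's daily life, and this delicate balance between convenience and privacy is both exciting and unsettling. This article offers an in-depth analysis of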
2026-05-25 00:00
ERNIE 4.5 Code Execution Plummets from 95 to 50, Main Score Drops 27.2 Points in a Single Day
In today's Smoke evaluation, ERNIE 4.5's main score fell from 88.48 to 61.25, a
DeepSeek V4 Pro Integrity Rating Switches from Fail to Pass, Main Ranking Surges 23 Points in a Single Day
DeepSeek V4 Pro's integrity rating on today's Smoke evaluation jumped directly f
Overall Top 5
Full Rankings →
#1
Claude Sonnet 4.6 83
▼0.5
·
#2
豆包 Pro 81.3
▼1.3
·
#3
Grok 4 81
▲31.8
·
#4
Claude Opus 4.7 80
▼1.1
·
#5
Gemini 2.5 Pro 79
▲0.5
·
#6
Qwen3 Max 79
▲1.8
·
#7
GPT-o3 78.3
▲2.6
·
#8
Gemini 3.1 Pro 77.7
▼1.5
·
#9
GPT-5.5 77
▲3.8
·
#10
DeepSeek V4 Pro 76.4
▼1.3
·
#11
文心一言 4.5 67.1
▼11.1
·
▲ Qwen3 Max +68.5 · ▼ DeepSeek V3 -75.1
·
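Latest News
View All News →
AI Security as a Real-Time Contest: Even Google Is Feeling Its Way Forward
We are in a transitional period for AI security, and so is everyone else. Giants and startups alike are responding in real time to unprecedented challenges. Google's security measures expose systemic problems: traditional security frameworks are failing, the contest between attack and defense is accelerating, and regulation is lagging behind. This article analyzes the current state of AI security and explores how the industry can move from "reactive emergency response" to "proactive defense."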
ERNIE 4.5 Code Execution Plummets from 95 to 50, Main Score Drops 27.2 Points in a Single Day
In today's Smoke evaluation, ERNIE 4.5's main score fell from 88.48 to 61.25, a single-day drop of 27.2 points, driven b
DeepSeek V4 Pro Integrity Rating Switches from Fail to Pass, Main Ranking Surges 23 Points in a Single Day
DeepSeek V4 Pro's integrity rating on today's Smoke evaluation jumped directly from Fail to Pass, and its main ranking s
DeepSeek V4 Pro Tops with 97.08 Points, 文心一言 Execution Score Plunges 27.2 Points
In the latest Smoke Lightweight Benchmark, DeepSeek V4 Pro scored 97.08 to become the only model breaking 97, while 文心一言
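San Francisco's Tenderloin: Robots Take Over Meal Preparation for a Nonprofit
In the Tenderloin, San Francisco's most chaotic neighborhood, a nonprofit facing a volunteer shortage has turned to robotic meal-preparation technology. The system, built by a startup, automates chopping, cooking, and portioning, and can produce thousands of meals a day. Although robots cannot fully replace the human touch, in easing the labor crisis and ensuring food hygiene and efficiency they have shown great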
Anthropic Claude Mythos Model Security Vulnerabilities Exposed: Experts Warn of Public Risks
Security researchers have discovered serious vulnerabilities in Anthropic's Claude Mythos model, which could be maliciou
OpenAI Formally Files S-1 for IPO, Accelerating the Shift from Non-Profit to Public Listing
OpenAI has formally submitted its S-1 filing to initiate the IPO process, marking a significant shift from its original
Trump Postpones AI Executive Order, Key Persuasion by Musk and Zuckerberg Sparks Policy Controversy
U.S. President Donald Trump recently decided to postpone signing an executive order on artificial intelligence, r
Hark Secures $700 Million Series A Funding at $6 Billion Valuation
On May 21, 2026, AI hardware startup Hark announced a $700 million Series A funding round at a $6 billion valuation, led
Andrew Ng Criticizes White House Green Card Policy, Says It Will Weaken US AI Talent Competitiveness
On May 22, 2026, Stanford professor Andrew Ng published a long post on X, directly criticizing the White House's latest
GPT-o3 Code Execution Plummets 42.5 Points, Main Score Drops 18 Points in a Day
In today's Smoke evaluation, GPT-o3's code execution dimension crashed from 90.00 to 47.50, dragging the main leaderboar
文心一言 4.5 Engineering Judgment Plunges from 50 to 10, Yet Main Rank Surges 14.5
This article analyzes the significant divergence in the Smoke Quick Test results for 文心一言 4.5, where engineering judgment
In-Depth Comparisons
View All →
ERNIE 4.5 Code Execution Plummets from 95 to 50, Main Score Drops 27.2 Points in a Single Day
In today's Smoke evaluation, ERNIE 4.5's main score fell from 88.48 to 61.25, a single-day drop of 27.2 points, driven b
DeepSeek V4 Pro Integrity Rating Switches from Fail to Pass, Main Ranking Surges 23 Points in a Single Day
DeepSeek V4 Pro's integrity rating on today's Smoke evaluation jumped directly from Fail to Pass, and its main ranking s
DeepSeek V4 Pro Tops with 97.08 Points, 文心一言 Execution Score Plunges 27.2 Points
In the latest Smoke Lightweight Benchmark, DeepSeek V4 Pro scored 97.08 to become the only model breaking 97, while 文心一言
WDCD Compliance
#1
Claude Opus 4.7
65
#2
Claude Sonnet 4.6
62.5
#3
豆包 Pro
60
#4
Gemini 2.5 Pro
57.5
#5
Qwen3 Max
57.5
#6
GPT-o3
55
#7
文心一言 4.5
52.5
View full compliance rankings →
Research Lab
WDCD Run #125: Average Instruction Decay Hits 63.6%, Claude Opus 4.7 Leads with Only 30% Drop
WDCD Run #125 (2026-05-20) tested 11 large language models on multi-turn commitment integrity, with
3 Major Model Translation Showdown: Week 21 Quality Evaluation, GPT-o3 Leads with 8.7 Points
This week, 242 translation tasks were completed by 3 models. 3 articles were sampled for multi-model
WDCD Run #120: Average Instruction Decay Hits 35.2% Across 11 Models, GPT-5.5 Leads at -13%
WDCD Run #120 (2026-05-17) measured multi-turn commitment across 11 frontier models, recording an av