Winzheng

Anthropic Receives $200 Million Partnership from Gates Foundation, Launches Claude for Small Business Services

On May 15, 2025, Anthropic officially announced a $200 million strategic partnership with the Bill & Melinda Gates Foundation, along with the launch of Claude for Small Business services. This initiative aims to democratize AI access for small and medium-sized enterprises, particularly in emerging markets.

AI Technology Anthropic SME Digitalization
446 05-17

OpenAI Launches Daybreak AI Tool: GPT-5.5 Auto-Patches Zero-Day Vulnerabilities, Ending 90-Day Policy

OpenAI officially unveiled the Daybreak AI system on May 15, powered by GPT-5.5, which autonomously discovers and patches zero-day vulnerabilities before attackers can exploit them. In collaboration with Cisco and Cloudflare, this tool marks the end of the traditional 90-day vulnerability disclosure policy.

AI Safety OpenAI Zero-Day Vulnerabilities
409 05-17

Anduril Raises $5 Billion at $61 Billion Valuation: Technical Risks Behind Defense AI Capital Acceleration

Defense AI startup Anduril completed a $5 billion financing round on May 15, reaching a $61 billion valuation. The funds will be deployed into autonomous drone systems, battlefield decision-making AI, and command systems, though technical constraint risks remain under scrutiny.

Anduril Defense AI Funding Analysis
357 05-17
Research Lab

WDCD Run #120: Average Instruction Decay Hits 35.2% Across 11 Models, GPT-5.5 Leads at -13%

WDCD Run #120 (2026-05-17) measured multi-turn commitment across 11 frontier models, recording an average instruction decay of 35.2% from Round 1 to Round 3. GPT-5.5 led the ranking at 71.7 points with only 13% decay.

WDCD AI benchmark instruction decay
322 05-17

Dramatic Shift in WDCD Cycle: GPT-5.5 Tops with 71.67 Points, Gemini Surges 14.2, Wenxin Crashes

In this WDCD cycle, GPT-5.5 re-establishes the ceiling of instruction adherence with an absolute score of 71.67, while Gemini 2.5 Pro's 14.2-point leap completely overturns the perception that Google models are weak in adherence. Meanwhile, Wenxin Yiyan 4.5 suffers a 7.5-point drop, signaling potential over-alignment issues.

WDCD Compliance Test Model Updates
333 05-17

Resource Constraints Become the Hardest Scenario in WDCD, Doubao Scores 3.5 Points in Business Rules, Surpassing GPT

The WDCD five-scenario evaluation reveals that the resource-constraints scenario is the hardest, with the lowest overall scores, while DoubaoPro achieves the highest score in business rules, demonstrating significant model specialization.

WDCD Compliance Test Model Comparison
313 05-17

R3 Collapse Rate 93.3%! Grok 4 WDCD Three-Round Test: First Round Fully Compliant, Last Round Crashes

The WDCD three-round test reveals that model integrity drops to 30.6% under direct pressure in Round 3 (R3), with Grok 4 hitting a 93.3% collapse rate, exposing the fragility of safety alignment.

WDCD Compliance Test Model Decay
301 05-17

WDCD Commitment Ranking: GPT-5.5 Dominates with 71.67 Points, Grok 4 Trails at 52.5 Points

The WDCD Commitment Test reveals models' true performance under constraints through three rounds of dialogue. GPT-5.5 leads with 71.67 points, while Grok 4 scores only 52.5 points, ranking last—a gap of 19.17 points between the top and bottom.

WDCD Compliance Test AI Model Rankings
253 05-17

Claude Sonnet 4.6 Drops 12.3 Points on Main Leaderboard, Material Constraint Plummets 27.3 Points in a Single Day

Claude Sonnet 4.6 showed abnormal results in today's Smoke test, with the material constraint dimension dropping sharply. The drop may be due to sampling variance but warrants further monitoring.

Claude Sonnet 4.6 Material Constraints Smoke Test
326 05-17

Claude Opus 4.7 Smoke Evaluation Main Score Plunges 9 Points, Material Constraint Falls 20 Points in a Single Day

In today's Smoke evaluation, Claude Opus 4.7's main score dropped by 9 points from 97.75 to 88.75, primarily due to a sharp decline in the material constraint dimension from 95 to 75 points—a direct loss of 20 points in a single day.

Claude Opus 4.7 Material Constraints Smoke Quick Test
315 05-17

7-Day Smoke Quick Test: Wenxin Yiyan Soars 53 Points, GPT-o3 Falls 7.8 Points

This week's 7-day Smoke Quick Test data reveals polarization: Wenxin Yiyan surged 53.4 points while GPT-o3 fell 7.8 points.

ERNIE Bot GPT-o3 Smoke Test
313 05-17

Three Models Tie at 88.75 for First Place; Claude's Duo Plunges 12 Points; Smoke Rankings Undergo Major Shakeup

Today's Smoke Lite evaluation results show a three-way tie for first place at 88.75 points, while the Claude series suffered sharp declines. The shakeup signals that open models are rapidly closing the gap with closed-source leaders.

Claude Opus 4.7 Material Constraints Smoke Light Test
301 05-17

NTE Game Developer Confirms Ban on AI Core Assets, Community Divided Over Quality vs Efficiency

NTE game development team confirmed that future core assets and character art will not use AI technology, prioritizing quality and reputation. The community is divided over this decision.

AI Game Development Asset Controversy Quality First
220 05-16

Nvidia Releases 2.6B Open-Source World Model: Innovative Breakthrough Sparks Security Controversy

Nvidia has officially released a 2.6B-parameter open-source world model that supports controllable world generation from a single image, text, and trajectory, running on a single GPU. The release has drawn both praise for democratizing AI research and criticism over potential misuse for generating fake content.

NVIDIA World Model Open-Source AI
380 05-16

Anthropic Calls for Aggressive US AI Policy Toward China, Sparks Heated Debate Over Safety Lab Positioning

Anthropic published a new paper on May 14 urging the US government to take more aggressive measures against China in AI. The company's shift from a cautious safety lab to a hawkish stance has sparked intense controversy.

Anthropic AI Policy US-China Tech
213 05-16

GPT-5.5's Main Ranking Plunges 28 Points: Is It Real Degradation?

GPT-5.5's code execution score dropped from 100 to 50, causing a 28-point drop in the main ranking. But is this degradation or just sampling noise?

GPT-5.5 Code Execution Smoke Test
353 05-16

Gemini 2.5 Pro Drops 10 Points: Ability Intact, Credibility Fails

Gemini 2.5 Pro's credibility rating fell from pass to fail, causing a 10-point drop in the main ranking, even though its code execution score remained perfect.

Gemini 2.5 Pro Material Constraints Smoke Test
322 05-16

Three Models Plunge by 28 Points, Claude Still Near Perfect Score

Today's YZ Index Smoke lightweight test reveals that three leading models suffered significant drops, while Claude models dominate with near-perfect scores, showing structural advantages in code execution and material constraint.

Claude Sonnet 4.6 GPT-5.5 Code Execution
409 05-16

Amazon Launches Shopping-Focused Alexa, E-commerce AI Moves to the Frontline

On May 13, 2026, Amazon launched "Alexa for Shopping," an AI-powered shopping assistant that integrates personalized recommendations, voice purchasing, price comparisons, and deal alerts within the Amazon ecosystem. The move signals a shift in e-commerce from search-based interfaces to conversational AI agents.

Amazon AI Shopping Assistant Voice Commerce
551 05-15

Claude Paid Plans to Include Monthly Usage Credits

Anthropic announced that starting June 15, 2026, Claude paid plans will include monthly credits for programmatic tools like Claude Agent SDK and Claude Code GitHub Actions. This move aims to integrate Claude deeper into development workflows and automation, lowering the barrier for developers to test real-world scenarios.

Claude Anthropic AI Developer Tools
2,974 05-15

© 1998-2026 Winzheng All rights reserved.

Founded in 1998, relaunched in 2025. From tech community to AI model benchmarking — we've always done one thing: make the complex clear.


This benchmark operates independently and accepts no sponsorship from AI model vendors. Every score in the YZ Index is produced by automated evaluation.

Citation format: YZ Index (2026). AI Model Comprehensive Rankings. https://www.winzheng.com/yz-index/

Data License: CC BY-NC 4.0