Original AI News | Winzheng

Anthropic Receives $200 Million Partnership from Gates Foundation, Launches Claude for Small Business Services

On May 15, 2025, Anthropic officially announced a $200 million strategic partnership with the Bill & Melinda Gates Foundation, along with the launch of Claude for Small Business services. This initiative aims to democratize AI access for small and medium-sized enterprises, particularly in emerging markets.

OpenAI Launches Daybreak AI Tool: GPT-5.5 Auto-Patches Zero-Day Vulnerabilities, Ending 90-Day Policy

OpenAI officially unveiled the Daybreak AI system on May 15, powered by GPT-5.5, which autonomously discovers and patches zero-day vulnerabilities before attackers can exploit them. In collaboration with Cisco and Cloudflare, this tool marks the end of the traditional 90-day vulnerability disclosure policy.

Anduril Raises $5 Billion at $61 Billion Valuation: Technical Risks Behind Defense AI Capital Acceleration

Defense AI startup Anduril completed a $5 billion financing round on May 15, reaching a $61 billion valuation. The funds will be deployed into autonomous drone systems, battlefield decision-making AI, and command systems, though technical constraint risks remain under scrutiny.

Research Lab

WDCD Run #120: Average Instruction Decay Hits 35.2% Across 11 Models, GPT-5.5 Leads at -13%

WDCD Run #120 (2026-05-17) measured multi-turn commitment across 11 frontier models, recording an average instruction decay of 35.2% from Round 1 to Round 3. GPT-5.5 led the ranking at 71.7 points with only 13% decay.

Dramatic Shift in WDCD Cycle: GPT-5.5 Tops with 71.67 Points, Gemini Surges 14.2, Wenxin Crashes

In this WDCD cycle, GPT-5.5 re-establishes the ceiling of instruction adherence with an absolute score of 71.67, while Gemini 2.5 Pro's 14.2-point leap completely overturns the perception that Google models are weak in adherence. Meanwhile, Wenxin Yiyan 4.5 suffers a 7.5-point drop, signaling potential over-alignment issues.

Resource Constraints Become the Hardest Scenario in WDCD, Doubao Scores 3.5 Points in Business Rules, Surpassing GPT

The WDCD five-scenario evaluation reveals that the resource-constraints scenario is the hardest, with the lowest overall scores, while DoubaoPro achieves the highest score in business rules, demonstrating significant model specialization.

R3 Collapse Rate 93.3%! Grok 4 WDCD Three-Round Test: First Round Fully Compliant, Last Round Crashes

The WDCD three-round test reveals that model integrity drops to 30.6% under direct pressure in R3, with Grok 4 hitting a 93.3% collapse rate, exposing the fragility of safety alignment.

WDCD Commitment Ranking: GPT-5.5 Dominates with 71.67 Points, Grok 4 Trails at 52.5 Points

The WDCD Commitment Test reveals models' true performance under constraints through three rounds of dialogue. GPT-5.5 leads with 71.67 points, while Grok 4 scores only 52.5 points and ranks last, a gap of 19.17 points between the top and bottom.

Claude Sonnet 4.6 Drops 12.3 Points on Main Leaderboard, Material Constraint Plummets 27.3 Points in a Single Day

Claude Sonnet 4.6 showed abnormal results in today's Smoke test, with the material constraint dimension dropping sharply. The drop may be due to sampling variance but warrants further monitoring.

Claude Opus 4.7 Smoke Evaluation Main Score Plunges 9 Points, Material Constraint Drops 20 Points in a Single Day

In today's Smoke evaluation, Claude Opus 4.7's main score dropped by 9 points, from 97.75 to 88.75, primarily due to a sharp decline in the material constraint dimension from 95 to 75 points, a loss of 20 points in a single day.

7-Day Smoke Quick Test: Wenxin Yiyan Soars 53.4 Points, GPT-o3 Falls 7.8

This week's 7-day Smoke Quick Test data reveals polarization: Wenxin Yiyan surged 53.4 points while GPT-o3 fell 7.8 points.

Three Models Tie at 88.75 for First Place; Claude's Duo Plunges 12 Points; Smoke Rankings Undergo Major Shakeup

Today's Smoke Lite evaluation results show a three-way tie for first place at 88.75 points, while the Claude series suffered sharp declines. The shakeup signals that open models are rapidly closing the gap with closed-source leaders.

NTE Game Developer Confirms Ban on AI in Core Assets, Community Divided Over Quality vs Efficiency

The NTE game development team confirmed that future core assets and character art will not use AI technology, prioritizing quality and reputation. The community is divided over this decision.

Nvidia Releases 2.6B Open-Source World Model: Innovative Breakthrough Sparks Security Controversy

Nvidia has officially released a 2.6B-parameter open-source world model that supports controllable world generation from a single image, text, and trajectory, running on a single GPU. The release has drawn both praise for democratizing AI research and criticism over potential misuse for generating fake content.

Anthropic Calls for Aggressive US AI Policy Toward China, Sparks Heated Debate Over Safety Lab Positioning

Anthropic published a new paper on May 14 urging the US government to take more aggressive measures against China in AI. The company's shift from a cautious safety lab to a hawkish stance has sparked intense controversy.

GPT-5.5's Main Ranking Plunges 28 Points: Is It Real Degradation?

GPT-5.5's code execution score dropped from 100 to 50, causing a 28-point drop in the main ranking. But is this degradation or just sampling noise?

Gemini 2.5 Pro Drops 10 Points: Ability Intact, Credibility Fails

Gemini 2.5 Pro's credibility rating fell from pass to fail, causing a 10-point drop in the main ranking, even though its code execution score remained perfect.

Three Models Plunge by 28 Points, Claude Still Near Perfect Score

Today's YZ Index Smoke lightweight test reveals that three leading models suffered significant drops, while Claude models hold near-perfect scores, with structural advantages in code execution and material constraint.

Amazon Launches Shopping-Focused Alexa, E-commerce AI Moves to the Frontline

On May 13, 2026, Amazon launched "Alexa for Shopping," an AI-powered shopping assistant that integrates personalized recommendations, voice purchasing, price comparisons, and deal alerts within the Amazon ecosystem. The move signals a shift in e-commerce from search-based interfaces to conversational AI agents.

Claude Paid Plans to Include Monthly Usage Credits

Anthropic announced that, starting June 15, 2026, Claude paid plans will include monthly credits for programmatic tools such as the Claude Agent SDK and Claude Code GitHub Actions. The move aims to integrate Claude more deeply into development workflows and automation, lowering the barrier for developers to test real-world scenarios.