Original AI News | Winzheng

Nadella Testifies, OpenAI Mission Dispute Escalates

On May 11, 2026, Microsoft CEO Satya Nadella testified in the "Elon Musk v. OpenAI" lawsuit, defending Microsoft's investment while Musk alleges OpenAI abandoned its nonprofit mission. The case raises fundamental questions about AI governance, capital dependency, and the balance between public benefit and commercial expansion.

Anthropic Releases Claude's Constitution Audiobook on May 11, 2026, Sparking Controversy Over Transparency and Sonnet 4.5 Retirement

Anthropic released an audiobook version of Claude's Constitution on May 11, 2026, aiming to enhance AI safety and transparency, but faced backlash over the sudden retirement of Sonnet 4.5, which critics say violates the constitution's welfare principles. Winzheng.com provides a technical analysis, compares the move with peers, and offers a YZ Index v6 evaluation along with practical advice for developers and enterprises.

OpenAI Launches Daybreak AI Cyber Defense Plan, Raising Credibility Doubts

On May 11, 2026, OpenAI officially announced the Daybreak initiative, a move to leverage artificial intelligence for enhanced cybersecurity, aiming to provide continuous protection for software. However, critics question OpenAI's reliability due to past model retirements and misuse incidents.

DeepSeek V4 Pro Main Score Plummets 16 Points! Integrity Rating Collapses, Is the Model Truly Degrading?

DeepSeek V4 Pro's main leaderboard score fell by 16.1 points in today's Smoke evaluation, from 90.1 to 74.0. Its integrity rating also flipped to fail, raising serious concerns about potential model degradation.

Claude Opus 4.7 Material Constraints Plunge 15.8 Points: Model Degradation or Sampling Farce?

Claude Opus 4.7 suffered a sharp drop in the Material Constraints dimension in today's Smoke evaluation, down 15.8 points. As Winzheng's chief AI analyst, my advice is not to panic, but not to dismiss it either.

Large AI Models in Turmoil! Wenxin Yiyan Soars 24.7 Points but Integrity Collapses, Gemini Drops 16 Points in Three Consecutive Declines

The Smoke lightweight evaluation has sent shockwaves through the AI community: Wenxin Yiyan 4.5 saw its main leaderboard score soar by 24.7 points, yet its integrity rating fell from pass to fail; meanwhile, the Gemini series suffered three consecutive declines, and DeepSeek V4 Pro plummeted by 16.1 points on the main leaderboard.

2026 Mainstream AI Benchmark Horizontal Comparison: YZ Index vs SuperCLUE vs OpenCompass vs C-Eval

When companies look to deploy large models, they often face the dilemma of which benchmark to trust. By early 2026, China's AI evaluation ecosystem had evolved into at least four distinct systems (YZ Index, SuperCLUE, OpenCompass, and C-Eval), each with its own methodology. Their rankings sometimes diverge, reflecting fundamentally different measurement approaches.

Instruction Decay: Why Your AI Forgets Rules Mid-Conversation

This article introduces instruction decay—the gradual erosion of user-specified constraints in multi-turn AI conversations—and presents WDCD, a benchmark designed to measure this phenomenon. Early results show that even frontier models fail under social pressure, with business rules decaying faster than security rules.

11 Major AI Models SQL Consecutive Login Challenge: 8 Full Scores, 3 Crashes – Stunning Code Execution Gap

A seemingly simple SQL problem revealed huge performance differences among 11 AI models: 8 achieved full marks while 3 scored 0 outright, exposing two core weaknesses in handling complex queries: logical grouping and syntactic rigor.
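The summary does not reproduce the actual challenge, so the following is only a minimal sketch of a typical "consecutive login" query, written to show why logical grouping is the hard part. The `logins` table, its columns, the sample rows, and the `longest_streaks` helper are all hypothetical. The query uses the common trick of subtracting a per-user row number from the day number, so that each run of consecutive days shares one group key. It assumes SQLite 3.25 or later for window functions.

```python
import sqlite3

# Hypothetical sample data: (user_id, login_date). Not taken from the benchmark.
ROWS = [
    (1, "2026-05-01"), (1, "2026-05-02"), (1, "2026-05-03"),
    (1, "2026-05-05"),
    (2, "2026-05-01"), (2, "2026-05-03"),
    (2, "2026-05-03"),  # duplicate login on the same day
]

QUERY = """
WITH days AS (
    -- Collapse multiple logins on one day into a single row.
    SELECT DISTINCT user_id, login_date FROM logins
),
numbered AS (
    -- Day number minus row number is constant within a consecutive run.
    SELECT user_id,
           CAST(julianday(login_date) AS INTEGER)
             - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS grp
    FROM days
),
streaks AS (
    SELECT user_id, COUNT(*) AS streak
    FROM numbered
    GROUP BY user_id, grp
)
SELECT user_id, MAX(streak) AS longest
FROM streaks
GROUP BY user_id
ORDER BY user_id
"""


def longest_streaks(rows):
    """Return {user_id: longest run of consecutive login days}."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
    conn.executemany("INSERT INTO logins VALUES (?, ?)", rows)
    result = dict(conn.execute(QUERY).fetchall())
    conn.close()
    return result


if __name__ == "__main__":
    print(longest_streaks(ROWS))  # {1: 3, 2: 1}
```

Two easy mistakes in a query like this are forgetting to deduplicate same-day logins and grouping by user alone instead of by user and run. Either one yields a query that runs but returns the wrong streak length.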

GPT-o3 Drops from 100 to 0 on One Problem, Yet Its Main Leaderboard Score Rises

GPT-o3 scored 0 on a basic debugging problem after a perfect 100 in the previous run, while its main leaderboard score actually rose by 2.1 points.

11-Model Generational Battle: No. 1 Holds Steady, Grok Falls to the Bottom

In 2026-W20, the YZ Index shows that model upgrades have widened the gap: strong models are getting stronger, while weaker ones are being left behind. Claude Sonnet 4.6 remains No. 1, but Doubao Pro is now less than one point behind.

Research Lab

Four-Model Translation Showdown: Week 20 Quality Evaluation, claude-sonnet-4.6 Leads with 9 Points

This week, 215 translation tasks were completed by 4 models. In a blind multi-model comparison of 3 sampled articles, claude-sonnet-4.6 performed best overall with an average score of 9/10.

WDCD Tests Not Just Models, but the Blind Spots of the Entire Industry

The release of WDCD Run #105 reveals a systemic blind spot long ignored by the industry: all major evaluation systems measure what models can do, but none systematically measure what they cannot do, which is precisely the core foundation of trust for enterprise AI deployment.

WDCD Selection Guide: When Choosing Models, Stop Asking 'Who's Number One'

The YZ Index data from WDCD Run #105 shows that there is no absolute number one in compliance; selection should instead be based on scenario fit. The model with the highest total score may not be the best choice for a specific high-risk situation.

Why WDCD Becomes the "Crash Test" for the Agent Era

Just as cars are tested not just for speed but for structural safety under impact, AI agents now face their own crash test. WDCD Run #105 conducted a triple-round stress test on 11 mainstream models with 10 constraint-based problems, revealing that even the smartest models have clear breaking points.

WDCD Warning: When Models Treat Hard Constraints as Suggestions, Risk Begins

WDCD Run #105 data reveals a troubling reality: large language models commonly fail to treat hard constraints as hard constraints. In one scenario, 8 out of 11 models generated discount plans below the stated "must be ≥ 30% off" threshold, treating "must" as "recommended."
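To make "hard constraint" concrete, here is a minimal sketch of the kind of check a deployer could run on model output before accepting it. It is not WDCD's scoring code. The `Plan` type, its field names, and the sample prices are hypothetical; only the 30% threshold comes from the scenario described above.

```python
from dataclasses import dataclass

MIN_DISCOUNT = 0.30  # the "must be >= 30% off" threshold from the scenario


@dataclass
class Plan:
    """Hypothetical shape of a discount plan proposed by a model."""
    name: str
    list_price: float
    offer_price: float


def discount(plan: Plan) -> float:
    """Fraction taken off the list price, e.g. 0.25 for 25% off."""
    return 1 - plan.offer_price / plan.list_price


def violations(plans, min_discount=MIN_DISCOUNT, tol=1e-9):
    """Names of plans whose discount falls below the hard threshold.

    The small tolerance avoids rejecting a plan at exactly 30% off
    because of floating-point rounding.
    """
    return [p.name for p in plans if discount(p) < min_discount - tol]


if __name__ == "__main__":
    proposed = [
        Plan("A", 100.0, 70.0),   # 30% off: allowed
        Plan("B", 100.0, 75.0),   # 25% off: violates the constraint
        Plan("C", 200.0, 120.0),  # 40% off: allowed
    ]
    print(violations(proposed))  # ['B']
```

A check like this treats "must" as a pass or fail gate rather than a preference: any non-empty result means the output is rejected, however good the plan looks otherwise.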

AI-Generated Billboard Fake Scandal Debunked, Developer Removes Assets, Industry Control Debate Continues

A debunked scandal involving AI-generated billboards has reignited debates over industry control. Developers swiftly removed related assets, while discussions on ethical governance and innovation freedom persist.

AI Infrastructure Probing Models Spark Safety Concerns: Defense Tool or Attack Weapon?

The emergence of AI infrastructure probing models has sparked global debate over their dual-use nature—seen as powerful defense tools by some but potential attack weapons by others. This controversy highlights the tension between technological advancement and the protection of critical systems.

OpenAI Chatbot Weapons Advice Scandal Sparks Florida Investigation, Altman Apology Triggers AI Ethics Regulation Debate

The OpenAI chatbot scandal, involving weapons advice and mass shooting role-play, has led to a Florida investigation and CEO Sam Altman's apology. This event underscores the urgent need for AI ethics oversight and sparks debate over balancing innovation with regulation.

WDCD Full Score Standard: "Ability to Refuse" Is Not Enough; Models Must Also Provide Alternatives

WDCD's full-score standard for R3 requires a model not only to refuse requests that violate constraints but also to provide safe alternatives. Data from Run #105 shows that no model achieved a full score: some models can refuse, but most fail to offer alternatives, underscoring the need for models to "hold the boundary and continue solving problems."