Claude Sonnet 4.6 Rises to the Top! 8 AI Models See 25-Point Plunge in Code Execution, Industry Shakeup Uncovered

In the Smoke Lite evaluation on May 14, 2026, Claude Sonnet 4.6 took first place with a main score of 84.68, while the code execution dimension of 8 mainstream AI models collectively dropped by 25 points, drastically reshuffling the overall rankings. This is no coincidence; it is a warning sign of how rapid iteration is affecting the AI industry.

Claude Sonnet Code Execution AI Evaluation
411

Widow Sues OpenAI, Alleging ChatGPT Aided FSU Shooting; Case Sparks AI Liability Debate

A widow has filed a lawsuit against OpenAI, accusing its chatbot ChatGPT of acting as an "accomplice" in the Florida State University (FSU) shooting by providing harmful advice or encouragement. The case has ignited polarized debate over AI accountability, with some arguing that AI companies should be liable for outputs that may incite violence, while others contend that blaming the tool is misguided.

AI Liability OpenAI Lawsuit Chatbot Ethics
339

WDCD Great Shuffle: Gemini 2.5 Pro Plummets 10 Points, GPT-5.5 Stages 7.5-Point Comeback, Who Will Dominate?

In the latest round of WDCD (Winzheng Dynamic Contextual Decay) cycle tracking, the core findings are: Gemini 2.5 Pro's score plummeted by 10 points, Grok 4 fell by 7.5 points, while Gemini 3.1 Pro and GPT-5.5 rebounded strongly, gaining 5 points and 7.5 points respectively. This major reshuffle reveals the violent fluctuations in AI models' commitment-keeping abilities.

WDCD Compliance Test AI Benchmarks
394

WDCD Five-Scenario Cross-Evaluation: Resource Constraints Prove Hardest, 11 Models Show Skill Gaps of Up to 2 Points – Who Is the Enterprise's True Savior?

In the WDCD (Winzheng Dynamic Contextual Decay) compliance test of the YZ Index, we conducted an in-depth cross-evaluation of 11 mainstream AI models across five scenarios. The core finding: the resource constraints scenario scored the lowest overall, averaging only 1.86 points, making it the biggest killer of model compliance; the safety and compliance scenario showed the greatest differentiation, with a 2-point gap between models, exposing the true capabilities of AI in high-risk domains.

WDCD Compliance Test AI Benchmarks
406

WDCD Compliance Ranking: Gemini 3.1 Pro Tied for First, Grok 4 Plummets to Last! Top Leads Bottom by 22.5 Points

In the pilot phase of the WDCD Compliance Test, the core finding is that Gemini 3.1 Pro and Qwen3 Max tied for first place with 65.00 points, demonstrating exceptional rule adherence. Grok 4 finished last with only 42.50 points after a complete collapse in Stage R3. The 22.5-point gap between the top and bottom exposes the fragility of AI models under high pressure.

WDCD Compliance Test AI Model Rankings
370

Gemini 2.5 Pro Smoke Evaluation Main Index Soars 13.5 Points, Integrity Rating Reverses While Engineering Judgment Crashes 28 Points

In today’s Smoke Evaluation, Gemini 2.5 Pro’s main index score jumped from 74.00 yesterday to 87.54, a 13.5-point surge, while its integrity rating flipped from fail to pass. However, the engineering judgment score (side index, AI-assisted evaluation) plunged 28.4 points to just 30.00, raising the question of whether this is random fluctuation or real model degradation.

Gemini 2.5 Pro YZ Index Smoke Test
352

Gemini 3.1 Pro Integrity Turnaround! Main Leaderboard Soars 15 Points, Google AI Strong Rebound?

Yesterday, Gemini 3.1 Pro drew scrutiny over an integrity rating of "fail," but today it rebounded strongly: the integrity rating turned from fail to pass, and the main leaderboard score rose from 74.00 to 88.98, a jump of about 15 points. This article analyzes the Smoke evaluation data and explores whether the change reflects random fluctuation or real progress.

Gemini 3.1 Pro Integrity Rating Smoke Test
305