Claude Sonnet 4.6 Rises to the Top! 8 AI Models See 25-Point Plunge in Code Execution, Industry Shakeup Uncovered

In the Smoke Lite evaluation of May 14, 2026, Claude Sonnet 4.6 rose to the top with a main score of 84.68, while the code execution scores of 8 mainstream AI models collectively dropped by 25 points, drastically reshuffling the overall rankings. This is no coincidence: it is a warning sign of how fast the AI industry is iterating.

Claude Sonnet Code Execution AI Evaluation
434

WDCD Great Shuffle: Gemini 2.5 Pro Plummets 10 Points, GPT-5.5 Stages 7.5-Point Comeback, Who Will Dominate?

The latest round of WDCD (Winzheng Dynamic Contextual Decay) cycle tracking shows Gemini 2.5 Pro's score plummeting by 10 points and Grok 4 falling by 7.5 points, while Gemini 3.1 Pro and GPT-5.5 rebounded strongly, gaining 5 points and 7.5 points respectively. This reshuffle reveals sharp fluctuations in how well AI models keep their commitments.

WDCD Compliance Test AI Benchmarks
416

WDCD Five-Scenario Cross-Evaluation: Resource Constraints Prove Hardest, 11 Models Show Skill Gaps of Up to 2 Points – Who Is the Enterprise's True Savior?

In the WDCD (Winzheng Dynamic Contextual Decay) compliance test of the YZ Index, we cross-evaluated 11 mainstream AI models across five scenarios. The core finding: the resource constraints scenario scored the lowest overall, averaging only 1.86 points, which makes it the biggest obstacle to model compliance. The safety and compliance scenario showed the greatest differentiation, with a 2-point gap between models, exposing the true capabilities of AI in high-risk domains.

WDCD Compliance Test AI Benchmarks
431

WDCD Compliance Ranking: Gemini 3.1 Pro Tied for First, Grok 4 Plummets to Last! Top Leads Bottom by 22.5 Points

In the pilot phase of the WDCD Compliance Test, Gemini 3.1 Pro and Qwen3 Max tied for first place with 65.00 points, demonstrating exceptional rule adherence. Grok 4 finished last with only 42.50 points after a complete collapse in Stage R3. The 22.5-point gap between the top and bottom exposes the fragility of AI models under high pressure.

WDCD Compliance Test AI Model Rankings
400

Gemini 2.5 Pro Smoke Evaluation Main Index Soars 13.5 Points, Integrity Rating Reverses While Engineering Judgment Crashes 28 Points

In today's Smoke Evaluation, Gemini 2.5 Pro's main index score jumped from 74.00 yesterday to 87.54, a 13.5-point surge, while its integrity rating flipped from fail to pass. However, the engineering judgment score (side index, AI-assisted evaluation) plunged 28.4 points to just 30.00, raising the question of whether this is random fluctuation or real model degradation.

Gemini 2.5 Pro YZ Index Smoke Test
380

Gemini 3.1 Pro Integrity Turnaround! Main Leaderboard Soars 15 Points: A Strong Rebound for Google AI?

Yesterday, Gemini 3.1 Pro drew scrutiny over an integrity rating of "fail," but today it rebounded strongly: the integrity rating turned from fail to pass, and the main leaderboard score rose from 74.00 to 88.98, a jump of about 15 points. This article analyzes the Smoke evaluation data and explores whether the change reflects random fluctuation or real progress.

Gemini 3.1 Pro Integrity Rating Smoke Test
328

Large AI Models in Turmoil! Wenxin Yiyan Soars 24.7 Points but Integrity Collapses, Gemini Drops 16 Points in Three Consecutive Declines

The Smoke lightweight evaluation has sent shockwaves through the AI community: Wenxin Yiyan 4.5 saw its main leaderboard score soar by 24.7 points, yet its integrity rating fell from pass to fail; meanwhile, the Gemini series suffered three consecutive declines, and DeepSeek V4 Pro plummeted by 16.1 points on the main leaderboard.

GPT-5.5 ERNIE Bot Code Execution
376

2026 Mainstream AI Benchmark Side-by-Side Comparison: YZ Index vs SuperCLUE vs OpenCompass vs C-Eval

When companies set out to deploy large models, they often face the dilemma of which benchmark to trust. By early 2026, China's AI evaluation ecosystem had evolved into at least four distinct systems (YZ Index, SuperCLUE, OpenCompass, and C-Eval), each with its own methodology. Their rankings sometimes diverge, reflecting fundamentally different measurement approaches.

AI Evaluation YZ Index SuperCLUE
1,608