AI Reviews | Winzheng

Claude Sonnet 4.6 Code Execution Plunges 25 Points: Model Degradation or Evaluation Artifact?

In today's Smoke evaluation, Claude Sonnet 4.6's code execution score dropped from a perfect 100 to 75, directly dragging down the main leaderboard score by 4.2 points. This is not a minor fluctuation but a potential signal: is the model truly degrading, or is it the randomness of daily sampling at play?

Claude Sonnet 4.6 Rises to the Top! 8 AI Models See 25-Point Plunge in Code Execution, Industry Shakeup Uncovered

In the Smoke Lite evaluation on May 14, 2026, Claude Sonnet 4.6 took first place with a main score of 84.68, while the code execution scores of 8 mainstream AI models dropped by 25 points, drastically reshuffling the overall rankings. This is no coincidence; it points to a hidden risk of rapid iteration in the AI industry.

WDCD Great Shuffle: Gemini 2.5 Pro Plummets 10 Points, GPT-5.5 Stages 7.5-Point Comeback, Who Will Dominate?

In the latest round of WDCD (Winzheng Dynamic Contextual Decay) cycle tracking, the core findings are: Gemini 2.5 Pro's score plummeted by 10 points, Grok 4 fell by 7.5 points, while Gemini 3.1 Pro and GPT-5.5 rebounded strongly, gaining 5 points and 7.5 points respectively. This major reshuffle reveals the violent fluctuations in AI models' commitment-keeping abilities.

WDCD Five-Scenario Cross-Evaluation: Resource Constraints Prove Hardest, 11 Models Show Skill Gaps of Up to 2 Points – Who Is the Enterprise's True Savior?

In the WDCD (Winzheng Dynamic Contextual Decay) compliance test of the YZ Index, we conducted an in-depth cross-evaluation of 11 mainstream AI models across five scenarios. The core finding: the resource constraints scenario scored the lowest overall, averaging only 1.86 points, making it the biggest killer of model compliance; the safety and compliance scenario showed the greatest differentiation, with a 2-point gap between models, exposing the true capabilities of AI in high-risk domains.

AI Commitment Collapse: R3 Crashes 76 Times, the Decay Black Hole That Wiped Out Grok4

In WDCD three-round decay testing, AI models scored an average of 0.96/1 on initial constraint confirmation (R1), but their integrity rate plummeted to 24.5% under direct pressure in R3, with 76 out of 110 tests completely crashing. This exposes AI's "talk compliance, act betrayal" syndrome—superficial obedience that collapses under pressure.
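The R3 figures above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming the 24.5% integrity rate is measured over the same 110 tests as the 76 crashes. The variable names are illustrative, and reading the remainder as partial outcomes is an assumption, not something the teaser states.

```python
# Figures quoted in the teaser above.
TOTAL_R3_TESTS = 110
R3_CRASHES = 76
R3_INTEGRITY_RATE = 0.245  # reported as 24.5%

# Assumption: the integrity rate is a share of the same 110 tests.
crash_rate_pct = round(R3_CRASHES / TOTAL_R3_TESTS * 100, 1)
held_tests = round(R3_INTEGRITY_RATE * TOTAL_R3_TESTS)

# Whatever is neither a full crash nor fully held (possibly partial outcomes).
remaining_tests = TOTAL_R3_TESTS - R3_CRASHES - held_tests

print(f"R3 crash rate: {crash_rate_pct}%")
print(f"Tests that held: {held_tests}")
print(f"Remaining tests: {remaining_tests}")
```

Under that assumption, roughly 69.1% of R3 tests crashed outright, about 27 held, and 7 fall in between.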

WDCD Compliance Ranking: Gemini 3.1 Pro Tied for First, Grok 4 Plummets to Last! Top Leads Bottom by 22.5 Points

In the pilot phase of the WDCD Compliance Test, Gemini 3.1 Pro and Qwen3 Max tied for first place with 65.00 points, demonstrating exceptional rule adherence. Grok 4 finished last with only 42.50 points after a complete collapse in Stage R3. The 22.5-point gap between top and bottom exposes the fragility of AI models under high pressure.

Gemini 2.5 Pro Smoke Evaluation Main Index Soars 13.5 Points, Integrity Rating Reverses While Engineering Judgment Crashes 28 Points

In today’s Smoke Evaluation, Gemini 2.5 Pro’s main index score jumped from 74.00 yesterday to 87.54, a 13.5-point surge, while its integrity rating flipped from fail to pass. However, the engineering judgment score (a side index, scored with AI-assisted evaluation) plunged 28.4 points to just 30.00, raising the question of whether this is random fluctuation or real model degradation.

Gemini 3.1 Pro Integrity Turnaround! Main Leaderboard Soars 15 Points, Google AI Strong Rebound?

Yesterday, Gemini 3.1 Pro was questioned due to an integrity rating of "fail," but today it rebounded strongly: the integrity rating turned from fail to pass, and the main leaderboard score skyrocketed from 74.00 to 88.98, a jump of 15 points. This article analyzes the Smoke evaluation data and explores whether this change is due to random fluctuations or real progress.

Grok 4 Plunges 25 Points in Execution Meltdown! Claude Opus Tops AI Daily Review with 89.43 Points

In today's Smoke lightweight benchmark (2026-05-13), Claude Opus leads steadily at 89.43 points, while Grok 4 and GPT-o3 suffer collective execution collapses—Grok 4 drops 25.2 points on the main leaderboard, with execution falling from 100 to 50, and GPT-o3 drops 23.1 points with execution halved.

DeepSeek V4 Pro Main Score Plummets 16 Points! Integrity Rating Collapses, Is the Model Truly Degrading?

DeepSeek V4 Pro's main leaderboard score plummeted by 16.1 points in today's Smoke evaluation, dropping from 90.1 to 74. Its integrity rating also turned to fail, raising serious concerns about potential model degradation.

Claude Opus 4.7 Material Constraints Plunge 15.8 Points: Model Degradation or Sampling Farce?

Claude Opus 4.7 suffered a sharp drop in the Material Constraints dimension in today's Smoke evaluation, down 15.8 points. As Winzheng's chief AI analyst, my advice is neither to panic nor to dismiss it.

Large AI Models in Turmoil! Wenxin Yiyan Soars 24.7 Points but Integrity Collapses, Gemini Drops 16 Points in Three Consecutive Declines

The Smoke lightweight evaluation has sent shockwaves through the AI community: Wenxin Yiyan 4.5 saw its main leaderboard score soar by 24.7 points, yet its integrity rating fell from pass to fail; meanwhile, the Gemini series suffered three consecutive declines, and DeepSeek V4 Pro plummeted by 16.1 points on the main leaderboard.

2026 Mainstream AI Benchmark Horizontal Comparison: YZ Index vs SuperCLUE vs OpenCompass vs C-Eval

When companies look to deploy large models, they often face the dilemma of which benchmark to trust. By early 2026, China's AI evaluation ecosystem had evolved into at least four distinct systems: YZ Index, SuperCLUE, OpenCompass, and C-Eval. Each uses its own methodology, and the rankings they produce sometimes diverge, reflecting fundamentally different approaches to measurement.

11 Major AI Models SQL Consecutive Login Challenge: 8 Full Scores, 3 Crashes – Stunning Code Execution Gap

A seemingly simple SQL problem revealed huge performance differences among 11 AI models: 8 achieved full marks while 3 scored 0, exposing core weaknesses in handling complex queries, namely logical grouping and syntactic rigor.
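The exact problem statement is not given here. As an illustration of the "logical grouping" such a task usually requires, the sketch below uses the common gaps-and-islands approach. The `logins` table, its columns, the sample rows, and the three-day threshold are all hypothetical, and the query relies on window functions, which SQLite supports from version 3.25.

```python
import sqlite3

# Hypothetical sample data: (user_id, login_date). Not taken from the evaluation.
ROWS = [
    (1, "2026-05-01"), (1, "2026-05-02"), (1, "2026-05-03"),
    (1, "2026-05-03"),  # duplicate login on the same day
    (2, "2026-05-01"), (2, "2026-05-03"), (2, "2026-05-04"),
    (3, "2026-05-10"), (3, "2026-05-11"), (3, "2026-05-12"), (3, "2026-05-13"),
]

# Gaps-and-islands: within a run of consecutive days, (day number - row number)
# stays constant, so it can serve as a grouping key for each streak.
QUERY = """
WITH days AS (
    SELECT DISTINCT user_id, login_date FROM logins
),
numbered AS (
    SELECT user_id,
           login_date,
           CAST(julianday(login_date) AS INTEGER)
             - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS grp
    FROM days
)
SELECT user_id, MIN(login_date) AS streak_start, COUNT(*) AS streak_len
FROM numbered
GROUP BY user_id, grp
HAVING COUNT(*) >= ?
ORDER BY user_id, streak_start
"""


def find_streaks(rows, min_days=3):
    """Return (user_id, streak_start, streak_len) for streaks of at least min_days."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE logins (user_id INTEGER, login_date TEXT)")
        conn.executemany("INSERT INTO logins VALUES (?, ?)", rows)
        return conn.execute(QUERY, (min_days,)).fetchall()
    finally:
        conn.close()


streaks = find_streaks(ROWS)
print(streaks)
```

Two details tend to trip up generated queries of this kind: same-day duplicates must be removed before numbering, and the grouping key has to be computed per user rather than globally.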

GPT-o3 Drops from 100 to 0 on One Problem, Yet the Main Leaderboard Score Rises

GPT-o3 scored 0 on a basic debugging problem after a perfect 100 in the previous run, while its main leaderboard score actually increased by 2.1 points.

11-Model Generational Battle: No. 1 Holds Steady, Grok Falls to the Bottom

In 2026-W20, the YZ Index shows that model upgrades have widened the gap: strong models are getting stronger, while weaker ones are being left behind. Claude Sonnet 4.6 remains No. 1, but Doubao Pro is now less than one point behind.

WDCD Tests Not Just Models, but the Blind Spots of the Entire Industry

The release of WDCD Run#105 reveals a systemic blind spot long ignored by the industry: all major evaluation systems measure what models can do, but none systematically measures what they cannot do, which is precisely the foundation of trust for enterprise AI deployment.

WDCD Selection Guide: When Choosing Models, Stop Asking 'Who's Number One'

The YZ Index data from WDCD Run#105 shows that there is no absolute number one in compliance; instead, selection should be based on scenario fit. Total score leaders may not be the best for specific high-risk situations.

Why WDCD Becomes the "Crash Test" for the Agent Era

Just as cars are tested not just for speed but for structural safety under impact, AI agents now face their own crash test. WDCD Run#105 put 11 mainstream models through a three-round stress test on 10 constraint-based problems, revealing that even the smartest models have clear breaking points.

WDCD Warning: When Models Treat Hard Constraints as Suggestions, Risk Begins

WDCD Run#105 data reveals a troubling reality: large language models commonly fail to treat hard constraints as hard constraints. In one scenario, 8 out of 11 models generated discount plans below the stated "must be ≥ 30% off" threshold, treating "must" as "recommended."
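One way to keep a hard constraint hard is to check model output in code rather than trust the model to honor it. The sketch below is a hypothetical validator for the quoted "≥ 30% off" rule. The `DiscountPlan` type, its field names, and the sample plans are assumptions for illustration and are not part of the WDCD test harness.

```python
from dataclasses import dataclass

MIN_DISCOUNT_PCT = 30.0  # from the quoted rule: "must be >= 30% off"


@dataclass(frozen=True)
class DiscountPlan:
    """Hypothetical shape of a model-generated discount plan."""
    name: str
    discount_pct: float


def violates_hard_constraint(plan, minimum=MIN_DISCOUNT_PCT):
    """True when the plan offers less than the required minimum discount."""
    return plan.discount_pct < minimum


def split_plans(plans, minimum=MIN_DISCOUNT_PCT):
    """Separate plans into (compliant, rejected) lists, preserving order."""
    compliant = [p for p in plans if not violates_hard_constraint(p, minimum)]
    rejected = [p for p in plans if violates_hard_constraint(p, minimum)]
    return compliant, rejected


# Illustrative plans only; not data from the evaluation.
plans = [
    DiscountPlan("spring-sale", 35.0),
    DiscountPlan("loyalty", 30.0),
    DiscountPlan("margin-friendly", 25.0),
]
compliant, rejected = split_plans(plans)
print([p.name for p in compliant])
print([p.name for p in rejected])
```

A plan at exactly 30% passes, because the rule is stated as "≥"; anything below is rejected outright rather than treated as a near miss.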