Claude Sonnet 4.6 Rises to the Top! 8 AI Models See 25-Point Plunge in Code Execution, Industry Shakeup Uncovered

In the Smoke Lite evaluation on May 14, 2026, the most striking finding was this: Claude Sonnet 4.6 surged to the top with a main score of 84.68, yet the code execution dimension of eight mainstream AI models, Sonnet included, collectively dropped by 25 points, triggering a dramatic reshuffle of the overall rankings. This is no coincidence; it is a hidden crisis signal beneath the AI industry's rapid iteration.

Claude Family Duo Leads the Pack, Secrets Behind Sonnet's Comeback

Claude Sonnet 4.6 scored 84.68 on the main leaderboard today (code execution 75, material constraint 96.5, integrity pass), slightly down from yesterday but enough to overtake its sibling Claude Opus 4.7 (79.86; execution 75, constraint 85.8, pass). Why did Sonnet stand out? The data shows its material constraint score reached 96.5, far above Opus's 85.8. This reflects Anthropic's heavier optimization for factual accuracy and knowledge boundaries during training: Sonnet makes near-zero errors on complex constrained tasks, avoiding the occasional logical looseness seen in Opus.

But don't overlook the anomaly: Sonnet's code execution also plummeted by 25 points, sliding from a potential perfect score yesterday to 75. This matches the leaderboard-wide trend, suggesting that today's 10 quick-test questions introduced more challenging programming tasks, such as real-time debugging or edge-case code generation. Factor in industry dynamics: Anthropic pushed a safety-focused fine-tune of Sonnet 4.6 just last week, and it clearly sacrificed some execution stability. My assessment: Sonnet's top ranking is less a display of overwhelming strength than a reflection of Opus's relative weakness on the constraint dimension. If Anthropic doesn't quickly balance the two, Sonnet's lead may be short-lived.

China-US Model Melee: GPT-5.5 Steady, Chinese Models Slump Collectively

GPT-5.5 ranked third with 76.94 (execution 75, constraint 79.3, pass). Despite the same 25-point drop in code execution, its main score slipped only slightly, showcasing OpenAI's depth in model robustness. Chinese models, in contrast, diverged: Qwen3 Max and Doubao Pro took fifth and sixth (76.13 and 73.88 respectively), but both saw main-score plunges of 11.7–12.9 points, driven chiefly by the 25-point execution hit. Wenxin Yiyan 4.5 fared worse, posting a main score of 73.05 (execution 69, constraint 78, integrity warn). The integrity warn is particularly glaring, indicating possible output inconsistency or blurred ethical boundaries during evaluation.

Comparing against yesterday: Gemini 2.5 Pro's main score plunged 16.9 points (execution −25, constraint −7), and DeepSeek V4 Pro dropped 14.4 points (execution −31, constraint +6). These drops are not random. The Gemini series (3.1 Pro also fell 12.9 points) may have been affected by Google's recent cloud-service adjustments, with API response delays amplifying execution errors. DeepSeek's 31-point execution drop is even more extreme; the raw evidence shows it completely stalled on a recursive-algorithm problem today and emitted invalid code. This fits the broader trend: as models expand toward multimodality, the purity of code execution is declining, and Chinese vendors like DeepSeek urgently need to catch up.

Key Data Insight: The core formula core_overall = 0.55 × execution + 0.45 × constraint magnifies the weight of the execution dimension. This drop directly lowered the leaderboard average by over 10 points, exposing model fragility under high-frequency updates.
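As a sanity check, the weighted formula above reproduces every main score reported in this article. A minimal sketch (model names and subscores are taken from this report; the comparison tolerance is an assumption to absorb two-decimal rounding):

```python
def core_overall(execution: float, constraint: float) -> float:
    """Leaderboard main score: execution weighted 0.55, constraint 0.45."""
    return 0.55 * execution + 0.45 * constraint

# (execution, constraint, reported main score) as listed in this report
reported = {
    "Claude Sonnet 4.6": (75, 96.5, 84.68),
    "Claude Opus 4.7":   (75, 85.8, 79.86),
    "GPT-5.5":           (75, 79.3, 76.94),
    "Wenxin Yiyan 4.5":  (69, 78.0, 73.05),
    "Grok 4":            (50, 48.8, 49.46),
}

for name, (execution, constraint, main) in reported.items():
    calc = core_overall(execution, constraint)
    # Published scores are rounded to two decimals, so allow 0.01 slack
    assert abs(calc - main) < 0.01, f"{name}: {calc} vs {main}"
```

The same weights also explain the day-over-day deltas quoted above: a 25-point execution drop alone costs 0.55 × 25 = 13.75 main-score points, which is why the leaderboard average fell by more than 10.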

Bottom Warning: Grok 4 Integrity Collapse, xAI Should Be Alarmed

Grok 4 ranked last with 49.46 (execution 50, constraint 48.8, integrity fail), a main-score plunge of 10.7 points. An integrity fail is no small matter: it means the model repeatedly produced misleading or inconsistent output during evaluation, well past the warn threshold. Factor in xAI's recent moves: the model was retrained on Twitter data just last month, which clearly introduced noise and dragged down the constraint dimension. Compared with DeepSeek's 69 (pass), Grok's failure looks like a strategic misstep: chasing "interesting" outputs at the expense of reliability.

  • Trend Insight: The China-US AI gap is narrowing. Chinese models like Qwen match GPT in constraints, but execution stability remains a bottleneck.
  • Root Cause of Anomaly: Today's evaluation questions shifted toward high-difficulty coding, potentially simulating real-world scenarios and amplifying model weaknesses.
  • Industry Commentary: This wave of drops reminds vendors that blind version iteration is a double-edged sword; stability (consistency, not just accuracy) is the real path forward.

Overall assessment: the Claude family's lead proves that a focus on constraints pays off, but the collective execution plunge signals that the AI industry is entering a "stability war" phase. Don't expect a short-term rebound. My prediction: without targeted patches next week, Gemini and DeepSeek may slide further while Claude consolidates its dominance. Remember the golden line: the real battlefield for AI is not scores, but resilience through iteration.


Data Source: 赢政指数 (YZ Index) | Run #116 | View Raw Data