In today's (2026-05-13) Smoke lightweight benchmark, the most eye-catching result is not Claude Opus holding first place with 89.43 points, but the collective collapse of Grok 4 and GPT-o3: the former plunged 25.2 points on the main leaderboard, with its execution dimension dropping from 100 to 50, while the latter fell 23.1 points, its execution likewise halved. This is no coincidence; it is the double-edged sword of rapid model iteration.
Claude Opus Leads, Anthropic's Execution King
First, the winner. Claude Opus tops the core main leaderboard with 89.43 points (execution 100 points, constraint 76.5 points), followed closely by Gemini 3.1 Pro and Claude Sonnet 4.6 with 88.98 and 88.89 points respectively. The core formula of the YZ Index is 0.55 × code execution + 0.45 × material constraint, making the execution dimension a decisive factor. Opus achieved a perfect score in code execution, proving Anthropic's deep expertise in programming task optimization. Compared to yesterday, Opus's constraint score edged up 0.5, indicating subtle iteration in material constraints (such as resource management and boundary condition handling).
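The weighting above can be sketched in a few lines. Only the 0.55/0.45 weights and Opus's dimension scores come from the leaderboard; the function name and structure are illustrative assumptions:

```python
# Minimal sketch of the YZ Index core formula: 0.55 x execution + 0.45 x constraint.
# The weights and Opus's scores are from the article; the function itself is hypothetical.

def yz_index(execution: float, constraint: float) -> float:
    """Combine the two dimensions into the main-leaderboard score."""
    return 0.55 * execution + 0.45 * constraint

# Claude Opus: execution 100, constraint 76.5
score = yz_index(100, 76.5)
print(round(score, 2))  # 89.43, matching Opus's leaderboard score
```

With execution weighted at 0.55, a 50-point execution drop alone costs 27.5 leaderboard points, which is why the halved models fall so hard.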
Why does Opus stay stable? From an industry perspective, Anthropic's recently released version 4.7 focuses on strengthening execution consistency, which aligns closely with our stability dimension (based on the standard deviation of scores: stability = max(0, 100 - stddev × 2)). Although today's data lacks explicit stability scores, Opus's perfect execution score hints at low volatility, unlike the models that collapsed overnight. Anthropic CEO Dario Amodei emphasized "reliability first" at last week's AI summit; this is no empty talk, and the data bears it out.
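The stability formula can likewise be sketched. The score series below are hypothetical, chosen only to contrast a steady model with one whose answers swing between full marks and failure:

```python
# Illustrative sketch of the stability dimension: max(0, 100 - stddev x 2),
# computed over per-question scores. The sample scores are hypothetical.
from statistics import pstdev

def stability(scores: list[float]) -> float:
    """Penalize score volatility; a spread of 50+ points zeroes the dimension."""
    return max(0.0, 100 - 2 * pstdev(scores))

steady = [100, 100, 95, 100, 100]   # low volatility -> near-perfect stability
swingy = [100, 0, 100, 0, 100]      # alternating pass/fail -> heavy penalty

print(round(stability(steady), 1))  # 96.0
print(round(stability(swingy), 1))  # 2.0
```

Note the asymmetry the article alludes to: a model that always scores 50 would have perfect stability, so a low stability score signals inconsistency, not low correctness.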
Grok 4 and GPT-o3 Both Plummet, the Wake-Up Call Behind Execution Halving
Anomaly signals point directly to Grok 4: its main leaderboard score dropped from yesterday's 85.33 points (estimated) to 60.13, its execution score fell from 100 to 50, and although its constraint score rose 5.2, that was a drop in the bucket. Similarly, GPT-o3 stands at 62.6 points on the main leaderboard, with execution at 50 and constraints up 9.8. Its integrity rating changed from warn to pass, which looks like a cleanup but masks the core problem.
Cause analysis: this is most likely driven by model updates at xAI and OpenAI. Grok 4, Elon Musk's pride and joy, was rumored last week to have integrated an upgraded real-time data module, but today's execution collapse suggests over-optimization: unstable behavior may have crept into code generation, causing half of the 10 quick-test questions (such as algorithm implementation and debugging) to fail in evaluation. The raw evidence: execution was 100 yesterday and 50 today, and the score spread suggests a stability score as low as around 30 (high volatility means poor consistency, not necessarily low correctness). This is a reminder that the stability dimension measures response consistency: a low score like 31.7 corresponds to a standard deviation of roughly 34 points across similar questions, and today's Grok 4 is a textbook example.
GPT-o3's plunge is even more intriguing. OpenAI launched the o3 version just this month, claiming enhanced multimodal capabilities, but the halved execution score exposes a weakness in fundamental coding ability. On the industry side, OpenAI is facing an EU data privacy investigation, which may have forced model parameter adjustments that indirectly hurt execution. Among the anomaly signals, ERNIE 4.5's integrity also changed from fail to warn while it slipped to 62.51 points (execution 50, constraint 77.8). This wave of collective execution halving is not an isolated case; it may stem from an update to the evaluation question bank, since Smoke tests 10 questions daily covering execution and constraints, and today's set may have raised the difficulty of the dynamic programming problems.
Gemini Series Makes a Comeback, the Lesson of Integrity Recovery
The highlight is Gemini: 3.1 Pro rose 15 points on the main leaderboard to 88.98 (constraint up 9.5), and 2.5 Pro rose 13.5 points to 87.54 (constraint up 9). The integrity change from fail to pass is the key turning point. DeepSeek V4 Pro also rose 9.3 points, with integrity returning to positive. Why? Google recently fixed Gemini's filtering mechanism; the earlier fail was likely caused by over-censorship that refused to answer constraint questions, and now that it passes, the constraint score jumps significantly. This shows that integrity is an entry threshold rather than a bonus item: a fail directly drags down the overall score.
This recovery highlights a trend: AI vendors are shifting from "safety first" to "balanced output." Although the anomaly signal labeled Gemini "Downgraded to Fail (fail→pass)," the data clearly shows a positive change, so this is likely a labeling error. By comparison, Qwen3 Max and Doubao Pro remain stable above 87 with perfect execution scores, but constraint scores of 70-73 expose the Chinese models' weakness in material boundary handling, perhaps limited by training-data diversity.
Trend Insight: Iteration Risk and Stability Pain Points
Overall, most of today's top 8 models score 100 on execution while the bottom 3 plummet to 50, highlighting the update trap in the AI industry: chasing new features often sacrifices stability. Engineering judgment (side rank, AI-assisted evaluation) shows the Claude series expressing tasks more precisely, while Grok's communication dimension may be weakened by its volatility. Based on current dynamics, expect OpenAI to patch GPT-o3 next week, while xAI needs to reflect on Grok's breakneck growth.
Closing quote: a crash in AI models is not an apocalypse but the growing pains of iteration; whoever stabilizes execution first will have the last laugh.
Data source: YZ Index | Run #114
© 2026 Winzheng.com 赢政天下 | Reprints must credit the source and link to the original article