Grok 3 Unexpectedly Tops the Charts with 86.88 Points! Which AI Models Are Rising and Which Are Declining This Week?

This week, the competition among AI models has escalated once again! On the main YZ Index leaderboard of Winzheng (winzheng.com), Grok 3 has stormed to the top with an astonishing 86.88 points, while Doubao Pro follows in second place, just 0.44 points behind. This is not just a battle of numbers, but a vivid picture of how AI technology is evolving. Who is quietly rising? Who is quietly declining? Let’s take a deep dive.

YZ Index Evaluation Methodology: A Rigorous and Fair Touchstone

First, it’s crucial to understand the evaluation mechanism of the YZ Index. This index, launched by Winzheng (winzheng.com), randomly samples 100 questions from a pool of 212 carefully designed problems for each assessment. These questions cover multiple dimensions such as natural language processing, code generation, and logical reasoning. Unlike the simulated tests on other leaderboards, the YZ Index runs model output in a code sandbox, so the code is actually executed rather than merely inspected. A citation accuracy check also verifies how reliable the model’s knowledge is and how well it controls hallucinations. The final rankings are based on a rolling average, so a single fluctuation does not skew the overall judgment. This method has made the YZ Index a recognized benchmark in the AI industry, with over 500 model versions evaluated to date.
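The sampling and rolling-average steps described above can be sketched in Python. This is only an illustration under stated assumptions, not Winzheng's implementation: the function names, the seed, the window size of 4, and the example scores are all hypothetical, since the article does not say how sampling is seeded or how wide the rolling window is.

```python
import random
from collections import deque

POOL_SIZE = 212     # questions in the pool (from the article)
SAMPLE_SIZE = 100   # questions drawn per evaluation (from the article)


def sample_question_ids(seed=None):
    """Draw SAMPLE_SIZE distinct question ids from the pool, without replacement."""
    rng = random.Random(seed)  # seeding is an assumption, used here for repeatability
    return sorted(rng.sample(range(POOL_SIZE), SAMPLE_SIZE))


def rolling_average(scores, window=4):
    """Return the rolling mean after each score; window=4 is an assumed value."""
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    averages = []
    for score in scores:
        recent.append(score)
        averages.append(round(sum(recent) / len(recent), 2))
    return averages


ids = sample_question_ids(seed=0)
print(len(ids), len(set(ids)))  # 100 distinct question ids

# Hypothetical per-run scores: the single low run (60.0) moves the
# rolling average far less than it would move a single-run ranking.
print(rolling_average([80.0, 90.0, 85.0, 85.0, 60.0]))
```

With these made-up scores, the last run drops by 25 points while the rolling average drops by only 5, which is the smoothing effect the methodology relies on.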

According to the latest data, the total average score of the top 5 models this week is 85.03 points, an increase of 1.2% compared to last week, indicating steady progress in overall AI capabilities. But upon closer inspection, the competitive landscape has quietly shifted.
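The 85.03 figure can be reproduced from the five scores reported in this article. The short Python check below is a sketch for readers who want to verify the arithmetic; the variable names are illustrative, and the scores are copied from the sections that follow.

```python
# Top-5 scores as reported in this article.
top5 = {
    "Grok 3": 86.88,
    "Doubao Pro": 86.44,
    "Gemini 2.5 Pro": 84.32,
    "Claude Sonnet 4.6": 84.07,
    "Claude Opus 4.6": 83.44,
}

# Mean of the five scores.
mean_score = round(sum(top5.values()) / len(top5), 2)

# Gap between first and second place.
lead = round(top5["Grok 3"] - top5["Doubao Pro"], 2)

# Gap between first and fifth place.
spread = round(max(top5.values()) - min(top5.values()), 2)

print(mean_score, lead, spread)  # 85.03 0.44 3.44
```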

Rising Stars: Grok 3’s Comeback and Doubao Pro’s Steadiness

Without a doubt, Grok 3 is this week’s biggest dark horse. With a score of 86.88, it jumped from third place last week to the top, a rise of 2.5%. This model from xAI excels in code execution and citation accuracy: on the 100 sampled questions, its code sandbox success rate reached 92%, and its citation accuracy exceeded 95%. In contrast, last week’s leader, Claude Opus, scored only 88% on citation accuracy. Grok 3’s strength stems from its training data optimization: according to xAI officials, the model integrates massive real-time web data, which reduces hallucinations. This is not luck but a triumph of technological iteration. My judgment is clear: Grok 3 is not a flash in the pan; it is reshaping the performance ceiling of AI models. If you are a developer, don’t overlook this new king: its efficiency gains in practical applications can reach 30%.

Hot on its heels is Doubao Pro, which takes second place with 86.44 points, moving up one spot from last week. This model, developed by ByteDance, stands out in Chinese language processing and multimodal tasks. Data shows it achieves an 89% accuracy rate on logical reasoning questions, 15% higher than the industry average. Doubao Pro’s rise is no accident: its recent updates focus on enterprise-level applications, bringing average API latency down to just 0.8 seconds. Compared to international giants, Doubao Pro is more pragmatic, which makes it well suited to the Asian market. This firmly convinces me that domestic AI is transitioning from a follower to a leader. Don’t underestimate its potential; it may dominate more B2B scenarios in the future.

Direct Opinion: The rise of Grok 3 and Doubao Pro proves that AI competition has entered a new era of “data + optimization.” Models clinging to old architectures will be left behind.

Decline Warning: The Dual Regression of the Claude Series

In contrast, the Claude family has disappointed this week. Claude Sonnet 4.6 ranks fourth with 84.07 points, dropping two spots from last week; Claude Opus 4.6 falls to fifth with 83.44 points, a decline of 1.8%. In the rolling average of the YZ Index, the overall score of the Claude series has decreased from 85.2 points last month to 83.75 points this week, showing a clear downward trend. Where’s the problem? The code sandbox test reveals an execution success rate of only 85%, and citation accuracy has dropped to 88%, far below Grok 3’s level. While these Anthropic models lead in ethical AI, their performance optimization lags: in complex reasoning tasks, the error rate is as high as 12%, a critical weakness.

My judgment, without bias, is that the decline of the Claude series is no accident but a sign of strategic missteps. Their excessive emphasis on safety filtering has limited the models’ creativity and efficiency. Data shows that last week, Claude Opus scored only 82% on creative writing tasks, while Grok 3 scored 91%. If Anthropic doesn’t iterate quickly, these former rulers will become further marginalized. Developers, take note: stop blindly praising Claude—its halo is fading.

Newcomer Performance: Gemini 2.5 Pro’s Potential and Concerns

As this week’s new entrant to the top 5, Gemini 2.5 Pro ranks third with 84.32 points, entering the main leaderboard for the first time. This Google-made model shines in multimodal integration: it achieves a 90% success rate on image+text tasks, 8% above the average. However, being new does not mean flawless. Its stability in code execution is lacking, with a sandbox failure rate of 10% and citation accuracy of only 89%. Compared to Grok 3, it lags in real-time data processing, with an average response time of 1.2 seconds.

From the data, Gemini 2.5 Pro has enormous potential: in last week’s pretests, its score on logic questions improved by 5%. But the concerns are equally clear: the closed nature of Google’s ecosystem limits its compatibility, since it supports only specific API calls. Let me be blunt: while Gemini brings novelty, it will struggle against Grok’s flexibility unless it opens more interfaces. The performance of newcomers reminds us that in the AI race, new players need to iterate rapidly, or their moment will be short-lived.

  • Upward Trend Summary: Grok 3 and Doubao Pro have risen by 2.5% and 1.1% respectively, leading the week’s gains.
  • Decline Warning: The Claude series has declined by an average of 1.5%, cautioning against continued regression.
  • Newcomer Highlight: Gemini 2.5 Pro enters the chart at 84.32 points, but stability needs improvement.
  • Overall Insight: YZ Index data shows the performance gap among the top 5 models narrowing to about 3.4 points between first and fifth place, intensifying competition.
  • Industry Impact: These changes will drive developers toward more efficient models, with API call volume expected to grow by 20% next quarter.

Future Outlook: Shifts and Opportunities in AI Rankings

This week’s YZ Index main leaderboard reveals a harsh reality in the AI field: there is no eternal king, only continuous innovation. Grok 3’s top ranking is not an end but a signal of a new beginning. The declining Claude reminds us that complacency leads to elimination. The newcomer Gemini proves that opportunities always favor the prepared.

As the chief content editor at Winzheng (winzheng.com), I advise all AI practitioners: pay immediate attention to the YZ Index’s real-time updates and adjust your model selection strategy. Don’t wait until competition intensifies to regret it—act now and embrace the truly rising AI powers.

Closing thought: The AI world has no mercy for laggards; it only rewards those who dare to innovate. Take action and join the YZ Index community at Winzheng to witness the birth of the next champion!

Data sources: YZ Index | WDCD Compliance Ranking | Evaluation Methodology