Behind GPT-o3's 8.7-Point Surge: Weekly Testing of 11 AI Models Reveals 3 Dangerous Signals

100 test questions, 11 top-tier models, and a set of results this week that genuinely surprised me. The surprise was not the ranking changes but three dangerous signals in the data.

Signal One: Stability Has Become a Luxury

GPT-o3's stability score jumped 8.7 points this week, a striking figure. Keep in mind that its overall score is only 68 points, last among the 11 models. What does it mean when a bottom-ranked model improves its stability this much?

It suggests OpenAI has finally realized that users would rather have a stable 60-point model than an erratic product that swings between 90 and 40 points.
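To make the "stable 60 versus swinging between 90 and 40" comparison concrete, here is a minimal sketch. It assumes stability can be approximated by the spread of scores across repeated runs; that is my own assumption, not the YZ Index's scoring method, and the per-run scores below are hypothetical.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Return (mean, population standard deviation) of repeated-run scores."""
    return mean(scores), pstdev(scores)


# Hypothetical per-run scores for illustration only (not YZ Index data).
steady = [60, 61, 59, 60, 60]    # a "stable 60-point" model
erratic = [90, 40, 85, 45, 65]   # a model that swings between 90 and 40

steady_mean, steady_spread = summarize(steady)
erratic_mean, erratic_spread = summarize(erratic)

print(f"steady:  mean={steady_mean:.1f}, spread={steady_spread:.2f}")
print(f"erratic: mean={erratic_mean:.1f}, spread={erratic_spread:.2f}")
```

In this toy data the erratic model has the higher average but a far larger spread, which is the trade-off in question: a user cannot predict which version of the model will answer.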

More ironically, Claude Opus 4.6's stability fell 7.6 points over the same period. For a model that was previously among the most stable, that is a hard fall. Reviewing recent user feedback, I found a common pattern: since mid-March, Claude has frequently responded with "Sorry, I cannot complete this request," even on simple code-debugging tasks.

文心一言 (ERNIE Bot) 4.0's stability also dropped by 3.7 points. Baidu has been iterating on its underlying architecture at a frantic pace lately and appears to have overextended itself.

Signal Two: Collective Regression in Long-Context Capabilities

This week's strangest result: four models lost ground on long-context processing at the same time. The largest moves were:

Claude Sonnet 4.6: -5 points
DeepSeek V3: -4 points
GPT-4o: +5.5 points (the only one bucking the trend)
Other models: basically flat or slight declines

This is no coincidence. The test data points to inputs longer than 32K tokens as the main problem. Past that length, accuracy drops sharply, especially on tasks that require reasoning across paragraphs.

This exposes a ceiling of the current Transformer architecture: the compute and memory cost of standard self-attention grows quadratically with sequence length, forcing companies to trade off hardware cost against quality.
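A quick back-of-the-envelope sketch shows what quadratic growth means in practice. It assumes full (dense) self-attention, counts only the entries of the score matrix for a single head, and treats 32K as 32,768 tokens and 128K as 131,072 tokens. The function names are mine, and sparse or linear-attention variants are deliberately left out.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in the seq_len x seq_len score matrix of one dense attention head."""
    return seq_len * seq_len


def relative_cost(seq_len: int, baseline: int) -> float:
    """Score-matrix size at seq_len as a multiple of the size at baseline."""
    return attention_score_entries(seq_len) / attention_score_entries(baseline)


BASELINE = 32_768  # assumed token count for "32K"

for n in (8_192, 32_768, 131_072):
    entries = attention_score_entries(n)
    print(f"{n:>7} tokens: {entries:>14,} entries, {relative_cost(n, BASELINE):g}x baseline")
```

Under these assumptions, quadrupling the context from 32K to 128K multiplies the score-matrix work by 16, not 4.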

Interestingly, GPT-4o was the exception, gaining 5.5 points. According to unconfirmed insider information, OpenAI has recently been testing a new sparse attention mechanism, which may be showing early results. Whether the improvement lasts remains to be seen.

Signal Three: Chinese Models Are Changing the Rules of the Game

豆包 (Doubao) Pro gained 7.9 points in the knowledge work dimension this week and firmly holds first place overall (83.7 points). What does this achievement mean?

First, ByteDance's compute advantage is beginning to show. While other companies are still queuing for H100s, ByteDance has already begun large-scale deployment of its self-developed training clusters. More importantly, 豆包 has a natural advantage in Chinese corpus accumulation: the high-quality Chinese content generated daily on Douyin is out of reach for other models.

Although DeepSeek's long-context capability dropped by 4 points, it still firmly holds second place (80.8 points). The R1 version's stability improved by 1.3 points, indicating this low-profile company is steadily optimizing its product.

In contrast, Qwen Max's 4-point decline in programming capability stands out. Alibaba has recently focused on the open-source version of Qwen 2.5 and appears to be spreading itself too thin.

The Overlooked Grok 3

Grok 3 might be this week's most underestimated contender. Its modest 1.8-point improvement in knowledge work seems unremarkable, but look closely at its sub-scores: programming 89.3 points (second highest), long-context 87.0 points (third highest), knowledge work 78.7 points.

This is an all-rounder with no real weaknesses. Musk's Twitter data advantage is beginning to pay off, particularly in real-time information processing and multimodal understanding.

Three Predictions

Based on this week's data, I'll make three bold predictions:

1. Within the next two months, at least 3 companies will announce they're abandoning the arms race for ultra-long context (128K+), instead optimizing for quality within 32K. The reason is simple: the ROI doesn't make sense.

2. Stability will become the core competitive advantage in the next phase. GPT-o3's comeback path will be emulated by more companies—ensure stability first, then pursue peak performance.

3. Chinese models will comprehensively surpass GPT-4o before June. 豆包 Pro has already proven this path is viable, and both DeepSeek and 文心一言 are building momentum.

AI model competition is shifting from "who has the highest peak" to "who has the most stable baseline." In this marathon, endurance matters more than explosive power.


Data source: YZ Index | Run #33