11 AI Models Surge 40 Points in Programming Tests: What Really Happened?

Mar 22, 2026 - Winzheng Index

DeepSeek, GPT-o3, programming capability tests, model evaluation anomalies, AI technology reshuffle

If the programming scores of 11 AI models all jump by roughly 40 points in a single week, what is your first reaction? Exactly: the testing standards changed. But behind that change lie signals that deserve a closer look.

Three Key Signals Behind the Anomalous Data

This week's evaluation data can only be described as "absurd": DeepSeek R1's programming capability skyrocketed 47.4 points, Doubao Pro and Grok 3 simultaneously rose 42.4 points, and even the typically steady Claude Opus 4.6 soared 42 points. More bizarrely, all models' gains were concentrated in the 29-47 point range, as if controlled by an invisible hand.

What truly deserves attention, though, is not the obvious test adjustment but three signals it obscures:

Signal One: Chinese Models Lead Comprehensively for the First Time

Even accounting for testing factors, three of this week's top four models are from China: Doubao Pro (67.0 points), DeepSeek V3 (66.6 points), and ERNIE Bot 4.0 (64.2 points). This is the first time since I began tracking AI model evaluations that Chinese models have crowded the top of the comprehensive rankings like this.

Particularly noteworthy is DeepSeek R1 reaching 67.9 points in the programming dimension, becoming this week's strongest programming model, even surpassing the programming-focused Grok 3 (64.9 points).

Signal Two: OpenAI's Cliff-like Decline

GPT-o3 showed the only negative growth this week: its long context capability plummeted 33.5 points, dropping from 62.3 to 28.8 points. More concerning, GPT-4o and GPT-o3 ranked at the bottom overall with 39.2 and 34.5 points respectively, the first time OpenAI models have comprehensively lagged in mainstream evaluations.

GPT-o3's 28.8 points in long text processing is roughly a third of the 83.0 points scored by top-ranked Grok 3. This gap can no longer be explained by "different strengths."
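The arithmetic behind these two claims can be checked in a few lines. This is a minimal sketch that uses only the figures quoted above; the variable names are illustrative and not part of the index's own tooling.

```python
# Long-context figures as quoted in the article.
gpt_o3_before = 62.3
gpt_o3_after = 28.8
grok3_long_text = 83.0

# Absolute drop for GPT-o3, rounded to one decimal to avoid float noise.
drop = round(gpt_o3_before - gpt_o3_after, 1)

# GPT-o3's score as a fraction of the top long-text score.
ratio = round(gpt_o3_after / grok3_long_text, 2)

print(f"drop: {drop} points")
print(f"GPT-o3 as a share of Grok 3: {ratio:.0%}")
```

The drop comes out at 33.5 points, and GPT-o3's score is about 35% of Grok 3's.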

Signal Three: Long Text Becomes the New Battlefield

Analyzing each model's dimensional scores reveals an interesting phenomenon: long text processing capability is becoming the key metric distinguishing model quality. The top six models all score above 77 points in long text, with Grok 3 reaching 83.0 points and Qwen Max following closely at 80.6 points.

The logic behind this trend is clear: with the proliferation of RAG (Retrieval-Augmented Generation) technology, models' ability to handle long documents and conversations is becoming increasingly important. Whoever can process longer contexts while maintaining comprehension accuracy will gain advantages in practical applications.

Industry Trends Revealed by Testing Standard Changes

While this week's programming tests clearly underwent adjustments (possibly easier questions or relaxed scoring criteria), the adjustment itself reveals important information: the industry is redefining what constitutes "good programming ability."

The differential gains across models show that the DeepSeek series (R1 up 47.4 points, V3 up 42.6 points) improved most significantly, while GPT-4o only rose 29.2 points. This differentiated improvement suggests new testing standards may favor advanced capabilities like code understanding, debugging, and refactoring, not just simple code generation.
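One way to separate the across-the-board shift from the model-specific part is to subtract a common baseline from each gain. The sketch below uses the median of the quoted gains as that baseline, which is an assumption made here for illustration and not the index's method. Only six of the eleven gains are quoted in this article, so the result is indicative at best.

```python
from statistics import median

# Programming-score gains quoted in the article (6 of the 11 models).
gains = {
    "DeepSeek R1": 47.4,
    "DeepSeek V3": 42.6,
    "Doubao Pro": 42.4,
    "Grok 3": 42.4,
    "Claude Opus 4.6": 42.0,
    "GPT-4o": 29.2,
}

# Assumption: treat the median gain as the shift caused by the test change.
common_shift = median(gains.values())

# What remains after removing the common shift is the differential gain.
residuals = {name: round(g - common_shift, 1) for name, g in gains.items()}

for name, r in sorted(residuals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {r:+.1f}")
```

Under that assumption the common shift is 42.4 points. DeepSeek R1 sits 5.0 points above it and GPT-4o 13.2 points below, while the other four quoted models land within half a point of the baseline.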

Three Trends Worth Watching

First, universal weakness in the knowledge dimension. Even top-ranked Doubao Pro only scored 49.6 points in knowledge, with no model breaking 50 points. This shows that while pursuing long text and programming capabilities, basic knowledge accuracy is being neglected.

Second, frequent changes in evaluation standards. Such massive collective score increases within a week reflect the immaturity of current AI evaluation systems. This brings enormous uncertainty to model selection.

Third, increasing polarization of comprehensive capabilities. The gap between top models (above 60 points) and bottom models (below 40 points) is widening, with the middle ground shrinking. This suggests the AI model market may see a "winner-takes-all" scenario.
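The banding described here can be sketched as a simple bucketing function. The thresholds (60 and 40) come from the text; the function name `band` is hypothetical. Only five of the eleven comprehensive scores are quoted in this article, so the counts below describe that partial sample and an empty middle band here does not show that the middle is empty in the full leaderboard.

```python
from collections import Counter

# Comprehensive scores quoted in the article (5 of the 11 models).
scores = {
    "Doubao Pro": 67.0,
    "DeepSeek V3": 66.6,
    "ERNIE Bot 4.0": 64.2,
    "GPT-4o": 39.2,
    "GPT-o3": 34.5,
}

def band(score, top=60.0, bottom=40.0):
    """Assign a score to the top, middle, or bottom band."""
    if score > top:
        return "top"
    if score < bottom:
        return "bottom"
    return "middle"

counts = Counter(band(s) for s in scores.values())
print(dict(counts))
```

Running the same function over the full run data, week by week, would show whether the middle band is actually shrinking.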

A bold prediction: before the end of this year, we'll see the first "super model" breaking 80 points across all dimensions, and it will likely come from China.

Data source: YZ Index | Run #37
