Weekly AI Model Test: GPT-4o Plummets 10 Points in Material Constraints, Domestic Wenxin Bucks the Trend

This week's YZ Index evaluation took a dramatic turn: former champion GPT-4o plummeted 10.3 points in the Material Constraints dimension, dragging its main board composite score down to 64.32, last among the 11 participating models. In stark contrast, Wenxin Yiyan 4.0 was the only model this week to post positive growth in a core main board dimension.

GPT-4o: A Warning from the Fallen King

GPT-4o's Material Constraints score fell from 59.6 to 49.3, one of the largest single-week drops since YZ Index began recording. The Material Constraints dimension tests a model's ability to answer accurately within the boundaries of the material it is given, so a collapse of this size suggests that GPT-4o has become markedly worse at tasks with clearly defined boundaries.

More concerning, GPT-4o's Code Execution score (76.6) ranks second to last among the 11 models, just below Qwen Max's 77.3. Under YZ Index's weighting formula (Main Board Score = 0.55×Code Execution + 0.45×Material Constraints), GPT-4o's 64.32 points now trail first-place 豆包 Pro's 85.03 by 20.71 points.

Data Comparison: GPT-4o's current Material Constraints 49.3 vs 豆包 Pro's 77.6, a gap of 28.3 points
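
The composite and gap figures above can be checked directly against the stated weighting formula. The following is a minimal sketch, not YZ Index's own tooling: the function name is hypothetical, and rounding half-up to two decimals is an assumption, since the article does not say how scores are rounded.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the formula quoted in the article.
W_CODE = Decimal("0.55")
W_MATERIAL = Decimal("0.45")


def main_board_score(code_execution: str, material_constraints: str) -> Decimal:
    """Weighted composite of the two main board dimensions.

    Rounding half-up to two decimals is an assumption, not a documented rule.
    """
    raw = W_CODE * Decimal(code_execution) + W_MATERIAL * Decimal(material_constraints)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# GPT-4o's scores as reported this week.
gpt4o = main_board_score("76.6", "49.3")

# 豆包 Pro's composite and Material Constraints score as reported.
doubao_composite = Decimal("85.03")
composite_gap = doubao_composite - gpt4o
material_gap = Decimal("77.6") - Decimal("49.3")

print(gpt4o, composite_gap, material_gap)
```

Decimal arithmetic is used here because the unrounded composite lands exactly on a half-cent boundary (64.315), where binary floating point could round either way.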

Wenxin Yiyan: Steady Progress from a Domestic Model

Amid the general decline, Baidu's Wenxin Yiyan 4.0 stood out. Its Code Execution score improved from 79 to 85.8, a gain of 6.8 points, making it the only model this week to post positive growth in a main board dimension. That lifted Wenxin Yiyan's main board composite score to 79.59, good for 7th place.
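
The article does not report Wenxin Yiyan's Material Constraints score, but it can be back-solved from the composite and the Code Execution score, assuming the same 0.55/0.45 weighting applies. The sketch below is illustrative only: the function name is hypothetical, and because the published composite is itself rounded, the result is an approximation.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights from the article's formula:
# Main Board Score = 0.55 x Code Execution + 0.45 x Material Constraints
W_CODE = Decimal("0.55")
W_MATERIAL = Decimal("0.45")


def implied_material_constraints(composite: str, code_execution: str) -> Decimal:
    """Solve the weighting formula for the Material Constraints score.

    The input composite is a rounded published figure, so the output is
    approximate; rounding to one decimal is an assumption.
    """
    raw = (Decimal(composite) - W_CODE * Decimal(code_execution)) / W_MATERIAL
    return raw.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Wenxin Yiyan 4.0: composite 79.59, Code Execution 85.8 (both from the article).
wenxin_material = implied_material_constraints("79.59", "85.8")
print(wenxin_material)
```

With these inputs the implied Material Constraints score comes out at 72.0, below the 73.4 to 79 range reported for this week's top three.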

Notably, Wenxin Yiyan's progress does not look like a flash in the pan. Its Code Execution score is now approaching those of DeepSeek V3 (87.3) and Claude Sonnet 4.6 (88.7), although among domestic models it still trails leaders such as 豆包 Pro, DeepSeek R1, and DeepSeek V3.

Side Board Turbulence: Engineering Judgment Hit Hardest

This week's side board dimensions also experienced dramatic fluctuations. Grok 3's Engineering Judgment (Side Board, AI-assisted evaluation) plummeted 10.2 points to 35.3, while Claude Opus 4.6 also dropped 6 points. The Engineering Judgment dimension examines model performance in complex engineering decisions, and the simultaneous decline of these two top models may indicate increased difficulty in evaluation questions.

In contrast, the Task Expression dimension (Side Board, AI-assisted evaluation) saw a rare "collective rise": Claude Sonnet 4.6, DeepSeek V3, 豆包 Pro, Gemini 2.5 Pro, and Qwen Max each gained 5 points. Gains this uniform point to an adjustment in evaluation standards rather than a genuine improvement in model capabilities.

Stability Crisis: DeepSeek V3's Hidden Concerns

While DeepSeek V3 ranks 4th on the main board, its stability score is only 31.7, meaning the model exhibits extreme score fluctuations when answering similar questions, with severely insufficient consistency. In comparison, 豆包 Pro's stability reaches 95.7, demonstrating the reliability expected of industrial-grade products.

GPT-o3's stability is even lower at 14.7, and combined with its Material Constraints score of 58.5, this highly anticipated new model clearly needs substantial optimization work.

Deep Analysis: The Value of YZ Index

This week's results again illustrate what YZ Index is designed to show. By focusing on two auditable dimensions, Code Execution and Material Constraints, the index tracks actual changes in model capability rather than marketing rhetoric. GPT-4o's cliff-like drop and Wenxin Yiyan's contrarian rise were both measured under the same strict evaluation system.

Particularly noteworthy is that this week's top three (豆包 Pro, Grok 3, DeepSeek R1) all scored above 88 in Code Execution but diverged in Material Constraints (77.6, 79, and 73.4 respectively), a spread of 5.6 points. This suggests that competition among top models has shifted from pure coding ability toward broader comprehension and the ability to stay within constraints.

Prediction: With GPT-4o's collapse and the rise of domestic models, 2026 may become a pivotal year in the restructuring of the AI landscape. Next week's evaluation will focus on whether GPT-4o can halt its decline and recover, and whether Wenxin Yiyan can maintain its upward momentum. In this AI arms race fought without gunfire, stability and material constraint capabilities are becoming key factors in deciding the winner.


Data source: YZ Index | Run #41