Wenxin 4.0's One Line of Code Exposes Fatal Flaw: When AI Can't Even Recognize a Dictionary

Mar 21, 2026 | Source: Winzheng Index

Wenxin Yiyan 4.0 | Programming Ability | Code Generation | Model Degradation | Baidu AI

This might be the most absurd case of AI degradation I've ever seen: a model claiming to rival GPT-4 can't handle one of Python's most basic features, the dictionary comprehension. Even more bizarre, it outputs a list instead of a dictionary and inexplicably appends three stray numbers.

Why Did an Elementary-Level Question Break Wenxin?

Let's look at the question that completely derailed Wenxin Yiyan 4.0: create a simple square mapping dictionary using dictionary comprehension. This is introductory Python knowledge that any beginner who's studied Python for a week could answer instantly. The correct answer should be {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}.
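For reference, here is a minimal sketch of the kind of answer the question calls for. The exact prompt wording is not reproduced here, so the range 0 through 4 is an assumption inferred from the expected result, and the variable name `squares` is my own.

```python
# Build a mapping from each integer 0..4 to its square with a dict comprehension.
# The range(5) bound is inferred from the expected answer, not from the prompt itself.
squares = {n: n ** 2 for n in range(5)}

print(squares)        # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
print(type(squares))  # <class 'dict'>
```

The curly braces with a `key: value` expression are what make this a dictionary comprehension; square brackets around a tuple expression would produce a list of pairs instead.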

But Wenxin 4.0's answer is jaw-dropping:

[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]
99 25
5

This response exposes three fatal problems: first, it outputs a list of tuples instead of a dictionary; second, the numbers "99 25" appear out of nowhere; third, a lone "5" trails at the end. This isn't a simple formatting error, but a fundamental confusion in the model's understanding of Python's basic data structures.
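To make the first problem concrete, the first line of the response can be parsed and compared with the expected dictionary. This is only an illustrative check, not part of the original evaluation; the names `model_output`, `expected` and `parsed` are my own, and the stray "99 25" and "5" lines are left out because they are not valid Python literals.

```python
import ast

# First line of the response quoted above, and the expected answer.
model_output = "[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]"
expected = {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

# literal_eval parses Python literals without executing arbitrary code.
parsed = ast.literal_eval(model_output)

is_dict = isinstance(parsed, dict)
# A list of (key, value) tuples converts cleanly to a dict, so the pairs can be compared.
same_pairs = dict(parsed) == expected

print(type(parsed).__name__)  # list
print(is_dict)                # False
print(same_pairs)             # True
```

In other words, the key/value pairs in that first line are correct; what the model got wrong is the container type, plus the unexplained trailing numbers.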

Stability Plummets 3.7 Points: This Is No Accident

Even more concerning, the stability score fell from 41.7 to 38.0, a drop of 3.7 points (about 8.9% in relative terms). In AI evaluation systems, a stability score below 40 means the model has entered the "danger zone": you never know what absurd answer it might give next.

Interestingly, Wenxin 4.0's overall programming score only dropped 2.3 points (from 84.7 to 82.4), indicating it still performs reasonably well on other programming tasks. But it's precisely this "selective amnesia" that's most terrifying—a model that can score high on complex algorithm problems completely fails on the most basic syntax questions. This unpredictability is catastrophic for practical applications.
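The two score changes are easier to compare side by side as relative drops. The short calculation below uses only the figures quoted in this article; the helper `relative_drop` is a hypothetical name introduced for illustration.

```python
def relative_drop(before: float, after: float) -> float:
    """Return the drop from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100


stability = relative_drop(41.7, 38.0)    # stability score, before and after
programming = relative_drop(84.7, 82.4)  # overall programming score, before and after

print(f"stability:   {41.7 - 38.0:.1f} points, {stability:.1f}%")    # 3.7 points, 8.9%
print(f"programming: {84.7 - 82.4:.1f} points, {programming:.1f}%")  # 2.3 points, 2.7%
```

Relative to its starting value, stability fell roughly three times as much as the overall programming score.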

Technical Analysis: Why "99 25 5"?

As an analyst who has tracked AI models for years, I can offer three speculative explanations for this bizarre output:

1. Training data contamination: The model might have seen similar number combinations in specific code snippets, causing erroneous associations during generation. "99 25" might come from some code example involving square calculations.

2. Attention mechanism failure: Dictionary comprehension requires the model to accurately understand the semantics of curly braces. When attention weight distribution goes wrong, the model might confuse list and dictionary representations.

3. Capability degradation from over-fine-tuning: Baidu might have accidentally damaged the model's understanding of basic concepts while fine-tuning for specific tasks. This is a common pitfall in large model development.

The Deeper Implications of This Incident

On the surface, this is just a programming question mistake. But deeper analysis reveals a common dilemma for Chinese large models: excessive pursuit of benchmark scores while neglecting the stability of fundamental capabilities.

Wenxin Yiyan 4.0 still showed slight improvement in the knowledge work dimension (+1.3 points), with cost-effectiveness remaining high at 99.1 points, suggesting Baidu might have adopted a "focus on the big, ignore the small" optimization strategy. But the problem is, for an AI model claiming to be "infrastructure," any collapse in fundamental capabilities is unacceptable.

More ironically, this error occurred in an area where Baidu should excel. As China's largest search engine company, Baidu has accumulated massive amounts of code data and should have natural advantages in programming tasks. But reality has given us a resounding slap in the face.

A Warning for the Industry

This incident sounds an alarm for the entire AI industry:

Basic testing cannot be ignored: Even the most advanced models must pass the most basic tests; otherwise they're castles built on sand.
Stability matters more than peak performance: Users need predictable, reliable AI, not "Schrödinger's models" that are sometimes good, sometimes bad.
Degradation monitoring must be transparent: Model capabilities may degrade with updates, so comprehensive monitoring systems are required.
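As a rough sketch of what such a basic regression check could look like, the snippet below sends the same prompt several times and tallies how often the reply parses to the expected value with the expected type. This is an assumption-laden illustration, not the method used by the index: `stability_check`, `generate` and `fake_model` are hypothetical names, and `fake_model` merely replays the flawed reply quoted earlier instead of calling any real API.

```python
import ast
from collections import Counter
from typing import Callable


def stability_check(generate: Callable[[str], str], prompt: str,
                    expected: object, runs: int = 10) -> dict:
    """Call `generate` repeatedly and tally how often the parsed reply matches `expected`."""
    outcomes = Counter()
    for _ in range(runs):
        reply = generate(prompt)
        try:
            # Parse the reply as a Python literal without executing any code.
            value = ast.literal_eval(reply.strip())
        except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
            outcomes["unparseable"] += 1
            continue
        # Require both the right value and the right container type.
        if type(value) is type(expected) and value == expected:
            outcomes["pass"] += 1
        else:
            outcomes["wrong"] += 1
    return {
        "pass_rate": outcomes["pass"] / runs,
        "pass": outcomes["pass"],
        "wrong": outcomes["wrong"],
        "unparseable": outcomes["unparseable"],
    }


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; replays the reply quoted in this article.
    return "[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]\n99 25\n5"


report = stability_check(fake_model, "square mapping prompt",
                         {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}, runs=5)
print(report)  # {'pass_rate': 0.0, 'pass': 0, 'wrong': 0, 'unparseable': 5}
```

Running a handful of checks like this after every model update, and publishing the pass rates, would make this kind of regression visible before users run into it.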

If an AI can't even recognize a dictionary, why should we believe it can understand the world? This isn't just Wenxin Yiyan's problem, but a question the entire industry needs to ponder deeply. When we discuss AGI and surpassing humans, shouldn't we first ensure AI can consistently complete first-grade homework?

Data source: YZ Index | Run #33