DeepSeek R1 Stability Plummets 22 Points: The Truth Behind Complete Failure on Simple Judgment Questions

Mar 22, 2026 | Winzheng Index

DeepSeek R1, Stability Testing, AI Reasoning Failure, Model Degradation, Engineering Reliability

When an AI model that claims "superior reasoning ability" cannot correctly judge whether "water can boil to 101 degrees under standard pressure," how can we trust it to handle complex problems in production? DeepSeek R1's test results this week are jaw-dropping: its stability score plummeted from 53.7 to 31.6 points, a drop of 22.1 points, or 41.2%.

Shocking Mistakes: When AI Loses Basic Judgment

What's most alarming isn't the score itself, but the types of questions it failed. According to the original test logs, DeepSeek R1 got all of the following basic questions wrong:

Question 1: "Can water boil to 101 degrees Celsius under standard atmospheric pressure?"
Correct answer: No
R1's answer: Yes (incorrect)

Question 2: "What is the result of 0.1 + 0.2 == 0.3 in Python?"
Correct answer: False
R1's answer: True (incorrect)
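
The expected answer to Question 2 can be checked directly. The snippet below is a minimal sketch using only the Python standard library; the variable names are illustrative and are not taken from the test logs. It shows why the exact comparison is False, and two common ways to compare such values when equality is actually intended.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2

# Binary floating point cannot represent 0.1, 0.2, or 0.3 exactly,
# so the computed sum differs from 0.3 in its last bits.
exact_equal = (total == 0.3)

# A tolerance-based comparison is the usual way to compare floats.
close_enough = math.isclose(total, 0.3)

# Decimal values built from string literals are stored exactly.
decimal_equal = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

print(exact_equal, close_enough, decimal_equal)  # False True True
```

The first result is the one the question asks about. The other two are included only to show that the False is a property of binary floating point, not of the arithmetic itself.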

This wasn't an occasional slip-up. Across 5 consecutive test rounds, R1's error rate on these basic judgment questions reached 80%. Stranger still, R1 maintained over 90% accuracy on the same questions in last week's tests.
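
An error rate of this kind is simply the number of wrong answers divided by the number of graded answers. The sketch below assumes each graded answer is stored as a boolean. The function name `error_rate` and the outcome list are hypothetical placeholders chosen to reproduce an 80% rate; they are not the actual test logs.

```python
from typing import Sequence


def error_rate(outcomes: Sequence[bool]) -> float:
    """Return the fraction of wrong answers; True means the answer was correct."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    wrong = sum(1 for is_correct in outcomes if not is_correct)
    return wrong / len(outcomes)


# Placeholder data (NOT real logs): 2 questions x 5 rounds = 10 graded answers,
# of which 8 are marked wrong.
placeholder_outcomes = [
    False, False, True, False, False,
    False, True, False, False, False,
]

rate = error_rate(placeholder_outcomes)
print(f"error rate: {rate:.0%}")  # error rate: 80%
```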

The Data Paradox: Programming Skills Soar While Basic Judgment Collapses

What's puzzling is that while stability collapsed, R1's other metrics were soaring:

Programming ability: Skyrocketed from 20.5 to 67.9 points (+231%)
Long context processing: Improved from 60.2 to 78.3 points (+30%)
Cost-effectiveness index: Rose from 69.4 to 88.1 points (+27%)
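
The percentage changes quoted in this article can be recomputed from the before and after scores. This is a minimal sketch: the helper `percent_change` and the dictionary keys are illustrative names, while the score pairs are the ones reported above.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100.0


# (last week, this week) score pairs as reported in the article.
scores = {
    "stability": (53.7, 31.6),
    "programming": (20.5, 67.9),
    "long_context": (60.2, 78.3),
    "cost_effectiveness": (69.4, 88.1),
}

changes = {
    name: round(percent_change(before, after), 1)
    for name, (before, after) in scores.items()
}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

Rounded to one decimal place, this gives about -41.2% for stability, +231.2% for programming, +30.1% for long context, and +26.9% for cost-effectiveness.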

This split performance points to an uncomfortable possibility: DeepSeek may have sacrificed the model's basic reasoning consistency in pursuit of gains on certain metrics.

Technical Analysis: The Price of Over-Optimization

From an engineering perspective, a pattern like this can plausibly stem from three causes:

1. Training Data Contamination
R1's latest round of fine-tuning may have introduced a large amount of programming-related data that crowded out basic common-sense knowledge. As the model weights tilted toward programming tasks, basic world knowledge was "diluted," an effect often described as catastrophic forgetting.

2. Confused Reasoning Paths
Analyzing R1's chain of thought reveals that when answering "can water boil to 101 degrees," it introduced the concept of "floating-point precision in programming," attempting to explain a physical phenomenon from a numerical computation angle. This kind of cross-domain false analogy suggests that the boundaries between domains in the model's reasoning have become blurred.

3. Benchmark-Oriented Overfitting
R1's surge in programming ability is likely the result of optimization targeting specific benchmarks. But this "test-oriented education" style of training may have cost the model its grasp of basic facts.

Industry Warning: Stability is the Lifeline of AI Applications

Comparing stability performance of other mainstream models:

GPT-4: Stability score stays in the 85-90 point range, fluctuating by less than 5%
Claude 3: Stability score 82-88 points, 99% accuracy on basic judgment questions
Gemini Pro: Stability score 78-84 points, rarely makes outrageous errors

DeepSeek R1's stability score of 31.6 points has fallen below the passing line for production environment applications. Imagine if an AI assistant tells you today that "water can boil to 101 degrees" and tomorrow that "0.1+0.2 equals 0.3" - would you dare use it for critical decisions?
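
One way a team could guard against this kind of regression is a simple release gate on the stability score. The sketch below is an assumption-laden illustration: the article does not state where the passing line sits, so `min_score=60.0` and `max_relative_drop=0.10` are hypothetical values, and `stability_gate` and `GateResult` are made-up names rather than part of any real tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GateResult:
    passed: bool
    reasons: Tuple[str, ...]


def stability_gate(
    previous: float,
    current: float,
    min_score: float = 60.0,          # hypothetical passing line
    max_relative_drop: float = 0.10,  # hypothetical tolerated week-over-week drop
) -> GateResult:
    """Fail if the score is below the floor or fell too far since the last run."""
    reasons: List[str] = []
    if current < min_score:
        reasons.append(f"score {current} is below the floor {min_score}")
    if previous > 0:
        drop = (previous - current) / previous
        if drop > max_relative_drop:
            reasons.append(f"score fell {drop:.1%} since the last run")
    return GateResult(passed=not reasons, reasons=tuple(reasons))


# Stability scores reported in the article: 53.7 last week, 31.6 this week.
result = stability_gate(previous=53.7, current=31.6)
print(result.passed)
for reason in result.reasons:
    print("-", reason)
```

With the reported scores, both checks fail under these assumed thresholds. A real gate would need thresholds calibrated against the benchmark's own score distribution.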

Conclusion: Don't Be Fooled by Surface Metrics

DeepSeek R1's "accident" sounds an alarm for the entire industry. In the pursuit of SOTA (State of the Art), we cannot ignore the most basic requirements - consistency and reliability.

While a programming ability increase from 20.5 to 67.9 points is certainly impressive, what's the point of such "progress" if the model gets middle school physics wrong? As a senior AI researcher commented:

"An unstable AI system is like a high-precision gun that frequently misfires - it looks advanced, but using it could be fatal."

Prediction: If DeepSeek cannot resolve the stability issues in its next version, R1 will become a textbook case of "high scores but low ability," forever nailed to the pillar of shame in AI development history.

Data source: YZ Index | Run #37