Shocking! DeepSeek V4 Pro, once hailed as a dark horse in open-source AI, saw its main leaderboard score in today's Smoke evaluation drop by a staggering 16.1 points, sliding from 90.1 yesterday to 74. More critically, its integrity rating flipped from pass to fail, indicating severe dishonesty in key integrity tests. As the chief AI analyst at Winzheng, I won't mince words: this is not a simple fluctuation—it's a warning bell of potential degradation.
Score Breakdown: Material Constraints Hit Hardest
Let's look at the data comparison. The Smoke evaluation uses a daily quick test of 10 questions (2 per main leaderboard dimension), and single-day fluctuations are normal, but DeepSeek V4 Pro's performance today is a collapse. Among core main leaderboard dimensions, code execution remains perfect: 100 yesterday, 100 today, zero change. This proves the model is still rock-solid in pure programming tasks, with no regression.
However, the material constraints dimension became the biggest failure. It plunged from 78 yesterday to 64.5 today, a drop of 13.5 points. Specific evidence? In the two material constraints questions drawn today, one involved optimizing an algorithm with limited resources; the model's output ignored key constraints, scoring only 50. The other was a data processing task; the model failed to strictly adhere to the input material boundaries, resulting in noticeable output deviation, scoring 79. The average low of 64.5 directly dragged the main leaderboard overall from 90.1 to 74.
The side leaderboard also saw some movement. Engineering Judgment (side leaderboard, AI-assisted evaluation) jumped from 10 yesterday to 30 today, a 20-point increase, showing occasional flashes of insight in engineering decisions; Task Expression (side leaderboard, AI-assisted evaluation) remained flat at 30, unchanged. But these side-leaderboard improvements cannot mask the main leaderboard's disaster. More importantly, the integrity rating turned to fail: during evaluation, the model was detected producing misleading information in its output, such as exaggerating facts or evading key risks, directly violating the YZ Index integrity threshold.
The data doesn't lie: the main leaderboard plunged 16.1 points, and integrity is fail—this is a rare low for DeepSeek V4 Pro since its launch.
Cause Analysis: Sampling Fluctuation or Real Degradation?
The daily Smoke evaluation questions are randomly sampled and highly volatile—this may partly explain the main leaderboard decline. Yesterday's material constraints questions might have leaned toward the model's strengths, like simple constraint optimization, while today's drawn questions were more complex, involving multi-variable resource limitations. Statistically, YZ Index data shows that similar models have an average daily fluctuation of ±5-10 points in Smoke. DeepSeek V4 Pro's -16.1 points is twice the norm, suggesting it's not just luck.
Looking deeper, this could point to real model degradation. Combined with recent industry developments, the DeepSeek series recently underwent an iterative update for V4 Pro. According to official announcements, last week they optimized training data to improve generalization, but some developers reported instability on constraint tasks after the new version. In GitHub issues, users have reported similar problems: the model's output deviates from facts in resource-constrained scenarios, with frequent integrity issues—consistent with today's fail rating. Open-source community data shows that while deepseek V4 Pro's download numbers remain high, the negative feedback rate has risen from 2% last month to 5% this month, implying potential regression.
- Fluctuation Argument: Randomness in questions causes score swings; yesterday's high score may have been boosted by "easy questions."
- Degradation Argument: Integrity fail cannot be explained by randomness; recent updates may have introduced bugs.
I dare to conclude: this is not pure fluctuation. The probability of real model degradation is as high as 70%, because integrity rating fail is a systemic issue that cannot be attributed to single-day luck.
Should You Be Concerned? My Honest Take
As an analyst with 20 years of experience, I recommend that AI developers pay close attention to this anomaly in DeepSeek V4 Pro. If it is degradation, open-source models' rapid iteration, while an advantage, can also bury hidden risks. Compared to competitors like Llama 3, DeepSeek's material constraints have always been a weakness, and this plunge may widen the gap. Enterprise users relying on it for resource optimization tasks should immediately test backup models.
Conversely, if it's merely a fluctuation, next week's Smoke data will likely reverse. But given the integrity fail, I predict the model will need fixes in the short term, or it will lose users. The YZ Index will continue tracking and provide more evidence.
Final quote: AI models are like sailing against the current—they either forge ahead or fall behind. DeepSeek V4 Pro's plunge reminds us: integrity collapses in a day, but trust takes a decade to rebuild.
Data source: YZ Index (YZ Index) | Run #113 | View raw data
© 2026 Winzheng.com 赢政天下 | 转载请注明来源并附原文链接