DeepSeek V4 Pro posted an unusual result in today's Smoke evaluation: its integrity rating went straight from Fail to Pass, and its main ranking score rose from 74.00 to 97.08, a single-day gain of about 23.1 points. Within that total, material constraint jumped from 70.00 to 93.50, while the two side metrics (engineering judgment and task expression) each rose by 20 points.
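As a quick sanity check on the headline numbers, the day-over-day changes can be recomputed from the quoted scores. This is only a minimal sketch: the dictionary keys and the `day_over_day_change` helper are illustrative names, not part of any YZ Index tooling.

```python
# Scores quoted in the article as (yesterday, today).
# The key names are illustrative labels, not official metric identifiers.
scores = {
    "main_ranking": (74.00, 97.08),
    "material_constraint": (70.00, 93.50),
}


def day_over_day_change(previous: float, current: float) -> float:
    """Return the point change between two daily scores, rounded to 2 places."""
    return round(current - previous, 2)


deltas = {
    name: day_over_day_change(prev, cur) for name, (prev, cur) in scores.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} points")
```

The main ranking change works out to 23.08 points, which rounds to the 23.1 quoted above, and material constraint moves by 23.50 points.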
Sampling Fluctuation or Genuine Improvement?
The Smoke evaluation uses only 10 questions per day, with just 2 questions per dimension, so the sample is very small and daily scores inherently carry a high standard deviation. Material constraint scored 70 yesterday and 93.5 today, a difference of 23.5 points, which is within the normal range of random variation for a sample this small. The rise in code execution from 95 to 100 could likewise be explained by today's draw happening to include two simple calculation questions.
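To make the small-sample point concrete, the sketch below computes the standard error of a mean score for different question counts. The per-question standard deviation of 30 points is purely an assumption chosen for illustration (the article does not report one), and `standard_error` is a hypothetical helper name.

```python
import math


def standard_error(per_question_std: float, n_questions: int) -> float:
    """Standard error of a mean score over n independent questions."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return per_question_std / math.sqrt(n_questions)


# ASSUMPTION: per-question scores vary with a standard deviation of 30 points.
# This value is hypothetical; it is not taken from the Smoke evaluation data.
ASSUMED_PER_QUESTION_STD = 30.0

for n in (2, 10, 100):
    se = standard_error(ASSUMED_PER_QUESTION_STD, n)
    print(f"n={n:>3}: standard error ~ {se:.1f} points")
```

Under this assumed spread, a score built from 2 questions has a standard error of roughly 21 points, so a swing of around 23 points between two days would not be surprising. With 100 questions the same spread gives a standard error of only 3 points.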
However, the integrity rating jumping directly from Fail to Pass raises a deeper question. The integrity rating is an entry threshold; a Fail typically means the model committed a clear violation in fact-checking or in rejecting harmful requests. Today's Pass shows only that no such error was triggered in today's 10 questions. In the short term, a "threshold crossing" of this kind is more likely to reflect a question set that happened to avoid sensitive scenarios than a fundamental change in the model's underlying safety alignment.
Impact of Recent Industry Developments
The DeepSeek team released an instruction fine-tuning patch for the V4 series last week, focused on reducing hallucination rates and improving tool call accuracy. The patch notes explicitly mention "enhancing fact consistency checks." If the patch has already been rolled out to the live service, today's recovery in material constraint and the integrity rating could be related to this update. However, the patch also reduced the diversity of some open-ended responses, which is consistent with both side metrics (engineering judgment and task expression, which are scored through AI-assisted assessment) remaining at a low 30 points.
Another backdrop is that DeepSeek has been putting continuous pressure on competitors in terms of cost and open-source strategy, leading to rising community skepticism about the model's "balance between safety and capability." Today's integrity rating passing the line may temporarily alleviate some public pressure, but single-day data is insufficient to prove the problem has been resolved.
Is Focused Attention Needed?
Yes, it is. The model's stability dimension deserves particular attention. Its stability score currently stands at only 31.7 points, which indicates that its scores fluctuate significantly across repeated responses to similar questions. A single-day surge of 23 points in the main ranking is more likely another manifestation of this high volatility than an actual upward shift in the capability curve.
We recommend observing at least three consecutive days of Smoke and full evaluation data. If the integrity rating stays at Pass and material constraint holds above 90 points throughout, the recovery can be treated as a trend. If a Fail occurs again within three days, today's result can be judged to be random noise.
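The observation rule above can be written down as a small decision function. This is a sketch under stated assumptions, not an official YZ Index procedure: `DailyResult` and `classify_recovery` are hypothetical names, and the "keep observing" and "inconclusive" outcomes are added here to cover cases the rule does not address.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DailyResult:
    """One day of results (hypothetical structure for illustration)."""
    integrity_pass: bool
    material_constraint: float


def classify_recovery(
    days: List[DailyResult],
    min_days: int = 3,
    threshold: float = 90.0,
) -> str:
    """Apply the three-day rule to a list of consecutive daily results."""
    # Any Fail in the observation window means the Pass was likely noise.
    if any(not day.integrity_pass for day in days):
        return "noise"
    # Fewer than the minimum number of days: no verdict yet (added case).
    if len(days) < min_days:
        return "keep observing"
    # All Pass and material constraint above the threshold every day.
    if all(day.material_constraint > threshold for day in days):
        return "trend"
    # All Pass, but material constraint dipped (added case, not in the rule).
    return "inconclusive"


# Today's result is from the article; further days would be appended as they arrive.
observed = [DailyResult(integrity_pass=True, material_constraint=93.5)]
print(classify_recovery(observed))
```

With only today's data the function returns "keep observing", which matches the point that a single day is not enough to call a trend.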
A single day of Smoke is like a single beat on an electrocardiogram; what really matters is whether the QRS complex returns to normal over multiple consecutive days.
Data Source: YZ Index | Run #130
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article