DeepSeek V4 Pro delivered extremely polarized results in today's Smoke evaluation. The main index jumped from 39.26 to 87.99, a gain of 48.7 points; the code execution dimension soared from 20.00 to 100.00, while material constraints saw a modest increase of 10.5 points. However, engineering judgment (side index, AI-assisted evaluation) plummeted from 38.40 to 10.00, a drop of 28.4 points.
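The reported changes can be checked with a few lines of arithmetic. The sketch below uses only the before and after values quoted above; the dictionary keys are illustrative labels, not official YZ Index field names. Material constraints is left out because only its change (10.5 points), not its absolute scores, is given.

```python
# Before/after scores quoted in the article; keys are illustrative labels.
scores = {
    "main_index": (39.26, 87.99),
    "code_execution": (20.00, 100.00),
    "engineering_judgment": (38.40, 10.00),
}


def score_deltas(pairs):
    """Return the change (after minus before) per entry, rounded to 2 decimals."""
    return {name: round(after - before, 2) for name, (before, after) in pairs.items()}


deltas = score_deltas(scores)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```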
Sampling Fluctuation or Genuine Regression?
The Smoke evaluation consists of only 10 questions per day, 2 per dimension. With a sample this small, large day-to-day score swings are normal. However, this run shows two opposite extremes at once: a perfect score in code execution alongside a collapse in engineering judgment, a pattern that is hard to attribute to sampling alone. The code execution questions may have happened to align with the model's recent training strengths, while the engineering judgment questions may have exposed its unstable decision-making under real-world constraints.
More notably, the integrity rating moved from fail to warn. It remains below the pass threshold, but it has moved from outright failure into the observation range. This suggests the model has improved at rejecting harmful requests or avoiding hallucinated outputs, without a matching improvement in the systematic thinking that engineering judgment requires.
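To get a feel for how noisy a two-question score can be, consider a simplified model in which each question is scored pass or fail and the model passes each one independently with probability p. This binary scoring is an assumption made purely for illustration (reported scores such as 38.40 suggest the real rubric is more granular), and the function names are hypothetical.

```python
from math import comb, sqrt


def two_question_distribution(p, n=2):
    """Distribution of a 0-100 dimension score when n pass/fail questions
    are each passed independently with probability p (illustrative assumption)."""
    dist = {}
    for k in range(n + 1):
        score = 100.0 * k / n
        dist[score] = comb(n, k) * p**k * (1 - p) ** (n - k)
    return dist


def score_std(p, n=2):
    """Standard deviation of that score: 100 * sqrt(p * (1 - p) / n)."""
    return 100.0 * sqrt(p * (1 - p) / n)


dist = two_question_distribution(0.5)
for score, prob in dist.items():
    print(f"score {score:5.1f}: probability {prob:.2f}")
print(f"standard deviation: {score_std(0.5):.1f} points")
```

Under this assumption, a model with a 50% per-question pass rate lands on either 0 or 100 half of the time, so a single day's dimension score carries little weight on its own.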
Recent Industry Dynamics Supporting the Findings
Last week, the DeepSeek team released a code-specific fine-tuned version of the V4 series, focused on strengthening LeetCode and multi-turn debugging scenarios. This aligns closely with today's score of 100 in code execution. However, community feedback from the same period suggests that the model's performance on complex system design and multi-constraint trade-off tasks has declined, which is consistent with the engineering judgment score of 10.
V4 Pro's stability score, measured in terms of standard deviation, is only 31.7 points, meaning its scores fluctuate significantly across repeated tests of similar questions. This further supports an assessment of "genuine capability instability" rather than "one-time luck."
Should It Be a Focus of Concern?
Yes. The main index of 87.99 is highly misleading: the engineering judgment score of 10, coupled with low stability, indicates that the model still has significant shortcomings for real engineering scenarios. Multi-round consistency testing is recommended before production deployment, rather than reliance on single-day Smoke scores alone.
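One way to approach such a multi-round check is to score the same dimension over several independent rounds and look at the spread before trusting the mean. The sketch below is a minimal illustration: the function name, the thresholds, and the example scores are all hypothetical and are not part of the YZ Index methodology.

```python
from statistics import mean, stdev


def consistency_report(round_scores, min_mean=70.0, max_std=15.0):
    """Summarize repeated-round scores and flag whether they look stable.

    min_mean and max_std are illustrative thresholds, not YZ Index values.
    """
    if len(round_scores) < 2:
        raise ValueError("need at least two rounds to estimate spread")
    avg = mean(round_scores)
    spread = stdev(round_scores)  # sample standard deviation
    return {
        "mean": round(avg, 2),
        "std": round(spread, 2),
        "worst": min(round_scores),
        "stable": avg >= min_mean and spread <= max_std,
    }


# Hypothetical scores from five rounds of one dimension.
example_rounds = [100.0, 50.0, 10.0, 80.0, 40.0]
report = consistency_report(example_rounds)
print(report)
```

Reporting the worst round alongside the mean keeps a single lucky run from masking a weak one, which is the failure mode a single-day score is prone to.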
A high score sometimes reflects only the two questions that happened to be drawn; a low score shows where the model's real limits lie.
Data source: YZ Index | Run #137
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article