DeepSeek V4 Pro Smoke Test: Main Index Soars by 48.7, while Engineering Judgment Plunges by 28.4

May 29, 2026 | Winzheng Index

Tags: DeepSeek V4 Pro, Code Execution, Smoke Test, Model Consistency, Engineering Judgment

DeepSeek V4 Pro delivered sharply polarized results in today's Smoke evaluation. The main index jumped from 39.26 to 87.99, a gain of 48.7 points; code execution soared from 20.00 to 100.00, a gain of 80 points; and material constraints rose by a modest 10.5 points. Engineering judgment (a side index, AI-assisted evaluation), however, plummeted from 38.40 to 10.00, a drop of 28.4 points.
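The changes quoted above can be re-derived from the before and after values. The following is a minimal Python sketch that uses only the figures reported in this article. The dictionary keys are illustrative labels, not official index field names, and material constraints is left out because only its change, not its endpoints, is reported.

```python
# Before/after scores as reported in this article.
# Keys are illustrative labels (assumption), not official index field names.
scores = {
    "main_index": (39.26, 87.99),
    "code_execution": (20.00, 100.00),
    "engineering_judgment": (38.40, 10.00),
}


def delta(before: float, after: float) -> float:
    """Signed change in points, rounded to two decimals."""
    return round(after - before, 2)


deltas = {name: delta(before, after) for name, (before, after) in scores.items()}

for name, change in deltas.items():
    print(f"{name}: {change:+.2f}")
```

The signed output makes the split visible at a glance: two large positive changes next to one large negative change.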

Sampling Fluctuation or Genuine Regression

The Smoke evaluation consists of only 10 questions per day, 2 per dimension. With a sample this small, large day-to-day score swings are normal. This change, however, shows two opposite extremes at once: a perfect score in code execution alongside a collapse in engineering judgment, which is hard to attribute to sampling alone. One plausible reading is that the code execution questions happened to align with the model's recent training strengths, while the engineering judgment questions exposed its unstable decision-making under real-world constraints.
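To see why two questions per dimension makes daily swings normal, consider the standard error of a mean over n independent question scores, which is sigma / sqrt(n). The sketch below assumes a hypothetical per-question standard deviation of 30 points and independent questions; both are illustrative assumptions, not figures taken from the index.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean of n independent scores with std dev sigma."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma / math.sqrt(n)


# Hypothetical per-question standard deviation (assumption, not index data).
SIGMA = 30.0

for n in (2, 10, 50):
    print(f"n={n:>2}: standard error = {standard_error(SIGMA, n):.1f} points")
```

Under these assumptions, a two-question dimension score carries an uncertainty of roughly 21 points, while a 50-question sample would bring it down to about 4 points.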

More notably, the integrity rating moved from fail to warn. Although still below the pass threshold, it has moved from outright failure into the observation range. This suggests the model has improved at rejecting harmful requests or avoiding hallucinated output, but without a matching gain in the systematic thinking that engineering judgment requires.

Recent Industry Dynamics Supporting the Findings

Last week, the DeepSeek team released a code-specific fine-tuned version of the V4 series, focused on strengthening LeetCode-style problems and multi-turn debugging scenarios. This aligns closely with today's score of 100 in code execution. Community feedback from the same period, however, reports that the model's performance on complex system design and multi-constraint trade-off tasks has declined, which is consistent with the engineering judgment score of 10.

In terms of standard deviation, V4 Pro's stability score is only 31.7 points, meaning its scores fluctuate significantly across repeated tests on similar questions. This further supports reading the result as "genuine capability instability" rather than "one-time luck."

Should It Be a Focus of Concern?

Yes. The main index of 87.99 is highly misleading: the engineering judgment score of 10, coupled with low stability, indicates that the model still has significant shortcomings for real engineering scenarios. Before deploying it in production, run multi-round consistency tests rather than relying solely on single-day Smoke scores.
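A multi-round consistency check of the kind recommended here could be sketched as follows. The run scores, the thresholds, and the function name `is_consistent` are all hypothetical; the index does not publish such a procedure, so treat this as one possible shape for the check rather than an established method.

```python
from statistics import mean, stdev


def is_consistent(run_scores: list[float], min_mean: float, max_stdev: float) -> bool:
    """Return True only if the average is high enough AND the spread is small enough."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(run_scores) >= min_mean and stdev(run_scores) <= max_stdev


# Hypothetical scores from repeated runs (illustrative, not real index data).
stable_runs = [72.0, 75.0, 70.0, 74.0, 73.0]
volatile_runs = [100.0, 10.0, 85.0, 20.0, 90.0]

# Hypothetical acceptance thresholds (assumption).
MIN_MEAN, MAX_STDEV = 60.0, 10.0

print(is_consistent(stable_runs, MIN_MEAN, MAX_STDEV))
print(is_consistent(volatile_runs, MIN_MEAN, MAX_STDEV))
```

The volatile series clears the average threshold yet fails on spread, which is the situation a single-day score cannot reveal.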

A high score sometimes reflects only the two questions that happened to be drawn; a low score shows the model's true floor.

Data source: YZ Index | Run #137
