DeepSeek V4 Pro just produced its most paradoxical result yet: the main benchmark rose by 5 points while the integrity rating dropped from pass to fail. This is not an ordinary score fluctuation; it is the classic alarm pattern of "capability appears to strengthen while trustworthiness at the admission gate is breached."
First, look at the raw data: the score increase hides hard flaws
Today's Smoke evaluation is a quick daily test of 10 questions, 2 per dimension. The sample is very small, so day-to-day draw fluctuations must be factored into any interpretation. Even so, the data is glaring: code execution jumped from 69.00 to 100.00, a single-day gain of 31 points; material adherence fell from 69.00 to 64.50, down 4.5 points; and the main benchmark rose from 69.00 to 74.00, up 5 points.
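As a quick sanity check, the day-over-day deltas quoted above can be reproduced in a few lines of Python. The dictionary layout here is purely illustrative and is not the YZ Index's actual data format:

```python
# Day-over-day score deltas for the figures quoted in the article.
# Each entry maps a dimension to (yesterday's score, today's score).
scores = {
    "code_execution":     (69.00, 100.00),
    "material_adherence": (69.00, 64.50),
    "main_benchmark":     (69.00, 74.00),
}

for dim, (prev, today) in scores.items():
    delta = today - prev
    print(f"{dim}: {prev:.2f} -> {today:.2f} ({delta:+.2f})")
```

Running this confirms the three movements described above: +31.00, -4.50, and +5.00.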
The main benchmark only looks at two auditable dimensions: code execution and material adherence. Therefore, DeepSeek V4 Pro today is not "improving across the board," but rather code execution soared sharply, offsetting the decline in material adherence.
The real problem lies at the admission layer: the integrity rating went from pass to fail. Under the YZ Index methodology, the integrity rating is not a bonus item but a threshold. Even if a model's main benchmark rises, as long as its integrity rating is fail, the day should not simply be read as "performing better." It is like a race car lapping faster while its brake warning light is on: you cannot just stare at the stopwatch.
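The gate-over-score logic described here can be sketched as follows. The field names, rating values, and return strings are assumptions made for illustration, not the YZ Index's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DailyResult:
    main_benchmark: float  # score over the auditable dimensions
    integrity: str         # "pass", "warn", or "fail" (assumed values)

def headline_verdict(r: DailyResult) -> str:
    """Integrity acts as an admission gate, not a bonus item:
    a non-passing gate overrides any score improvement."""
    if r.integrity != "pass":
        return "gated: do not read the score as 'better today'"
    return "admitted: score comparison is meaningful"

# Today's numbers from the article: score up, but the gate failed.
print(headline_verdict(DailyResult(main_benchmark=74.0, integrity="fail")))
```

The key design choice is that the gate check runs before any score comparison, so a rising benchmark can never paper over a trustworthiness failure.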
Draw fluctuations can explain part of it, but not all
The 10-question Smoke test does amplify fluctuations. Code execution has only 2 questions; if the draw happens to favor the model's strengths, a jump from 69 to 100 is unsurprising. Engineering judgment dropped from 38.40 to 10.00 and task expression from 50.00 to 30.00; both are side benchmarks scored by AI-assisted evaluation, and both are susceptible to question-type effects.
But the integrity rating going from pass to fail is different in nature. It is usually not about an answer being merely inelegant; it signals more fundamental trustworthiness failures: declining to acknowledge uncertainty when the task calls for it, overstepping when citing materials, or hard flaws in scenarios that require refusal, explanation of limitations, or consistent messaging. Combined with the simultaneous 4.5-point drop in material adherence, today looks less like a run of hard questions and more like cracks appearing in constraint adherence.
Place it in the context of recent industry dynamics: speed and constraint are pulling against each other
The DeepSeek series has been in a high-focus zone recently: low-cost inference, developer calls, open-source ecosystem, and iteration speed are all industry discussion hot spots. The problem is that the faster a model enters high-frequency applications, the easier it is to expose a contradiction: code questions can be quickly patched up by training and toolchains, but material adherence, boundary awareness, and trustworthy output often rely more on post-training strategies, evaluation closed loops, and online policy stability.
Today's data lands squarely on that contradiction: a perfect score in code execution shows a clear strength on verifiable tasks, while the decline in material adherence and the fail integrity rating expose risk in whether the model should say something, whether it may say it that way, and whether it sticks strictly to the input materials. For enterprise users, the latter often matters more than the former: code errors can still be caught by tests, while fabricated evidence can derail business decisions directly.
My judgment: needs attention, but not yet time to conclude degradation
The conclusion is clear: this is not noise that can be ignored, but neither can we declare actual degradation of DeepSeek V4 Pro on the strength of a single day of 10 questions. A reasonable approach is to keep observing for 3 to 5 days, focusing on three things:
- Whether the integrity rating returns to pass, or continues to fail or warn;
- Whether material adherence continues to decline, especially if it coincides with integrity issues;
- Whether the perfect score in code execution is reproducible or just a one-time question-type advantage.
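The watch rule implied by these three checks can be sketched as a small helper. All field names, thresholds, and verdict strings below are assumptions for illustration, not Winzheng's actual classifier:

```python
def classify_watch_window(days: list[dict]) -> str:
    """Hypothetical 3-to-5-day watch rule (illustrative only):
    a model that keeps scoring well while repeatedly failing the
    integrity gate is treated as high-capability but high-risk."""
    if len(days) < 3:
        return "insufficient data: keep observing"
    integrity_fails = sum(1 for d in days if d["integrity"] != "pass")
    consistently_strong = all(d["main_benchmark"] > 70 for d in days)
    if consistently_strong and integrity_fails >= 2:
        return "high-capability, high-risk"
    if integrity_fails == 0:
        return "recovered: treat the fail as a one-day event"
    return "mixed signal: extend the observation window"

# Example: three days of strong scores with two integrity failures.
window = [
    {"main_benchmark": 74.0, "integrity": "fail"},
    {"main_benchmark": 72.5, "integrity": "pass"},
    {"main_benchmark": 71.0, "integrity": "fail"},
]
print(classify_watch_window(window))  # -> high-capability, high-risk
```

Separating "score stays strong" from "gate keeps failing" is what lets the rule distinguish genuine recovery from sustained high-capability, high-risk behavior.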
If in the following days the main benchmark stays above 70 but the integrity rating repeatedly fails, Winzheng will classify it as a "high-capability, high-risk" model: suitable for sandbox, testing, and code assistance, but not for direct entry into serious business closed loops.
Today's golden quote: It's not surprising that a model runs fast; the key is whether it can brake before the red line.
Data source: YZ Index | Run #117
© 2026 Winzheng.com 赢政天下 | Please credit the source and link to the original when republishing