DeepSeek V4 Pro Smoke Test: Main Index Soars by 48.7 Points, While Engineering Judgment Plunges by 28.4

DeepSeek V4 Pro delivered sharply polarized results in today's Smoke evaluation. The main index jumped from 39.26 to 87.99, a gain of 48.7 points. The code execution dimension soared from 20.00 to 100.00, a gain of 80 points, while material constraints rose a more modest 10.5 points. Engineering judgment (a side index scored with AI assistance), however, plummeted from 38.40 to 10.00, a drop of 28.4 points.
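
The point changes quoted above follow directly from the before and after scores. As a minimal sketch, the arithmetic can be checked as below; the dictionary keys and the helper name point_change are illustrative assumptions, not part of the Smoke evaluation itself. Material constraints is left out because only its change, not its before and after scores, is reported.

```python
# Before/after scores as reported in the summary above.
# The keys and the helper name are illustrative, not taken from the Smoke evaluation.
reported = {
    "main_index": (39.26, 87.99),
    "code_execution": (20.00, 100.00),
    "engineering_judgment": (38.40, 10.00),
}


def point_change(before: float, after: float) -> float:
    """Return the change in points, rounded to one decimal place."""
    return round(after - before, 1)


# Compute the change for each reported dimension.
changes = {name: point_change(b, a) for name, (b, a) in reported.items()}

for name, change in changes.items():
    # The explicit sign makes gains and drops easy to tell apart.
    print(f"{name}: {change:+.1f}")
```

Rounding to one decimal place matches the precision used in the summary.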

DeepSeek V4 Pro, Code Execution, Smoke Test

Smoke Evaluation Sees Across-the-Board Plunge: 11 Models Drop 42 Points on Average on Main Leaderboard, Code Execution Dimension Collapses for All

In the Smoke evaluation released at 3 AM today, all 11 mainstream models fell on the main leaderboard, dropping 42 points on average. Gemini 3.1 Pro topped the list with 40.48 points, but that score is itself down 33.5 points from yesterday: the model retained only 20 points in the code execution dimension and 65.5 points in the material constraints dimension.

Code Execution, Material Constraints, Gemini 3.1 Pro