Grok 4 Smoke Evaluation Main Score Plummets 15.3 Points, Code Execution Drops 31.4 in a Single Day

Jul 1, 2026 | Source: Winzheng Index

Grok 4 | Code Execution | Single-Day Fluctuation | Smoke Quick Test | Model Consistency

In today's YZ Index Smoke evaluation, Grok 4's main score dropped from 97.98 to 82.73, a decrease of about 15.3 points, and the code execution dimension fell from 100.00 to 68.60, a drop of 31.4 points.

Fluctuation Range Caused by Single-Day 10-Question Draw

The Smoke evaluation draws only 2 questions per dimension per day, 10 questions in total. Code execution lost 31.4 points in a single day, and engineering judgment fell in the same direction but by less, from 92.40 to 77.20 points. Over the same day, material compliance rose from 95.50 to 100.00 points and task expression rose from 91.30 to 100.00 points. Dimensions moving in opposite directions on the same day is consistent with the statistical behavior of small-sample draws.
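The day-over-day changes can be recomputed from the figures quoted in this article. The snippet below is only a small arithmetic check; the dictionary layout and the `deltas` helper are illustrative names, not part of any YZ Index tooling.

```python
# Scores quoted in this article, as (yesterday, today) pairs.
scores = {
    "main": (97.98, 82.73),
    "code execution": (100.00, 68.60),
    "material compliance": (95.50, 100.00),
    "task expression": (91.30, 100.00),
    "engineering judgment": (92.40, 77.20),
}


def deltas(table):
    """Return today's score minus yesterday's, rounded to 2 decimals."""
    return {
        name: round(today - yesterday, 2)
        for name, (yesterday, today) in table.items()
    }


for name, change in deltas(scores).items():
    # Explicit sign makes rises and falls easy to tell apart.
    print(f"{name:22s} {change:+.2f}")
```

Two dimensions fall and two rise on the same day, which is the mixed pattern the paragraph above describes.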

When the question set includes code problems requiring multi-step debugging or specific library calls, single-day scores are prone to swings of over 30 points. The gap between yesterday's 100.00 and today's 68.60 is not uncommon under a 2-question sample.
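To see why a 2-question sample swings so much, here is a minimal sketch under stated assumptions: each question is scored independently, the daily dimension score is the plain average of its questions, and the per-question standard deviation is a hypothetical 25 points. None of these are confirmed details of the Smoke scoring method.

```python
import math


def daily_score_sd(per_question_sd: float, n_questions: int) -> float:
    """Standard deviation of a daily score that averages n independent questions."""
    return per_question_sd / math.sqrt(n_questions)


def day_to_day_swing_sd(per_question_sd: float, n_questions: int) -> float:
    """Standard deviation of the difference between two independent daily scores."""
    return math.sqrt(2) * daily_score_sd(per_question_sd, n_questions)


# Hypothetical value chosen for illustration only; the article does not
# report per-question score variance.
ASSUMED_PER_QUESTION_SD = 25.0

for n in (2, 10, 50):
    swing = day_to_day_swing_sd(ASSUMED_PER_QUESTION_SD, n)
    print(f"{n:2d} questions per dimension: swing SD = {swing:.1f} points")
```

Under these assumptions the typical day-to-day swing is 25 points with 2 questions, about 11 points with 10 questions, and 5 points with 50, so a 31.4-point move on a 2-question draw sits only slightly beyond one standard deviation.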

Real Degradation or Luck of the Draw

The data so far covers a single day and cannot support a conclusion that the model's capability has degraded. Meanwhile, material compliance and task expression both reached 100.00 for the day, suggesting no systemic problem with constraint adherence or clarity of expression. The integrity rating remains pass, which also argues against obvious violations or a surge in hallucinations.

Only if low scores appear repeatedly in the same dimension over consecutive days could it point to model updates or post-training artifacts. A single Smoke result is closer to a lottery draw than a capability health check.

Is This Worth Watching Closely

From a stability perspective, the single-day drop of 31.4 points suggests that Grok 4 still has room to improve its consistency on code execution tasks, but a one-day swing says little about accuracy itself. The main score of 82.73 is still higher than the daily average of most similar models and has not triggered a sustained warning threshold.

It is recommended to extend the observation window to more than 7 days before judging whether there is a structural decline. For now, no conclusion should be drawn on Grok 4's overall capability.
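The recommendation above can be sketched as a simple rule: withhold judgment until more than 7 days of data exist, then look for a run of consecutive low days in the same dimension. The function name, the `low_threshold` value, and the 3-day streak length are hypothetical choices for illustration; the article does not state the actual warning threshold.

```python
from typing import Sequence


def assess_dimension(
    scores: Sequence[float],
    low_threshold: float,
    min_days: int = 8,        # "more than 7 days" of observations
    min_low_streak: int = 3,  # assumed length of a "sustained" low run
) -> str:
    """Classify one dimension's daily scores, oldest first."""
    if len(scores) < min_days:
        return "insufficient_data"

    streak = longest = 0
    for score in scores:
        # Extend the run on a low day, reset it otherwise.
        streak = streak + 1 if score < low_threshold else 0
        longest = max(longest, streak)

    return "sustained_low" if longest >= min_low_streak else "no_sustained_low"


# The two code execution scores reported in this article.
print(assess_dimension([100.00, 68.60], low_threshold=80.0))
```

With only the two reported days, the rule returns `insufficient_data`, which matches the article's position that no conclusion should be drawn yet.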

A sharp swing in a single Smoke quick test often reveals question variance, not the ultimate upper limit of the model.

Data source: YZ Index | Run #206
