In today's YZ Index Smoke evaluation, Grok 4's main score dropped from 97.98 to 82.73, a decrease of 15.25 points, and its code execution score fell from 100.00 to 68.60.
Fluctuation Range of a Single-Day 10-Question Draw
The Smoke evaluation draws only 2 questions per dimension per day, 10 questions in total. Code execution lost 31.4 points in a single day, while material compliance rose from 95.50 to 100.00 and task expression rose from 91.30 to 100.00. Dimensions moving in opposite directions on the same day is consistent with the statistics of small-sample draws. Engineering judgment dropped from 92.40 to 77.20, moving in the same direction as code execution but by a smaller margin (15.2 points).
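The day-over-day changes quoted above can be checked with a few lines of arithmetic. The sketch below uses only the scores reported in this article; the dictionary layout and the function name `deltas` are illustrative choices, not part of YZ Index.

```python
# (yesterday, today) scores per dimension, as reported in this article.
scores = {
    "code execution": (100.00, 68.60),
    "material compliance": (95.50, 100.00),
    "task expression": (91.30, 100.00),
    "engineering judgment": (92.40, 77.20),
}

def deltas(table):
    """Return today's score minus yesterday's score for each dimension."""
    return {name: round(today - prev, 2) for name, (prev, today) in table.items()}

changes = deltas(scores)

# List dimensions from the largest drop to the largest gain.
for name, change in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {change:+.2f}")
```

Two dimensions fall and two rise on the same day, which is the mixed pattern the paragraph describes.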
When the drawn questions include code problems that require multi-step debugging or specific library calls, single-day scores are prone to swings of more than 30 points. With only 2 questions in the sample, the gap between yesterday's 100.00 and today's 68.60 is not unusual.
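To see why a 2-question sample swings so widely, it helps to enumerate every possible draw. The sketch below is purely illustrative: the per-question score distribution `question_dist` is an assumption made up for this example, not YZ Index data, and the questions are assumed to be independent and averaged with equal weight.

```python
import math
from itertools import product

# ASSUMPTION: hypothetical per-question score distribution (score -> probability).
# These numbers are invented for illustration and do not come from YZ Index.
question_dist = {100.0: 0.80, 60.0: 0.15, 20.0: 0.05}

def mean_and_sd(dist):
    """Mean and standard deviation of a discrete score distribution."""
    mean = sum(score * p for score, p in dist.items())
    var = sum(p * (score - mean) ** 2 for score, p in dist.items())
    return mean, math.sqrt(var)

def daily_score_dist(dist, n_questions):
    """Exact distribution of the average of n independent questions."""
    out = {}
    for combo in product(dist.items(), repeat=n_questions):
        avg = sum(score for score, _ in combo) / n_questions
        prob = math.prod(p for _, p in combo)
        out[avg] = out.get(avg, 0.0) + prob
    return out

per_question_mean, per_question_sd = mean_and_sd(question_dist)
two_question_day = daily_score_dist(question_dist, 2)

# Probability that a single day's dimension score lands at 70 or below.
p_low_day = sum(p for avg, p in two_question_day.items() if avg <= 70.0)

# The standard deviation of a daily average shrinks with the square root of the sample size.
sd_two = per_question_sd / math.sqrt(2)
sd_twenty = per_question_sd / math.sqrt(20)

print(f"expected score: {per_question_mean:.1f}")
print(f"P(daily score <= 70) with 2 questions: {p_low_day:.2%}")
print(f"sd of daily score: {sd_two:.1f} (2 questions) vs {sd_twenty:.1f} (20 questions)")
```

Under these invented numbers, a model averaging 90 per question would still post a dimension score of 70 or lower on roughly one day in eight, with no change in capability.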
Real Degradation or Luck of the Draw
The current data reflects a single day's performance and cannot support a conclusion that the model's capability has degraded. On the contrary, material compliance and task expression both reached new highs today, which suggests the model has no systemic problem with constraint adherence or clarity of expression. The integrity rating remains a pass, which also rules out obvious violations or a surge in hallucinations.
Only repeated low scores in the same dimension over consecutive days would point to a model update or a post-training artifact. A single Smoke result is closer to a lottery draw than to a capability health check.
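The "consecutive days in the same dimension" criterion can be sketched as a simple streak check. Everything below is an assumption for illustration: the threshold, the minimum streak length, and every history value other than the reported 100.00 and 68.60 are hypothetical and are not YZ Index settings or data.

```python
LOW_THRESHOLD = 80.0  # ASSUMPTION: cutoff for a "low" dimension score
MIN_STREAK = 3        # ASSUMPTION: consecutive low days that warrant a closer look

def longest_low_streak(daily_scores, threshold):
    """Length of the longest run of consecutive days scoring below threshold."""
    longest = current = 0
    for score in daily_scores:
        current = current + 1 if score < threshold else 0
        longest = max(longest, current)
    return longest

def needs_investigation(daily_scores):
    """True when low scores persist for at least MIN_STREAK consecutive days."""
    return longest_low_streak(daily_scores, LOW_THRESHOLD) >= MIN_STREAK

# Hypothetical code execution histories; only 100.00 and 68.60 are reported figures.
one_off_dip = [100.00, 68.60, 96.0, 100.0, 94.5]
sustained_dip = [100.00, 68.60, 71.0, 66.5, 70.2]

print(needs_investigation(one_off_dip))
print(needs_investigation(sustained_dip))
```

A one-day dip followed by recovery does not trip the check, while the same dip followed by further low days does.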
Whether Close Attention Is Warranted
From a stability perspective, the single-day drop of 31.4 points suggests that Grok 4 still has room to improve its consistency on code execution tasks, but this is a question of consistency, not evidence about accuracy itself. The main score of 82.73 is still higher than the daily average of most comparable models and has not triggered the sustained warning threshold.
We recommend extending the observation window to more than 7 days before judging whether there is a structural decline. For now, no conclusion should be drawn about Grok 4's overall capability.
A sharp swing in a single Smoke quick test more often reveals question variance than the model's capability ceiling.
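One way to make the "more than 7 days" rule operational is to withhold judgment until enough days have accumulated and only then compare the window average with a baseline. This is a minimal sketch under stated assumptions: the baseline, the decline margin, and all scores other than the reported 97.98 and 82.73 are hypothetical.

```python
from statistics import mean

MIN_DAYS = 8          # "more than 7 days", per the recommendation above
DECLINE_MARGIN = 5.0  # ASSUMPTION: how far below baseline the window mean must fall

def assess_window(daily_main_scores, baseline):
    """Return 'insufficient data', 'stable', or 'possible decline'."""
    if len(daily_main_scores) < MIN_DAYS:
        return "insufficient data"
    window_mean = mean(daily_main_scores[-MIN_DAYS:])
    if window_mean < baseline - DECLINE_MARGIN:
        return "possible decline"
    return "stable"

BASELINE = 95.0  # ASSUMPTION: hypothetical long-run main score

# Only 97.98 and 82.73 are reported figures; every other value is made up.
two_days = [97.98, 82.73]
recovering = [97.98, 82.73, 95.1, 96.4, 93.8, 97.2, 94.9, 96.0]
sliding = [97.98, 82.73, 84.0, 81.5, 83.2, 80.9, 82.1, 81.7]

for label, history in [("two days", two_days), ("recovering", recovering), ("sliding", sliding)]:
    print(label, "->", assess_window(history, BASELINE))
```

With only two days of data, which is the situation today, the function declines to classify the trend at all.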
Data source: YZ Index | Run #206
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.