In today's Smoke evaluation on the YZ Index, Grok 4's material constraint score dropped from 83.00 to 61.30, a decline of 21.7 points, while its code execution score rose from 80.90 to 100.00, a gain of 19.1 points.
Daily Score Comparison
Comparing yesterday's and today's data, Grok 4's engineering judgment rose from 55.00 to 63.50, task expression dropped from 93.00 to 86.50, the overall leaderboard score rose from 81.85 to 82.59, and the integrity rating remained pass. The material constraint drop was the largest single-dimension change, followed closely in size by the gain in code execution, which reached a perfect score.
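The day-over-day changes can be checked with simple subtraction. The following is a minimal Python sketch that uses only the figures quoted in this report; the names `scores` and `deltas` are illustrative and are not part of the YZ Index.

```python
# (yesterday, today) pairs as quoted in the report.
scores = {
    "material constraint": (83.00, 61.30),
    "code execution": (80.90, 100.00),
    "engineering judgment": (55.00, 63.50),
    "task expression": (93.00, 86.50),
    "leaderboard": (81.85, 82.59),
}

def deltas(table):
    """Return today-minus-yesterday for each row, rounded to 2 decimals."""
    return {name: round(today - prev, 2) for name, (prev, today) in table.items()}

changes = deltas(scores)

# List rows from the largest absolute change to the smallest.
for name, change in sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:22s} {change:+7.2f}")
```

Sorting by absolute change puts material constraint (-21.70) first and code execution (+19.10) second, with the leaderboard score (+0.74) last.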
Analysis of Fluctuation Sources
The Smoke evaluation includes only 10 questions per day, with 2 questions per dimension, so each daily dimension score rests on a very small sample and is subject to considerable sampling noise. The material constraint plunge may be due to today's questions requiring stricter source attribution or fact anchoring, which would cause the model's responses to contain more unconstrained content and lower the score. The perfect score in code execution indicates that the model earned full marks on the questions sampled today, in sharp contrast to yesterday's 80.90.
There is currently not enough multi-day data on the same dimension to attribute this change to real model degradation. A single-day drop of 21.7 points is more consistent with sampling fluctuation in the fast-test framework than with a systematic decline in capability.
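To illustrate why two questions per dimension produce large day-to-day swings, here is a rough Python sketch. It assumes each daily dimension score is the plain mean of independent question scores with a common spread; the report does not describe the actual scoring method, and the per-question standard deviation used below is a hypothetical value chosen only for illustration.

```python
import math

def day_to_day_sd(per_question_sd, questions_per_dimension=2):
    """Standard deviation of the difference between two daily dimension scores.

    Assumes each daily score is the mean of independent question scores
    with the same spread (an assumption, not documented YZ Index behavior).
    """
    daily_sd = per_question_sd / math.sqrt(questions_per_dimension)
    # The difference of two independent daily scores has sqrt(2) times the spread.
    return math.sqrt(2) * daily_sd

HYPOTHETICAL_SD = 20.0  # illustrative per-question spread, not a measured value
swing = day_to_day_sd(HYPOTHETICAL_SD)
observed_drop = 83.00 - 61.30  # material constraint, from the report

print(f"expected day-to-day sd: {swing:.1f} points")
print(f"observed drop in sd units: {observed_drop / swing:.2f}")
```

Under these assumptions, with 2 questions per dimension the day-to-day difference is as noisy as a single question's score, so a swing of about 20 points would not be unusual. Whether that holds for the real data depends on the actual per-question spread, which the report does not give.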
Need for Ongoing Attention
The leaderboard score rose by only 0.74 points, because the large decline in material constraint was mostly offset by the gain in code execution. In the short term, this anomaly has limited impact on overall ranking, but if material constraint remains around 61 in the next evaluation, it may indicate a sustained shift in prompt comprehension or context constraint ability.
The smaller moves in engineering judgment (+8.50) and task expression (-6.50) remain within the normal range, the integrity rating stays at pass, and no thresholds were triggered.
The single-day 21.7-point drop in material constraint is a reminder that Smoke fast tests are better suited to capturing immediate status than to supporting long-term capability conclusions.
If material constraint remains below 70 in subsequent evaluations, it is recommended to switch to multi-day aggregated data for capability assessment.
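The recommendation above can be expressed as a small check. This is a sketch under stated assumptions: the function names are hypothetical, "continues" is read as two consecutive scores below the threshold, and a plain mean stands in for whatever aggregation the YZ Index would actually use. Only the threshold of 70 and the two scores come from the report.

```python
from statistics import mean

THRESHOLD = 70.0  # cutoff quoted in the report

def should_aggregate(daily_scores, threshold=THRESHOLD, runs=2):
    """True when the latest `runs` scores are all below `threshold`.

    `runs=2` is an assumed reading of "remains below 70".
    """
    recent = daily_scores[-runs:]
    return len(recent) == runs and all(s < threshold for s in recent)

def aggregated_score(daily_scores):
    """Plain mean over the supplied days (one possible aggregation)."""
    return mean(daily_scores)

material_constraint = [83.00, 61.30]  # yesterday, today (from the report)
print(should_aggregate(material_constraint))
print(f"{aggregated_score(material_constraint):.2f}")
```

With only the two days reported so far, the check returns `False`, since yesterday's 83.00 is above the threshold. It would flip to `True` only if the next evaluation also scores below 70.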
Data source: YZ Index | Run #176
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article.