In today's YZ Index Smoke evaluation, Grok 4's Material Constraint score dropped from 96.70 to 71.10, a decrease of 25.6 points, while Code Execution rose from 66.70 to 100 and the overall leaderboard score rose from 80.20 to 87.
Multi-Dimensional Sharp Fluctuations Point to Sampling Factors
The Smoke evaluation has only 10 questions per day, 2 per dimension, so daily scores naturally have a large standard deviation. Today, in addition to the Material Constraint drop, Grok 4's Code Execution rose by 33.3 points, Task Expression by 31.2 points, and Engineering Judgment by 12.5 points. Four dimensions moved by more than 12 points on the same day, far more than a normal model iteration would produce. A broad swing like this, with gains in some dimensions and a loss in another, is more consistent with sample variance from random question selection than with a systematic degradation of model capability.
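To give a rough sense of why two questions per dimension produce noisy daily scores, the sketch below applies the standard formula for the standard deviation of a mean of independent samples. It assumes that a dimension score is the plain average of independent per-question scores on a 0 to 100 scale, which the YZ Index does not confirm here, and the per-question standard deviation of 25 points is a purely hypothetical value chosen for illustration.

```python
import math


def daily_score_sd(per_question_sd: float, n_questions: int) -> float:
    """Standard deviation of the mean of n independent question scores.

    Assumes the dimension score is a simple average of independent
    per-question scores (an assumption, not a documented YZ Index rule).
    """
    if n_questions < 1:
        raise ValueError("n_questions must be at least 1")
    return per_question_sd / math.sqrt(n_questions)


# Hypothetical per-question spread of 25 points (illustrative only).
HYPOTHETICAL_QUESTION_SD = 25.0

# Smoke evaluation: 2 questions per dimension.
sd_smoke = daily_score_sd(HYPOTHETICAL_QUESTION_SD, 2)

# A hypothetical larger run with 20 questions per dimension, for comparison.
sd_larger = daily_score_sd(HYPOTHETICAL_QUESTION_SD, 20)

print(f"2 questions:  SD of daily score is about {sd_smoke:.1f} points")
print(f"20 questions: SD of daily score is about {sd_larger:.1f} points")
```

Under these assumptions the two-question score has a standard deviation of roughly 17.7 points, against roughly 5.6 points for twenty questions, so day-to-day moves of 12 points or more would not be surprising even with an unchanged model.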
Specific Manifestations of the Material Constraint Decline
The Material Constraint dimension dropped from 96.70 to 71.10 points, suggesting that on today's two grounding questions the model showed significant factual deviation or fabricated information. Given the perfect Code Execution score, Grok 4 still performs at a high level on pure logical reasoning tasks; the problem appears to be concentrated in scenarios that require grounding in external knowledge.
Is Continued Attention Needed?
A single day's fluctuation in the Smoke quick test does not indicate a long-term trend. Grok 4's overall leaderboard score actually rose by 6.8 points and its integrity rating remains "pass", which suggests that core capabilities are unaffected. We recommend collecting at least three consecutive days of data before judging whether Material Constraint has entered a genuine downward trend. For now, there is no need to downgrade the assessment of the model's overall capability.
If Material Constraint stays below 80 points over the next three days, that may reflect a phased adjustment by xAI to its knowledge-update or alignment strategy; otherwise, today's drop can be treated as normal sampling noise.
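The follow-up rule described above can be written as a small check. This is a minimal sketch, not part of the YZ Index tooling: the function name, the return labels, and the example scores are all hypothetical, and only the 80-point threshold and the three-day window come from the text.

```python
from typing import Sequence


def classify_material_constraint(
    followup_scores: Sequence[float],
    threshold: float = 80.0,
    window: int = 3,
) -> str:
    """Apply the three-day rule to scores from the days after the drop.

    Returns "insufficient data" until `window` days are available,
    "possible real decline" if every one of the last `window` scores is
    below `threshold`, and "sampling noise" otherwise.
    """
    if len(followup_scores) < window:
        return "insufficient data"
    recent = followup_scores[-window:]
    if all(score < threshold for score in recent):
        return "possible real decline"
    return "sampling noise"


# Hypothetical follow-up scores, for illustration only.
print(classify_material_constraint([75.0]))              # insufficient data
print(classify_material_constraint([75.0, 78.5, 79.9]))  # possible real decline
print(classify_material_constraint([75.0, 92.0, 79.9]))  # sampling noise
```

A single recovery above 80 points within the window is enough for this rule to treat the drop as noise, which matches the wording "remains below 80 points for the next three days".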
Data source: YZ Index | Run #186
© 2026 Winzheng.com 赢政天下 | Please credit the source and link to the original article when republishing.