Grok 4 turned in a sharply split performance in today's Smoke evaluation: its material constraint score fell from 80.30 yesterday to 59.00, a single-day decline of 21.3 points, while code execution jumped from 50 to 100, lifting the main ranking total from 63.64 to 81.55.
The Smoke evaluation uses only 10 questions per day, 2 per dimension, so some randomness from question sampling is unavoidable. Even so, a 21.3-point drop is well beyond the standard deviation of this model's material constraint scores over the past seven days. Historical data shows that Grok 4's daily fluctuation in this dimension is typically within ±8 points, which makes today's change an anomaly.
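The comparison above can be made concrete in a few lines of Python. This is a minimal sketch, not the YZ Index's actual method: the function names are hypothetical, and the 8-point band is the typical daily fluctuation quoted in the article, not a standard deviation computed from raw scores.

```python
# Illustrative sketch only: function names are hypothetical, and the 8-point
# band is the "typical daily fluctuation" quoted in the article, not a
# standard deviation computed from raw scores.

def daily_change(prev: float, curr: float) -> float:
    """Day-over-day change in a dimension score, rounded to 2 decimals."""
    return round(curr - prev, 2)


def exceeds_band(prev: float, curr: float, band: float = 8.0) -> bool:
    """True when the absolute daily change is larger than the typical band."""
    return abs(curr - prev) > band


if __name__ == "__main__":
    delta = daily_change(80.30, 59.00)
    print(delta)                       # -21.3
    print(exceeds_band(80.30, 59.00))  # True
```

With the article's figures, the change is roughly 2.7 times the typical band, which is why it reads as an anomaly and not as routine noise.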
Random Question Selection or Capability Degradation
The first possibility is ordinary fluctuation from question selection. Today's two material constraint questions may have involved stricter citation boundaries or conflicting multi-turn instructions, causing Grok 4 to over-generate or ignore constraints. The second possibility is genuine model degradation. xAI has recently applied multiple weight updates to Grok 4, with a strong focus on code and tool invocation capabilities. That focus is consistent with today's code execution score of 100, but it may have come at the cost of some material constraint adherence during alignment training.
Looking at industry dynamics over the past two weeks, xAI is rapidly pushing Grok 4 toward the enterprise API market, positioning it as "high throughput + tool chain." Similar priority adjustments have occurred multiple times with other models: when a team allocates more gradient updates to new capabilities, old constraints often experience short-term weakening.
Whether Continued Monitoring Is Needed
For now, the assessment is "worth tracking but not yet a cause for alarm." The material constraint dimension directly affects the model's usability in scenarios such as enterprise knowledge bases and compliance documents. If the dimension is still below 65 points next week, the drop should be treated as systematic degradation. Conversely, if it rebounds above 75 points tomorrow or the day after, the current drop can largely be attributed to today's more difficult questions.
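The tracking rule above can be written down as a small decision function. This is a sketch under stated assumptions: the function name and return labels are hypothetical, "next week" is read as seven consecutive daily scores all staying below 65, and "tomorrow or the day after" is read as either of the first two follow-up scores exceeding 75. Only the 65 and 75 thresholds come from the article.

```python
# Illustrative sketch only. Thresholds (65 and 75) come from the article;
# the function name, labels, and the 7-day reading of "next week" are
# assumptions made for this example.

def classify_followup(scores: list[float]) -> str:
    """Classify the drop from follow-up daily scores, oldest first."""
    # A rebound above 75 within the next two days points to hard questions.
    if any(s > 75 for s in scores[:2]):
        return "likely question difficulty"
    # A full week of scores below 65 points to systematic degradation.
    if len(scores) >= 7 and all(s < 65 for s in scores[:7]):
        return "systematic degradation"
    # Anything else is inconclusive for now.
    return "keep tracking"
```

For instance, a single follow-up score of 78 would return "likely question difficulty", while seven straight scores of 60 would return "systematic degradation". These inputs are made up for illustration and are not evaluation results.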
It is worth noting that Grok 4's engineering judgment dimension held steady at 50 points today, while task expression dropped from 50 to 30 points. With one of these two side metrics weakening and the other flat, the picture is consistent with, though not proof of, constraint capabilities having been affected by overall alignment intensity.
Enterprise users of Grok 4 should increase manual review in material-intensive tasks, especially those involving contracts, policies, and internal knowledge extraction. In the short term, keep watching the seesaw relationship between code execution and material constraints; only a simultaneous rise in both would be a truly stable signal.
When a model lets its material constraints plunge 21 points in pursuit of a perfect code score, it is actually showing you the real cost of alignment: it prefers to be a tool, not an archivist.
Data source: YZ Index | Run #128
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article