Grok 4 Code Execution Plunges 19.1 Points, Main Ranking Drops 7.7 – Sampling or Degradation?

In the June 2026 YZ Index test of 11 models, Grok 4's code execution score on the Smoke evaluation dropped from 100.00 yesterday to 80.90 today, and its overall main-ranking score fell from 89.56 to 81.85.

Inherent Variation in Small-Sample Quick Tests

The Smoke evaluation has only 10 questions per day, 2 per dimension. The daily standard deviation for code execution typically ranges from 8 to 12 points, so a 19.1-point drop is roughly 1.6 to 2.4 standard deviations, at the upper edge of normal variation. Meanwhile, Material Constraints rose from 76.80 to 83.00 and Task Expression rose from 90.50 to 93.00, indicating no systemic collapse in overall model output.
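The comparison against normal variation can be checked with a few lines of arithmetic. The sketch below only divides the observed drop by the two ends of the stated 8 to 12 point range; the function name `drop_in_sd` is illustrative and not part of any YZ Index tooling.

```python
# Figures taken from the article: yesterday's and today's code execution scores,
# and the stated typical daily standard deviation range (8 to 12 points).
drop = 100.00 - 80.90
sd_low, sd_high = 8.0, 12.0


def drop_in_sd(drop_points, sd):
    """Express a score drop as a multiple of the daily standard deviation."""
    return drop_points / sd


z_at_high_sd = drop_in_sd(drop, sd_high)  # most forgiving case (SD = 12)
z_at_low_sd = drop_in_sd(drop, sd_low)    # strictest case (SD = 8)

print(f"{drop:.1f}-point drop = {z_at_high_sd:.2f} to {z_at_low_sd:.2f} SD")
```

With a standard deviation of 12 the drop stays under 2 standard deviations; with 8 it exceeds 2, which is why a single day cannot settle the question.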

Concurrent Sharp Drop in Engineering Judgment

Engineering Judgment fell from 88.00 to 55.00, a 33-point drop that far exceeds the code execution decline. With two side-benchmark dimensions falling sharply on the same day, the more likely explanation is that today's questions demanded longer reasoning chains and more multi-step verification, not that code generation capability itself was impaired.

Assessment of True Degradation Probability

If the model had truly degraded, Material Constraints would typically decline as well. However, Material Constraints actually rose by 6.2 points, and the Integrity Rating remained at "pass," indicating that outputs remained constrained and no surge in hallucinations occurred. The available data support variance due to question sampling rather than a degradation in underlying capability.
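To see the mixed picture at a glance, the day-over-day changes quoted in this article can be tabulated. This is a minimal sketch using only the scores stated above; the dictionary layout is an assumption for illustration, not the YZ Index data format.

```python
# (yesterday, today) score pairs as quoted in the article.
scores = {
    "Main ranking": (89.56, 81.85),
    "Code Execution": (100.00, 80.90),
    "Engineering Judgment": (88.00, 55.00),
    "Material Constraints": (76.80, 83.00),
    "Task Expression": (90.50, 93.00),
}

# Day-over-day change per dimension, rounded to two decimals.
deltas = {
    name: round(today - yesterday, 2)
    for name, (yesterday, today) in scores.items()
}

for name, delta in deltas.items():
    print(f"{name:<22}{delta:+.2f}")

# Dimensions that moved up while others fell.
improved = [name for name, delta in deltas.items() if delta > 0]
```

Two dimensions improved while two fell, which is the pattern the sampling explanation predicts and a uniform capability loss would not.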

Need for Continued Monitoring?

A single day of Smoke data is not enough to establish degradation. We recommend tracking the moving average of code execution and engineering judgment over at least three consecutive days. A full 10-question retest should be initiated only if both dimensions stay more than 2 standard deviations below their prior mean. For now, the drop is recorded as an anomaly and is not added to the priority tracking list.
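One way to read this monitoring rule is sketched below. It assumes a per-dimension baseline mean and standard deviation are available, which the article does not supply, so the baselines and the first value in each score list are hypothetical placeholders. Only the last two values per dimension (yesterday and today) come from the article. The function name `needs_retest` and its parameters are likewise illustrative.

```python
from statistics import mean


def needs_retest(recent_scores, baseline_mean, baseline_sd, window=3, k=2.0):
    """Return True only if every dimension's moving average over the last
    `window` days sits more than `k` standard deviations below its baseline."""
    for dim, scores in recent_scores.items():
        if len(scores) < window:
            return False  # not enough days observed yet
        moving_avg = mean(scores[-window:])
        threshold = baseline_mean[dim] - k * baseline_sd[dim]
        if moving_avg >= threshold:
            return False  # this dimension is still within normal range
    return True


# Hypothetical inputs: the first score in each list and both baselines are
# placeholders; the last two scores per dimension are from the article.
recent = {
    "code_execution": [95.0, 100.00, 80.90],
    "engineering_judgment": [85.0, 88.00, 55.00],
}
baseline_mean = {"code_execution": 95.0, "engineering_judgment": 85.0}
baseline_sd = {"code_execution": 10.0, "engineering_judgment": 10.0}

print(needs_retest(recent, baseline_mean, baseline_sd))
```

Requiring every dimension to breach its threshold, on an average rather than a single day, keeps one unlucky question draw from triggering a retest.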

A 19.1-point plunge is more likely the result of a 10-question draw than a collapse of the model itself.

Data source: YZ Index | Run #170