Grok 4 Material Constraint Plummets 21.7 Points, Code Execution Rises to 100

Jun 15, 2026 | Winzheng Index

Grok 4 | Material Constraints | Smoke Test | Single-Day Fluctuation | Main Leaderboard Score

In today's Smoke evaluation on the YZ Index, Grok 4's material constraint score fell from 83.00 to 61.30, a decline of 21.7 points, while its code execution score rose from 80.90 to 100.00, a gain of 19.1 points.

Daily Score Comparison

Compared with yesterday, Grok 4's engineering judgment rose from 55.00 to 63.50 (+8.5), task expression fell from 93.00 to 86.50 (-6.5), the overall leaderboard score rose from 81.85 to 82.59 (+0.74), and the integrity rating remained at pass. The 21.7-point drop in material constraint was the largest move in any dimension, while code execution reached a perfect score.
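The comparison above is simple arithmetic on the reported scores. As a minimal sketch, the snippet below recomputes each change from the figures quoted in this article. The dictionary keys and function names are illustrative and are not taken from the YZ Index.

```python
# Scores as reported in this article: (yesterday, today).
# The key names are illustrative labels, not YZ Index identifiers.
SCORES = {
    "material_constraint": (83.00, 61.30),
    "code_execution": (80.90, 100.00),
    "engineering_judgment": (55.00, 63.50),
    "task_expression": (93.00, 86.50),
    "overall_leaderboard": (81.85, 82.59),
}


def score_deltas(scores):
    """Return today's score minus yesterday's for each entry, rounded to 2 decimals."""
    return {
        name: round(today - yesterday, 2)
        for name, (yesterday, today) in scores.items()
    }


def largest_move(deltas):
    """Return the name of the entry with the largest absolute change."""
    return max(deltas, key=lambda name: abs(deltas[name]))


deltas = score_deltas(SCORES)
for name, delta in deltas.items():
    print(f"{name:22s} {delta:+.2f}")
print("largest move:", largest_move(deltas))
```

Rounding to two decimals keeps floating-point noise out of the printed changes.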

Analysis of Fluctuation Sources

The Smoke evaluation includes only 10 questions per day, 2 per dimension, so each dimension score is highly sensitive to which questions are sampled. The material constraint plunge may stem from today's questions requiring stricter source attribution or fact anchoring, which would cause more of the model's responses to contain unconstrained content and lower the score. The perfect code execution score means the model achieved full marks on the questions sampled today, in sharp contrast with yesterday's 80.90.

There is currently not enough multi-day data for this dimension to attribute the change to real model degradation. A single-day drop of 21.7 points is more consistent with sampling fluctuation in the fast-test framework than with a systematic decline in capability.
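A rough way to see why two questions per dimension produce large swings is to look at the standard error of a small-sample mean. The sketch below assumes per-question scores are independent with a common standard deviation. The value `ASSUMED_SIGMA = 15.0` is purely hypothetical and is not a YZ Index statistic; only the question count and the observed drop come from the article.

```python
import math


def daily_mean_se(sigma, n_questions):
    """Standard error of one day's dimension score (mean of n independent questions)."""
    return sigma / math.sqrt(n_questions)


def day_to_day_sd(sigma, n_questions):
    """Standard deviation of the difference between two independent daily means."""
    return sigma * math.sqrt(2.0 / n_questions)


ASSUMED_SIGMA = 15.0         # hypothetical per-question spread, NOT from the article
QUESTIONS_PER_DIMENSION = 2  # from the article: 10 questions, 2 per dimension
OBSERVED_DROP = 21.7         # from the article: 83.00 -> 61.30

se_one_day = daily_mean_se(ASSUMED_SIGMA, QUESTIONS_PER_DIMENSION)
sd_change = day_to_day_sd(ASSUMED_SIGMA, QUESTIONS_PER_DIMENSION)

print(f"standard error of one daily score: {se_one_day:.1f}")
print(f"standard deviation of a day-to-day change: {sd_change:.1f}")
print(f"observed drop in units of that deviation: {OBSERVED_DROP / sd_change:.2f}")

# Pooling a week of runs (7 days x 2 questions) shrinks the noise considerably.
print(f"same quantity with 14 questions: {day_to_day_sd(ASSUMED_SIGMA, 14):.1f}")
```

Under this assumed spread, a swing of about 20 points between two consecutive days would not be unusual. A different assumed sigma changes the numbers but not the point that noise shrinks only with the square root of the question count.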

Need for Ongoing Attention

The leaderboard score rose by only 0.74 points, as the large decline in material constraint was offset by gains in code execution and engineering judgment. In the short term, this anomaly has limited impact on the overall ranking, but if material constraint remains around 61 in the next evaluation, it may indicate a lasting shift in prompt comprehension or context constraint ability.
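The article does not state how the leaderboard score is derived from the individual dimensions. As a quick consistency check under that caveat, the sketch below compares an unweighted mean of the four reported dimension scores with the reported leaderboard change. The equal weighting is an assumption made only for this comparison.

```python
# Dimension scores as reported in this article.
YESTERDAY = {
    "material_constraint": 83.00,
    "code_execution": 80.90,
    "engineering_judgment": 55.00,
    "task_expression": 93.00,
}
TODAY = {
    "material_constraint": 61.30,
    "code_execution": 100.00,
    "engineering_judgment": 63.50,
    "task_expression": 86.50,
}


def plain_mean(scores):
    """Unweighted mean of the dimension scores (an assumed weighting, for comparison only)."""
    return sum(scores.values()) / len(scores)


mean_change = plain_mean(TODAY) - plain_mean(YESTERDAY)
reported_change = 82.59 - 81.85  # leaderboard scores quoted in the article

print(f"change in unweighted mean of four dimensions: {mean_change:+.2f}")
print(f"reported leaderboard change:                  {reported_change:+.2f}")
```

The two figures differ in sign, which suggests the leaderboard is not a plain average of these four numbers. The actual weights, and any contribution from a dimension not quoted here, remain unknown.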

The smaller moves in engineering judgment (+8.5) and task expression (-6.5) remain within the normal range, the integrity rating stays at pass, and no thresholds were triggered.

The single-day 21.7-point drop in material constraint is a reminder that Smoke fast tests are better suited to capturing immediate status than to supporting long-term capability conclusions.

If material constraint stays below 70 in subsequent evaluations, it is recommended to switch to multi-day aggregated data for capability assessment.
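One possible way to apply this recommendation is sketched below. The threshold of 70 comes from the article; the seven-day window and the rule of two consecutive low days are illustrative assumptions, and the function names are hypothetical. Only the two published material constraint scores are used as input.

```python
from statistics import mean

THRESHOLD = 70.0  # from the article's recommendation


def rolling_mean(scores, window):
    """Mean of the most recent `window` daily scores (all of them if fewer exist)."""
    if not scores:
        raise ValueError("need at least one daily score")
    return mean(scores[-window:])


def needs_aggregated_review(scores, threshold=THRESHOLD, consecutive=2):
    """True if the latest `consecutive` daily scores are all below the threshold.

    `consecutive=2` is an assumed reading of "subsequent evaluations".
    """
    if len(scores) < consecutive:
        return False
    return all(score < threshold for score in scores[-consecutive:])


# Material constraint scores reported so far: yesterday, then today.
material_constraint = [83.00, 61.30]

print(f"aggregated score: {rolling_mean(material_constraint, window=7):.2f}")
print("switch to aggregated review:", needs_aggregated_review(material_constraint))
```

Appending each new daily score to the list and rerunning the check keeps the decision tied to several runs rather than to a single two-question sample.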

Data source: YZ Index | Run #176
