Material Constraint Plunges by Up to 39 Points, All 11 Models on YZ Index Main Leaderboard Decline in This Dimension

Jun 15, 2026 | Source: Winzheng Index

Tags: Material Constraint, Grok 4, Smoke Lite Test, Main Leaderboard Fluctuation, Perfect Execution Score

In the Smoke Lite evaluation on June 15, 2026, Grok 4 ranked first among the 11 models with a main score of 82.59 (execution 100, constraint 61.3, rated pass), but its Material Constraint score fell 21.7 points from the previous day.

Constraint Shortcomings of Models with Perfect Execution Scores

The top eight models all scored a perfect 100 in the execution dimension, while their Material Constraint scores fell between 51.3 and 61.3. Grok 4, Doubao Pro, GPT-5.5, and Qwen3 Max posted constraint scores of 61.3, 60.8, 60.8, and 60.3 respectively, giving main leaderboard scores of 82.59, 82.36, 82.36, and 82.14. Claude Opus 4.7 had a constraint score of 59.3, a main score of 81.69, and an integrity rating of warn.

The bottom three models scored only 50 in execution: Gemini 2.5 Pro had a main score of 53.38 (constraint 57.5), Gemini 3.1 Pro had a main score of 53.06 (constraint 56.8), and ERNIE Bot 4.5 had a main score of 50.59 (constraint 51.3). Because constraint scores were similar in both groups, the 50-point difference in execution accounts for most of the roughly 30-point gap between the top eight and the bottom three on the main leaderboard.

Sharp Fluctuations Compared to Yesterday

Compared with yesterday's data, Gemini 3.1 Pro's main score dropped by 39.4 points, execution by 47.5 points, and constraint by 29.5 points. Qwen3 Max's main score rose by 29.3 points, but its constraint dropped by 30.7 points. Doubao Pro's main score rose by 23.1 points, with constraint dropping by 24 points. DeepSeek V4 Pro's main score rose by 16.2 points, with constraint dropping by 39.2 points. Gemini 2.5 Pro's main score dropped by 17.2 points.

Multiple drops of over 30 points occurred in the Material Constraint dimension: Claude Sonnet 4.6 fell by 38.7 points, Claude Opus 4.7 by 38 points, DeepSeek V4 Pro by 39.2 points, and ERNIE Bot 4.5 by 32.5 points. For models that maintained an execution score of 100, the decline in constraint directly pulled down the main leaderboard score.
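The day-over-day figures above can be cross-checked against the core_overall weights the article states (0.55 for execution, 0.45 for constraint). The sketch below is illustrative only: the function names are hypothetical, and it assumes the same weights applied on both days and that the reported deltas are rounded to one decimal. Under those assumptions, a change in the main score splits linearly into an execution part and a constraint part, so the one unreported delta can be estimated from the other two.

```python
# Weights taken from the article's core_overall formula.
W_EXECUTION, W_CONSTRAINT = 0.55, 0.45


def main_delta(d_execution: float, d_constraint: float) -> float:
    """Change in main score implied by changes in the two dimensions."""
    return W_EXECUTION * d_execution + W_CONSTRAINT * d_constraint


def implied_execution_delta(d_main: float, d_constraint: float) -> float:
    """Execution change that would reconcile the reported main and constraint changes.

    This is an inference under the assumptions above, not a reported figure.
    """
    return (d_main - W_CONSTRAINT * d_constraint) / W_EXECUTION


# Gemini 3.1 Pro: all three deltas are reported, so this is a consistency check.
gemini_31_main = main_delta(d_execution=-47.5, d_constraint=-29.5)
print(f"Gemini 3.1 Pro main delta: {gemini_31_main:.1f}")

# Models whose main score rose while constraint fell: (main delta, constraint delta).
reported = {
    "Qwen3 Max": (29.3, -30.7),
    "Doubao Pro": (23.1, -24.0),
    "DeepSeek V4 Pro": (16.2, -39.2),
}
for model, (d_main, d_constraint) in reported.items():
    d_exec = implied_execution_delta(d_main, d_constraint)
    print(f"{model}: implied execution delta of about {d_exec:+.1f}")
```

The check makes the pattern in this section explicit: a main score can rise on a day when constraint falls by 30 points or more only if execution rises by considerably more.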

Direct Impact of Score Structure Differences

The core_overall formula is 0.55 × execution + 0.45 × constraint. Because the formula is linear, each 1-point drop in constraint lowers the main score by 0.45 points regardless of the execution score; for models with an execution score of 50, however, constraint makes up a larger share of the total, so the same drop is proportionally larger. Today, the median constraint of the top eight models is about 57 points and that of the bottom three is about 56.8 points, while execution differs by 50 points between the two groups. The difference on the main leaderboard therefore comes mainly from the execution dimension.
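As a minimal sketch of the formula, the snippet below recomputes main scores from the execution and constraint values quoted in this article. The function name and the `constraint_share` metric are illustrative assumptions, not part of the YZ Index tooling. It also shows that a 1-point constraint drop costs the same 0.45 main-score points at either execution level, while constraint's share of the total is larger for the models that scored 50 in execution.

```python
# Weights as stated in the article; all names here are illustrative.
W_EXECUTION = 0.55
W_CONSTRAINT = 0.45


def core_overall(execution: float, constraint: float) -> float:
    """Weighted main score: 0.55 * execution + 0.45 * constraint."""
    return W_EXECUTION * execution + W_CONSTRAINT * constraint


# (execution, constraint) pairs quoted in the article for June 15, 2026.
scores = {
    "Grok 4": (100, 61.3),
    "Doubao Pro": (100, 60.8),
    "Claude Opus 4.7": (100, 59.3),
    "Gemini 2.5 Pro": (50, 57.5),
    "ERNIE Bot 4.5": (50, 51.3),
}

for model, (execution, constraint) in scores.items():
    main = core_overall(execution, constraint)
    # Fraction of the main score contributed by the constraint term.
    constraint_share = W_CONSTRAINT * constraint / main
    print(f"{model}: main={main:.3f}, constraint share={constraint_share:.1%}")

# The absolute cost of a 1-point constraint drop does not depend on execution.
drop_at_100 = core_overall(100, 61.3) - core_overall(100, 60.3)
drop_at_50 = core_overall(50, 57.5) - core_overall(50, 56.5)
print(f"cost per constraint point: {drop_at_100:.2f} vs {drop_at_50:.2f}")
```

The unrounded results match the published main scores to within rounding (for example, 82.585 for Grok 4 against the reported 82.59).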

Anomalous signals were concentrated in Material Constraint, with all 11 models experiencing declines in this dimension, of which 10 models dropped by more than 20 points. Although Grok 4 still ranks first, its constraint of 61.3 is already close to the passing line.

A perfect execution score has become the standard, and Material Constraint is becoming the key variable determining rankings.

Today's Smoke data only reflects the results of a single-day 10-question quick test, and the stability dimension was not included in this Lite Evaluation. The focus of subsequent observation will be on the recovery speed of each model's constraint score.

Data source: YZ Index | Run #176
