Material Constraint Scores Fall About 20 Points Across the Board; Grok 4 Edges Out Claude to Take First with 81.55 Points

The most striking signal from today's Smoke lightweight evaluation is the collective failure in the material constraint dimension. Among the 11 models, 9 saw their constraint scores drop by more than 18 points compared to yesterday, with DeepSeek V4 Pro suffering the biggest drop of 29 points.

Perfect Execution Cannot Rescue the Overall Score

Grok 4, Claude Sonnet 4.6, and Claude Opus 4.7 all scored 100 points in the execution dimension, but their constraint scores only ranged from 58 to 59. Under the main leaderboard formula of 0.55 × execution + 0.45 × constraint, this pulled their main leaderboard scores down to around 81 points. Grok 4 ranked first with 81.55 points, followed closely by Claude Sonnet 4.6 with 81.28 points, a difference of only 0.27 points. The 0.55 weight on execution is what keeps these totals above 80; with execution tied at 100, the ordering among the three is decided by the constraint score alone.
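The arithmetic behind these totals can be checked with a short sketch. The weights come from the formula quoted above; the function names are illustrative, and the constraint values are back-solved from the published totals rather than taken from the raw data, so treat them as an assumption.

```python
# Weights as stated in the article's formula.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def main_score(execution: float, constraint: float) -> float:
    """Weighted main leaderboard score, rounded to two decimals."""
    return round(EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraint, 2)


def implied_constraint(main: float, execution: float) -> float:
    """Back-solve the constraint score from a published total (an inference, not reported data)."""
    return round((main - EXECUTION_WEIGHT * execution) / CONSTRAINT_WEIGHT, 1)


# Published totals with execution = 100 for both models.
grok_constraint = implied_constraint(81.55, 100)
sonnet_constraint = implied_constraint(81.28, 100)

# With execution tied, the whole gap comes from the constraint term.
gap = round(main_score(100, grok_constraint) - main_score(100, sonnet_constraint), 2)

print(grok_constraint, sonnet_constraint, gap)
```

Under these assumptions the implied constraint scores fall inside the 58 to 59 range reported above, and the 0.27-point gap equals 0.45 times the difference between them.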

Anomalous Stragglers and the Integrity Threshold

ERNIE Bot 4.5 returned to 100 points in execution today, but its constraint score dropped to 55.8 and its integrity result went from pass to fail, a classic case of "execution pulling up, constraint dragging down." Doubao Pro was even more extreme, plunging 37.2 points on the main leaderboard in a single day: execution fell from its previous high to 50 points, and constraint fell by 21.5 points. This points to systematic output instability in the model's responses to today's 10 questions.
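Doubao Pro's single-day drop can be split into its two components with the same weights. This is only a rough sketch: it assumes the 0.55 / 0.45 formula applied unchanged on both days, and the variable names are illustrative.

```python
# Weights from the article's main leaderboard formula (assumed identical on both days).
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45

total_drop = 37.2        # main leaderboard drop reported for Doubao Pro
constraint_drop = 21.5   # constraint drop reported for Doubao Pro

# Contribution of the constraint drop to the main score.
constraint_part = CONSTRAINT_WEIGHT * constraint_drop

# Whatever remains must come from execution.
execution_part = total_drop - constraint_part
implied_execution_drop = execution_part / EXECUTION_WEIGHT

# Fraction of the total drop attributable to execution.
execution_share = execution_part / total_drop

print(round(constraint_part, 3), round(implied_execution_drop, 1), round(execution_share, 2))
```

On these assumptions the implied execution drop is about 50 points, which is consistent with execution falling from 100 to 50, and execution accounts for roughly three quarters of the total decline.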

Industry Trends and Possible Causes

Recently, many vendors have emphasized "reducing hallucinations" and "citation tracing," but today's evaluation results show that practice still falls short of those claims. The sharp drop in the constraint dimension likely stems from the addition of questions requiring strict reference to external materials in the test set, which led to more unsourced inferences in model responses. Declines were particularly concentrated in the DeepSeek, Doubao, and Gemini series, suggesting that these models still lack stable knowledge boundary control in lightweight quick-test scenarios.

Qwen3 Max, although ranked fourth, has the highest constraint score of all models at 59.5, showing it still holds an advantage in material citation. In contrast, Gemini 2.5 Pro and Gemini 3.1 Pro scored only between 50 and 59 in both execution and constraint, ranking at the bottom for the second consecutive day, with the gap widening to more than 20 points.

When material constraint becomes a common weakness for all models, even the highest execution scores are nothing but castles in the air.

The most direct reminder from today's data is that model vendors need to invest more in true citation and boundary control, rather than solely pursuing perfect execution scores.


Data source: YZ Index | Run #128