11 Models' Material Constraint Scores Plunge by 15 Points, Smoke Evaluation Reveals Core Weakness

The most direct finding from today's Smoke evaluation is that the material constraint scores of 11 models collapsed together, with an average drop of more than 15 points. Because the core formula feeds constraint scores into the main ranking, main ranking scores were dragged down with them, and all of the top seven fell out of the 82-point range.

Perfect Execution Scores Mask the Constraint Crisis

All models still score 100 points on code execution, which indicates that their ability to generate logically correct code has not degraded. The gap comes from material constraints: Claude Opus 4.7, Sonnet 4.6, and GPT-o3 are tied at 59.5 points and are marked "warn", while 文心一言 (ERNIE Bot), the Gemini series, and Grok 4 have fallen into the "fail" range. The stark contrast between execution and constraint highlights a worsening problem: models can write code but cannot guarantee the authenticity of their content.

Comparison with Yesterday Reveals a Cliff-Like Decline

Compared with yesterday's data, 文心一言 4.5's main ranking score dropped by 14.5 points, its constraint score fell by 15 points, and its integrity label shifted from "warn" to "fail". DeepSeek V4 Pro's constraint score plunged by 31.7 points, the largest single-day drop for any metric. GPT-o3's constraint score fell by 29.5 points, and its main ranking score dropped by 13.3 points. Declines this concentrated are hard to attribute to random fluctuation and more likely stem from a sudden tightening of the test set's requirements for source citation and fact-checking.

With execution ability nearing its ceiling while constraint ability continues to falter, the upper limit of model usability is being redefined by the constraint dimension.

Industry Trends and Root Causes

Current training pipelines emphasize long-context generation and creative output, but the reward mechanism provides insufficient incentive for accurate citation and refusal to hallucinate. Multiple labs have reduced the proportion of fact-checking samples during the RLHF phase, which makes models prone to fabricating details when a question requires support from external material. In today's evaluation, Qwen3 Max's constraint score fell by 17.2 points and Gemini 2.5 Pro's fell by 15.5 points, both consistent with this trend.
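The constraint-score drops quoted in this report can be tabulated in a few lines. The following is a minimal sketch, not the index's own tooling: the `reported_drops` dictionary and the `summarize` helper are hypothetical names introduced here, and the data covers only the five models whose drops are stated in the text, so the mean it produces is not the 11-model average.

```python
# Constraint-score drops (in points) as stated in this report.
# Only five of the 11 models have a quoted drop, so this is a partial view.
reported_drops = {
    "文心一言 4.5": 15.0,
    "DeepSeek V4 Pro": 31.7,
    "GPT-o3": 29.5,
    "Qwen3 Max": 17.2,
    "Gemini 2.5 Pro": 15.5,
}


def summarize(drops):
    """Return (model with largest drop, mean drop, models sorted by drop descending)."""
    ranked = sorted(drops, key=drops.get, reverse=True)
    mean_drop = round(sum(drops.values()) / len(drops), 2)
    return ranked[0], mean_drop, ranked


worst, mean_drop, ranked = summarize(reported_drops)
print(f"largest drop: {worst}; mean of reported drops: {mean_drop}")
```

On these five figures, the largest drop belongs to DeepSeek V4 Pro and the mean is about 21.8 points, which sits above the "over 15 points" average reported for all 11 models.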

Notably, although Claude Opus 4.7 and Sonnet 4.6 are tied for first place, their constraint scores are only 59.5 points and carry a "warn" label. This indicates that even current top-tier models still struggle to balance "daring to say" with "being correct."

Future Outlook

If the constraint dimension remains a bottleneck, the real-world deployment of mainstream models in the second half of 2026 will be significantly limited. Enterprise users need citable, auditable outputs, not just demos that can run code. The next phase of competition will hinge on how quickly material constraints are fixed.


Data Source: YZ Index | Run #134