Claude Sonnet 4.6 Leads with 97.53 Points; Material Constraints Leave 文心一言 More Than 40 Points Behind

Today's Smoke quick test points to a clear conclusion: code execution has become the baseline that nearly every model clears, while material constraints are what actually separate the models.

Top Three Separated by Only 1.58 Points, Claude Wins Two in a Row

Claude Sonnet 4.6 ranks first with 97.53 points, followed by Opus 4.7 at 96.54 and Grok 4 at 95.95. All three scored 100 in code execution, so the gap between them comes entirely from material constraints: Sonnet 94.5, Opus 92.3, Grok 91. Because material constraints carry a weight of 0.45 in the main leaderboard score, these differences alone determine the order of the top three.
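As a rough check on how the 0.45 weight produces these totals, the sketch below assumes the main score is a simple weighted sum of the two dimensions, with the remaining 0.55 assigned to code execution. The 0.55 figure and the `main_score` helper are assumptions made for illustration, not a formula published by the YZ Index, although the result matches the three reported totals to within rounding.

```python
# The 0.45 weight for material constraints is stated in the article.
# The 0.55 weight for code execution is ASSUMED to be the remainder.
CONSTRAINT_WEIGHT = 0.45
EXECUTION_WEIGHT = 1.0 - CONSTRAINT_WEIGHT


def main_score(execution: float, constraints: float) -> float:
    """Hypothetical main-leaderboard score: weighted sum of both dimensions."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraints


# (execution, constraints, reported main score), all taken from the article
reported = {
    "Claude Sonnet 4.6": (100, 94.5, 97.53),
    "Opus 4.7": (100, 92.3, 96.54),
    "Grok 4": (100, 91.0, 95.95),
}

for name, (execution, constraints, published) in reported.items():
    computed = main_score(execution, constraints)
    # Published figures are rounded to two decimals, so allow a small tolerance.
    assert abs(computed - published) < 0.01, name
    print(f"{name}: computed {computed:.3f}, published {published}")
```

If the assumed weights were wrong, the assertions inside the loop would fail for at least one model, so the sketch doubles as a consistency check on the reported numbers.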

Perfect Execution Scores Become the Norm; 文心一言 Is the Only Exception

Of the 11 models tested, 10 scored 100 in code execution. The only exception is 文心一言 4.5, with 50 points. That pulls its main leaderboard score down to 53.83, nearly 44 points below first place. Execution is no longer a weakness for most models; material constraints have become the decisive variable instead.

Material Constraint Scores Span 36 Points, Chinese Models Under Collective Pressure

Material constraint scores range from a high of 94.5 to a low of 58.5, a spread of 36 points. GPT-5.5, 豆包 Pro, and the Gemini series all fall between 75 and 79.5, while Qwen3 Max scores only 61. Models with weak constraint adherence consistently lose points on tasks that require citing the original text strictly and avoiding hallucination, which is also the main reason the lower half of today's rankings is so tightly clustered.

Today's data once again confirms a trend: once nearly every model meets the bar on execution, the real difference lies in fidelity to the input material. Claude Sonnet 4.6's lead in this dimension has kept it at the top of the rankings for two consecutive days.

Every 10-point improvement in material constraints adds 4.5 points to the main leaderboard score. 文心一言 paid the highest price, with 50 points in execution and 58.5 points in constraints.
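To see how much of 文心一言's deficit comes from each dimension, the sketch below splits its gap to Claude Sonnet 4.6 into an execution part and a constraints part. It assumes the main score is a weighted sum with 0.45 on material constraints (stated above) and the remaining 0.55 on code execution (an assumption, not a published weight); `gap_breakdown` is a hypothetical helper.

```python
CONSTRAINT_WEIGHT = 0.45                    # stated in the article
EXECUTION_WEIGHT = 1.0 - CONSTRAINT_WEIGHT  # ASSUMED: the remaining weight


def gap_breakdown(leader, other):
    """Split the main-score gap between two (execution, constraints) pairs.

    Returns the points attributable to execution and to constraints.
    """
    execution_part = EXECUTION_WEIGHT * (leader[0] - other[0])
    constraint_part = CONSTRAINT_WEIGHT * (leader[1] - other[1])
    return execution_part, constraint_part


sonnet = (100, 94.5)  # Claude Sonnet 4.6: execution, constraints
wenxin = (50, 58.5)   # 文心一言 4.5: execution, constraints

execution_part, constraint_part = gap_breakdown(sonnet, wenxin)
total = execution_part + constraint_part

print(f"from execution:   {execution_part:.2f}")
print(f"from constraints: {constraint_part:.2f}")
print(f"total gap:        {total:.2f}")

# Marginal effect quoted in the article: 10 constraint points -> 4.5 main points
print(f"gain per 10 constraint points: {CONSTRAINT_WEIGHT * 10:.1f}")
```

Under these assumptions, about 27.5 points of the gap come from execution and about 16.2 from material constraints, roughly 43.7 in total, which matches the difference between the reported scores of 97.53 and 53.83.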

Data Source: YZ Index | Run #156