Four Models' Execution Scores Plunge to 50; ERNIE Bot Drops 34.1 Points on Main Ranking

Jun 24, 2026 | Winzheng Index

Code Execution, Material Constraints, ERNIE Bot 4.5, Claude Opus 4.7, Smoke Test

In the YZ Index Smoke lightweight evaluation of June 24, 2026, ERNIE Bot 4.5's main ranking score fell 34.1 points from the previous day to 64.63, and its execution score dropped from 100 to 50.

Obvious Gap Between Execution and Constraint

Today's top three on the main ranking (DeepSeek V4 Pro, Gemini 3.1 Pro, and Grok 4) all scored 100 in code execution and 100 in material constraint. The models ranked fourth to sixth (Doubao Pro, Gemini 2.5 Pro, and GPT-5.5) scored 100 in execution and 94.5 in constraint, each with a main ranking score of 97.53.

Eighth-place Claude Opus 4.7 and ninth-place Qwen3 Max each scored 50 in execution and 100 in constraint, for a main ranking score of 72.5. Tenth-place Claude Sonnet 4.6 scored 50 in execution and 95.5 in constraint, for a main ranking score of 70.48. This pairing of 50 in execution with a near-perfect constraint score is typical of the lower half of today's ranking.

Four Models' Execution Scores Collectively Halved

Compared with yesterday, the execution scores of ERNIE Bot 4.5, Claude Opus 4.7, Claude Sonnet 4.6, and Qwen3 Max each fell by 50 points. This simultaneous, cliff-like drop drove their main ranking scores down by 34.1, 27.5, 24.4, and 1.5 points respectively.

Changes in the material constraint dimension were relatively mild. Claude Sonnet 4.6's constraint score actually rose 6.9 points to 95.5, while ERNIE Bot 4.5's fell 14.7 points to 82.5 and received a warn rating. Constraint changes for the other models did not exceed 10 points.
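The link between these dimension changes and the main ranking declines can be checked with a short sketch. It assumes the main score is a plain weighted sum using the 0.55 and 0.45 weights that the article gives for core_overall; the function name `main_score_change` is hypothetical and not part of the YZ Index. Only ERNIE Bot 4.5 (文心一言4.5) and Claude Sonnet 4.6 are included, since they are the two models whose constraint changes are stated.

```python
# Weights stated in the article for core_overall.
# Treating the main score as a simple weighted sum is an assumption.
W_EXECUTION = 0.55
W_CONSTRAINT = 0.45


def main_score_change(d_execution: float, d_constraint: float) -> float:
    """Main score change implied by day-over-day changes in each dimension."""
    return W_EXECUTION * d_execution + W_CONSTRAINT * d_constraint


# (execution change, constraint change) as reported in the article.
changes = {
    "ERNIE Bot 4.5": (-50.0, -14.7),
    "Claude Sonnet 4.6": (-50.0, 6.9),
}

for model, (d_exec, d_cons) in changes.items():
    execution_part = W_EXECUTION * d_exec
    constraint_part = W_CONSTRAINT * d_cons
    total = main_score_change(d_exec, d_cons)
    print(
        f"{model}: execution {execution_part:+.3f}, "
        f"constraint {constraint_part:+.3f}, total {total:+.3f}"
    )
```

Under that assumption, the totals round to declines of 34.1 and 24.4 points, in line with the reported figures, and the execution term accounts for most of each decline.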

Score Structure Reveals Capability Boundaries

The top seven models all held 100 in execution, with constraint scores between 94 and 100, indicating stable output on code execution tasks. The models ranked 8th to 11th all sat at 50 in execution yet scored between 82.5 and 100 in constraint, indicating that constraint tasks put significantly less pressure on these models than execution tasks.

In the core_overall formula, code execution carries a weight of 0.55, higher than material constraint's 0.45. A drop in execution from 100 to 50 therefore costs 0.55 × 50 = 27.5 points on the main ranking, against 22.5 points for the same drop in constraint. This matches the 27.5-point decline of Claude Opus 4.7 and accounts for most of the declines of ERNIE Bot 4.5 and Claude Sonnet 4.6. The 1.5-point decline reported for Qwen3 Max is the one figure that does not fit this arithmetic.
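As a rough check, the weighting can be applied to today's dimension scores. This is a minimal sketch that assumes core_overall is a plain weighted sum of the two dimensions with no further terms; the helper `within_rounding` and its tolerance are illustrative choices, not part of the YZ Index.

```python
# Weights stated in the article; the weighted-sum form is an assumption.
W_EXECUTION = 0.55
W_CONSTRAINT = 0.45


def core_overall(execution: float, constraint: float) -> float:
    """Main ranking score as a weighted sum of the two dimensions."""
    return W_EXECUTION * execution + W_CONSTRAINT * constraint


# (model, execution, constraint, reported main score) from the article.
REPORTED = [
    ("Doubao Pro", 100, 94.5, 97.53),
    ("Claude Opus 4.7", 50, 100, 72.5),
    ("Claude Sonnet 4.6", 50, 95.5, 70.48),
    ("ERNIE Bot 4.5", 50, 82.5, 64.63),
]


def within_rounding(computed: float, reported: float, tolerance: float = 0.006) -> bool:
    """True if the computed score agrees with a score reported to two decimals."""
    return abs(computed - reported) <= tolerance


for name, execution, constraint, reported in REPORTED:
    computed = core_overall(execution, constraint)
    status = "matches" if within_rounding(computed, reported) else "differs from"
    print(f"{name}: computed {computed:.3f} {status} reported {reported}")

# Cost of an equal 50-point drop in each dimension, starting from 100/100.
execution_cost = core_overall(100, 100) - core_overall(50, 100)
constraint_cost = core_overall(100, 100) - core_overall(100, 50)
print(f"50-point execution drop costs {execution_cost:.1f} points")
print(f"50-point constraint drop costs {constraint_cost:.1f} points")
```

Under this assumption, each of the listed main scores is reproduced to within two-decimal rounding from its execution and constraint scores.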

A score of 50 in execution paired with a high constraint score, up to a perfect 100, is the fixed pattern of the lower half of today's ranking.

ERNIE Bot 4.5 shows both a warn signal and the largest decline, reflecting significant swings in both the execution and constraint dimensions. The other three models whose execution scores plummeted still hold a pass rating, indicating that the integrity dimension did not trigger any new thresholds.

Today's data reflects a single 10-question quick test only. The large swings in the execution dimension may stem from the distribution of question difficulty or from differences in model output stability in this particular run. Data from multiple days is needed to verify whether this is a sustained trend.

Data source: YZ Index | Run #195
