文心一言 Main Score Plunges 40.3 Points, Smoke Evaluation Reveals Dual Collapse in Execution and Constraint

Jun 22, 2026 | Source: Winzheng Index

Tags: ERNIE Bot, Material Constraints, GPT-5.5, Smoke Light Test, Model Fluctuations

In the Smoke lightweight evaluation on 2026-06-22, GPT-5.5 and GPT-o3 each scored 100 on the main leaderboard, 100 on Execution, and 100 on Constraint, tying for first place with perfect scores.

Structural Characteristics of Perfect Score Models

GPT-5.5 and GPT-o3 both scored 100 on the two dimensions, code execution and material constraint, so the core_overall formula, 0.55×Execution + 0.45×Constraint, also yields 100 for each. Claude Opus 4.7 scored 99.01 on the main leaderboard, with 100 on Execution and 97.8 on Constraint: a 2.2-point gap on the constraint side, which costs 0.99 points overall (0.45 × 2.2).
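The weighting can be checked by hand. The following is a minimal Python sketch, assuming the formula is applied exactly as stated above; the function name mirrors the article's core_overall term, while the constant names are illustrative and not taken from the Winzheng Index.

```python
# Weights as quoted in the article; the constant names are illustrative.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def core_overall(execution: float, constraint: float) -> float:
    """Weighted main-leaderboard score: 0.55*Execution + 0.45*Constraint."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraint


# Scores reported for this run; results are formatted to two decimals.
print(f"GPT-5.5 / GPT-o3: {core_overall(100, 100):.2f}")   # reported: 100
print(f"Claude Opus 4.7:  {core_overall(100, 97.8):.2f}")  # reported: 99.01
```

Because Execution carries the larger weight, a point lost on Execution costs 0.55 on the main leaderboard, while a point lost on Constraint costs 0.45.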

Differences in Strong/Weak Alignment of Execution and Constraint

The models ranked 4th to 7th (Claude Sonnet 4.6, 豆包Pro, Gemini 3.1 Pro, and Grok 4) all scored 98.83 on the main leaderboard, with 100 on Execution and 97.4 on Constraint. DeepSeek V4 Pro scored 97.8 on the main leaderboard, with 100 on Execution and 95.1 on Constraint; its 4.9-point shortfall on Constraint costs about 2.2 points overall under the 0.45 weight.

Qwen3 Max scored 85.96 on the main leaderboard, with 100 on Execution and 68.8 on Constraint, a constraint score far below that of every model ranked above it. Gemini 2.5 Pro scored 71.33 on the main leaderboard, with only 50 on Execution and 97.4 on Constraint, making the execution side its main weakness. 文心一言4.5 scored 47.98 on the main leaderboard, with 50 on Execution and 45.5 on Constraint, so both dimensions are at low levels.
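As a consistency check, the reported main scores can be recomputed from the two dimension scores, and each model's weaker side flagged. The sketch below assumes the 0.55/0.45 weights quoted earlier and that published figures are rounded to two decimals; the helper names and the 0.01 tolerance are illustrative choices, not part of the evaluation itself.

```python
W_EXEC, W_CONS = 0.55, 0.45  # weights quoted in the article

# (model, reported main score, Execution, Constraint) as listed in the article.
REPORTED = [
    ("GPT-5.5", 100.0, 100.0, 100.0),
    ("GPT-o3", 100.0, 100.0, 100.0),
    ("Claude Opus 4.7", 99.01, 100.0, 97.8),
    ("Claude Sonnet 4.6", 98.83, 100.0, 97.4),
    ("豆包Pro", 98.83, 100.0, 97.4),
    ("Gemini 3.1 Pro", 98.83, 100.0, 97.4),
    ("Grok 4", 98.83, 100.0, 97.4),
    ("DeepSeek V4 Pro", 97.8, 100.0, 95.1),
    ("Qwen3 Max", 85.96, 100.0, 68.8),
    ("Gemini 2.5 Pro", 71.33, 50.0, 97.4),
    ("文心一言4.5", 47.98, 50.0, 45.5),
]


def weaker_dimension(execution: float, constraint: float) -> str:
    """Name the lower-scoring dimension, or 'balanced' when they are equal."""
    if execution == constraint:
        return "balanced"
    return "Execution" if execution < constraint else "Constraint"


def is_consistent(reported: float, execution: float, constraint: float,
                  tolerance: float = 0.01) -> bool:
    """True when the reported score matches the weighted sum within rounding."""
    recomputed = W_EXEC * execution + W_CONS * constraint
    return abs(recomputed - reported) <= tolerance


for model, reported, execution, constraint in REPORTED:
    status = "ok" if is_consistent(reported, execution, constraint) else "MISMATCH"
    print(f"{model:<18} {reported:>6.2f}  {status:<8} "
          f"weaker: {weaker_dimension(execution, constraint)}")
```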

Abnormal Fluctuations Compared to Yesterday

文心一言4.5's main leaderboard score dropped 40.3 points compared to yesterday, with Execution down 31.3 points and Constraint down 51.3 points. Gemini 2.5 Pro's main score dropped 28 points, with Execution down 50 points. Qwen3 Max's main score rose 5.1 points: Execution rose 31.2 points while Constraint dropped 26.7 points.
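Since core_overall is linear, a day-over-day change in the main score should equal the same weighted sum of the changes in the two dimensions. The sketch below illustrates this under that assumption; the function names are hypothetical, and because the published deltas are rounded to one decimal, small mismatches are expected.

```python
W_EXEC, W_CONS = 0.55, 0.45  # weights quoted in the article


def implied_main_delta(d_exec: float, d_cons: float) -> float:
    """Main-score change implied by the changes in the two dimensions."""
    return W_EXEC * d_exec + W_CONS * d_cons


def implied_constraint_delta(d_main: float, d_exec: float) -> float:
    """Solve the same linear relation for an unreported Constraint change."""
    return (d_main - W_EXEC * d_exec) / W_CONS


# Day-over-day changes as reported (negative values are drops).
print(f"文心一言4.5: {implied_main_delta(-31.3, -51.3):+.2f}")  # reported: -40.3
print(f"Qwen3 Max:   {implied_main_delta(31.2, -26.7):+.2f}")   # reported: +5.1

# Gemini 2.5 Pro: only the main-score and Execution changes are reported.
print(f"Gemini 2.5 Pro, implied Constraint change: "
      f"{implied_constraint_delta(-28.0, -50.0):+.2f}")
```

For Gemini 2.5 Pro the relation implies a Constraint change of roughly -1.1 points. This is only an estimate, since the 28-point figure it is derived from is itself rounded.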

Claude Sonnet 4.6's main leaderboard score rose 2.3 points, with Constraint up 5.2 points. 豆包Pro's main score rose 2.2 points. Nine of the eleven models reported here scored 100 on Execution today, while their Constraint scores ranged from 45.5 to 100.

Structural Interpretation of Abnormal Signals

Although Qwen3 Max's material constraint score plunged 26.7 points, its main leaderboard score still held at 85.96, showing how a 100 on Execution supports the overall score. Gemini 2.5 Pro's Execution fell 50 points to 50, which implies a score of 100 yesterday, and that drop alone accounts for 27.5 of the 28 points lost on the main leaderboard (0.55 × 50). 文心一言4.5 fell sharply on both Execution and Constraint, so both weighted terms of core_overall (0.55 and 0.45) shrank at once, producing the largest overall decline.

These fluctuations only reflect the results of the 10-question quick test on that day. The differences in the combination of Execution and Constraint determine the real-time ranking positions of each model in the Smoke evaluation.

With Execution at 50 points and Constraint at 45.5 points, 文心一言4.5 lands at 47.98 points on today's main leaderboard.

Data source: 赢政指数 (YZ Index) | Run #191

