Claude Opus 4.7 Scores 100 to Claim Crown, 9 Models See Code Execution Plummet by 50 Points

Jun 16, 2026 | Source: Winzheng Index

Claude Opus 4.7 Code Execution Smoke Test Abnormal Main Leaderboard Ranking Fluctuations

On June 16, 2026, the Smoke lightweight evaluation results showed that Claude Opus 4.7 scored 100 on the main leaderboard, with 100 in code execution and 100 in material constraint, and an integrity rating of pass, making it the only full-score model that day.

Score Structure Reveals Model Differentiation

The main leaderboard formula is core_overall = 0.55 × Code Execution + 0.45 × Material Constraint. Among the 11 models that day, 9 scored only 50 or 0 in code execution while their material constraint scores stayed at or near 100 (seven at 100, two at 94.5), so their main leaderboard scores clustered between 45 and 72.5. ERNIE Bot 4.5 scored 66.7 in execution and 100 in constraint for a main leaderboard score of 81.69, ranking second; its execution score was 16.7 points higher than that of the models that scored 50.

Claude Sonnet 4.6, Doubao Pro, GPT-o3, Grok 4, and Qwen3 Max all scored 50 in execution and 100 in constraint, with a main leaderboard score of 72.5. DeepSeek V4 Pro and GPT-5.5 scored 50 in execution and 94.5 in constraint, with a main leaderboard score of 70.03. Gemini 2.5 Pro and Gemini 3.1 Pro scored 0 in execution and 100 in constraint, with a main leaderboard score of 45.
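The reported main leaderboard scores can be reproduced from the formula above. The following is a minimal sketch, not the index's actual scoring code: the function name `core_overall` and the `reported` dictionary are illustrative, and only one model from each score tier is included. The article does not state its rounding rule, so the last digit may differ slightly from the published figures.

```python
# Weights taken from the leaderboard formula quoted in the article.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def core_overall(code_execution: float, material_constraint: float) -> float:
    """Weighted main leaderboard score on a 0-100 scale."""
    return EXECUTION_WEIGHT * code_execution + CONSTRAINT_WEIGHT * material_constraint


# (code execution, material constraint) pairs as reported for June 16, 2026.
# One representative model per score tier; the dict name is hypothetical.
reported = {
    "Claude Opus 4.7": (100.0, 100.0),
    "ERNIE Bot 4.5": (66.7, 100.0),
    "Claude Sonnet 4.6": (50.0, 100.0),
    "DeepSeek V4 Pro": (50.0, 94.5),
    "Gemini 2.5 Pro": (0.0, 100.0),
}

scores = {name: core_overall(*pair) for name, pair in reported.items()}

# Print the tiers from highest to lowest main leaderboard score.
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:<18} {score:6.2f}")
```

Because constraint scores sit at or near 100 for every model, almost all of the spread between tiers comes from the 0.55-weighted execution term: a 50-point execution gap alone is worth 27.5 points on the main leaderboard.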

Abnormal Fluctuations Compared to Yesterday

Compared to yesterday's data, ERNIE Bot 4.5's main leaderboard rose by 31.1 points, execution increased by 16.7 points, and constraint increased by 48.7 points, with the integrity rating changing from pass to warn. Claude Opus 4.7's main leaderboard rose by 18.3 points, constraint increased by 40.7 points, and the integrity rating changed from warn to pass.
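Since the main score is a weighted sum, day-over-day changes should decompose the same way. The sketch below checks the reported deltas under the assumption that the same 0.55/0.45 weights applied on both days; the helper names are hypothetical and the inputs are the rounded figures quoted above.

```python
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def main_score_change(execution_change: float, constraint_change: float) -> float:
    """Change in core_overall implied by changes in the two dimensions."""
    return EXECUTION_WEIGHT * execution_change + CONSTRAINT_WEIGHT * constraint_change


def implied_execution_change(main_change: float, constraint_change: float) -> float:
    """Solve the weighted formula for the execution change."""
    return (main_change - CONSTRAINT_WEIGHT * constraint_change) / EXECUTION_WEIGHT


# ERNIE Bot 4.5: execution +16.7, constraint +48.7 (reported main change: +31.1).
ernie_change = main_score_change(16.7, 48.7)

# Claude Opus 4.7: main +18.3, constraint +40.7; no execution change is reported.
opus_execution_change = implied_execution_change(18.3, 40.7)

print(f"ERNIE Bot 4.5 main-score change: {ernie_change:.1f}")
print(f"Claude Opus 4.7 implied execution change: {opus_execution_change:.2f}")
```

Under these assumptions the ERNIE Bot 4.5 figures are internally consistent, and the implied execution change for Claude Opus 4.7 is zero to within rounding, which suggests its entire gain came from the material constraint dimension.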

Nine models saw either a 50-point drop in code execution or a significant decline on the main leaderboard: GPT-5.5's main leaderboard score dropped by 12.3 points, Grok 4's by 10.1 points, Doubao Pro's by 9.9 points, Qwen3 Max's by 9.6 points, and Gemini 2.5 Pro's by 8.4 points. Claude Sonnet 4.6 and GPT-o3 both fell by 50 points in execution.

Signals of Imbalance Between Execution and Constraint

The material constraint dimension remained high while code execution declined across the board, suggesting that the test questions that day may have placed higher demands on code generation or debugging. The Gemini series' execution scores dropped to zero, and their main leaderboard scores fell by about 8 points compared to yesterday, which suggests that their code-execution output diverged furthest from the scoring criteria.

ERNIE Bot 4.5 stood out in the execution dimension, possibly because it responded more consistently to the code-related questions among the 10 questions that day. Claude Opus 4.7 achieved full marks in both dimensions, indicating that it met the scoring requirements for both execution accuracy and material citation constraints.

With most other models' execution scores at or below 50, the lead Claude Opus 4.7 established with its dual 100 scores will be hard to close in the short term.

Data source: YZ Index | Run #182
