On June 16, 2026, the Smoke lightweight evaluation put Claude Opus 4.7 at 100 on the main leaderboard, with 100 in code execution, 100 in material constraint, and an integrity rating of pass. It was the only model with a full score that day.
Score Structure Reveals Model Differentiation
The main leaderboard formula is core_overall = 0.55 × Code Execution + 0.45 × Material Constraint. Of the 11 models scored that day, 9 scored only 50 or 0 in code execution while holding material constraint at 94.5 or 100, so their main leaderboard scores converged in the 45 to 72.5 range. ERNIE Bot 4.5 scored 66.7 in execution and 100 in constraint for a main leaderboard score of 81.69, ranking second; its execution score was 16.7 points higher than that of the models that scored 50.
Claude Sonnet 4.6, Doubao Pro, GPT-o3, Grok 4, and Qwen3 Max all scored 50 in execution and 100 in constraint, with a main leaderboard score of 72.5. DeepSeek V4 Pro and GPT-5.5 scored 50 in execution and 94.5 in constraint, with a main leaderboard score of 70.03. Gemini 2.5 Pro and Gemini 3.1 Pro scored 0 in execution and 100 in constraint, with a main leaderboard score of 45.
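The scores above can be checked directly against the published formula. The following is a minimal sketch: the function and variable names are illustrative and not part of the YZ Index tooling, and only one representative model is included for each score pair.

```python
# Weights taken from the published formula:
# core_overall = 0.55 * Code Execution + 0.45 * Material Constraint
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def core_overall(execution: float, constraint: float) -> float:
    """Return the weighted main leaderboard score, before rounding."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraint


# (execution, constraint) pairs as reported for June 16, 2026.
# One representative model per distinct score pair.
reported = {
    "Claude Opus 4.7": (100.0, 100.0),
    "ERNIE Bot 4.5": (66.7, 100.0),
    "Claude Sonnet 4.6": (50.0, 100.0),
    "GPT-5.5": (50.0, 94.5),
    "Gemini 2.5 Pro": (0.0, 100.0),
}

for model, (execution, constraint) in reported.items():
    print(f"{model}: {core_overall(execution, constraint):.3f}")
```

The unrounded values (100, 81.685, 72.5, 70.025, and 45) agree with the published 100, 81.69, 72.5, 70.03, and 45 once rounded to two decimal places.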
Abnormal Fluctuations Compared to Yesterday
Compared to yesterday's data, ERNIE Bot 4.5's main leaderboard score rose by 31.1 points, with execution up 16.7 points and constraint up 48.7 points, and its integrity rating changed from pass to warn. Claude Opus 4.7's main leaderboard score rose by 18.3 points, with constraint up 40.7 points, and its integrity rating changed from warn to pass.
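Because the main leaderboard formula is linear, a day-over-day change in the main score should equal the same weighted sum applied to the per-dimension changes. The sketch below checks the two reported gains. The helper name is illustrative, and it assumes Claude Opus 4.7's execution score was unchanged, since the report gives only its constraint change.

```python
# Weights taken from the published formula:
# core_overall = 0.55 * Code Execution + 0.45 * Material Constraint
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def core_delta(execution_delta: float, constraint_delta: float) -> float:
    """Change in the main score implied by per-dimension changes."""
    return EXECUTION_WEIGHT * execution_delta + CONSTRAINT_WEIGHT * constraint_delta


# ERNIE Bot 4.5: execution +16.7, constraint +48.7 (both reported).
ernie_delta = core_delta(16.7, 48.7)

# Claude Opus 4.7: constraint +40.7 (reported).
# Assumption: execution change is 0, because no execution change is reported.
opus_delta = core_delta(0.0, 40.7)

print(f"ERNIE Bot 4.5: {ernie_delta:.1f}")
print(f"Claude Opus 4.7: {opus_delta:.1f}")
```

Under that assumption, the implied gains are about 31.1 and 18.3 points, matching the reported changes in the main leaderboard score.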
Nine models saw either a 50-point drop in code execution or a significant decline in their main leaderboard score. GPT-5.5's main leaderboard score dropped by 12.3 points, Grok 4's by 10.1, Doubao Pro's by 9.9, Qwen3 Max's by 9.6, and Gemini 2.5 Pro's by 8.4. Claude Sonnet 4.6 and GPT-o3 each lost 50 points in execution.
Signals of Imbalance Between Execution and Constraint
The material constraint dimension remained high while code execution declined across most models, suggesting that the test questions that day may have placed higher demands on code generation or debugging. The Gemini series' execution scores dropped to zero, with the main leaderboard score falling by about 8 points compared to yesterday, which suggests their code execution output diverged further from the scoring criteria.
ERNIE Bot 4.5 stood out in the execution dimension, possibly because it responded more stably to the code-related questions among the 10 questions that day. Claude Opus 4.7 achieved full marks in both dimensions, indicating that it met the scoring requirements for both execution accuracy and material citation constraints.
With most other models' execution scores at or below 50, the lead Claude Opus 4.7 built on its dual 100 scores will be hard to close in the short term.
Data source: YZ Index | Run #182
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article.