In the YZ Index Smoke Lite evaluation on June 13, 2026, Claude Opus 4.7 ranked first on the main leaderboard with 90.78 points, achieving 100 points in code execution and 79.5 points in material constraints.
Full Marks in Execution Common, Constraints Become Sole Dividing Line
Today, all top 10 models achieved full marks in code execution. Under the core_overall formula of 0.55×Execution + 0.45×Constraints, that leaves material constraints as the only variable separating them. Claude Opus 4.7 scored 79.5 in constraints, 豆包Pro 78.5, and Gemini 2.5 Pro 77.3. Each constraint point is worth 0.45 points on the main leaderboard, so Claude Opus 4.7's 1.0-point constraint lead over 豆包Pro becomes a 0.45-point lead overall, and 豆包Pro's 1.2-point lead over Gemini 2.5 Pro becomes 0.54 points.
文心一言4.5 is the only model that did not achieve full marks in execution. It scored 50 in execution and 76.8 in constraints, for only 62.06 points on the main leaderboard, 28.27 points behind second place. This shows that once execution falters, even decent constraint performance cannot secure a top-tier position.
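The arithmetic behind these figures can be reproduced with a short sketch. It assumes only the 0.55/0.45 weights and the per-model scores quoted above; the function name `core_overall` and the dictionary layout are illustrative choices, not part of any YZ Index tooling.

```python
# Weights taken from the core_overall formula quoted in the article.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def core_overall(execution: float, constraints: float) -> float:
    """Return the unrounded weighted main-leaderboard score."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraints


# (execution, constraints) pairs as reported in the article.
# Execution is 100 for every model except 文心一言4.5.
scores = {
    "Claude Opus 4.7": (100.0, 79.5),
    "豆包Pro": (100.0, 78.5),
    "Gemini 2.5 Pro": (100.0, 77.3),
    "文心一言4.5": (50.0, 76.8),
}

overall = {name: core_overall(e, c) for name, (e, c) in scores.items()}

# Print the models from highest to lowest weighted score.
for name, value in sorted(overall.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")

# Distance between second place and 文心一言4.5, before rounding.
gap_to_second = overall["豆包Pro"] - overall["文心一言4.5"]
print(f"gap to second place: {gap_to_second:.3f}")
```

The unrounded gap comes out at roughly 28.265, which matches the reported 28.27 when the two leaderboard scores are rounded to two decimals before subtracting.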
Material Constraints Plunge Collectively, Anomalous Signals Concentrate
Compared to yesterday, eight models saw double-digit declines in material constraints. GPT-5.5's constraint score fell 20.3 points to 66, dropping it to sixth on the main leaderboard; Qwen3 Max's fell 30.3 points to 64.5; and Gemini 3.1 Pro's fell 34 points, pulling its main leaderboard score down 13.9 points to 83.04. These declines far exceed fluctuations in execution, indicating that today's test materials placed significantly higher demands on the constraint dimension.
豆包Pro's main leaderboard score rose 23.9 points, primarily driven by a 47.5-point recovery in execution from yesterday's low, while constraints only fell 5 points, still landing it in second place. Gemini 2.5 Pro's execution rebounded 45 points, constraints fell 15.2 points, netting a gain of 17.9 points, showing that improvements in execution can partially offset constraint losses.
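Because the score is a weighted sum, a day-over-day change splits cleanly into an execution part and a constraint part. The sketch below applies the same 0.55/0.45 weights to the changes quoted above; `overall_delta` is a hypothetical helper name used only for illustration.

```python
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def overall_delta(execution_change: float, constraint_change: float) -> float:
    """Change in the weighted score given the change in each dimension."""
    return (EXECUTION_WEIGHT * execution_change
            + CONSTRAINT_WEIGHT * constraint_change)


# 豆包Pro: execution +47.5, constraints -5 (figures from the article).
doubao_delta = overall_delta(47.5, -5.0)

# Gemini 2.5 Pro: execution +45, constraints -15.2 (figures from the article).
gemini_25_delta = overall_delta(45.0, -15.2)

print(f"豆包Pro net change: {doubao_delta:+.3f}")
print(f"Gemini 2.5 Pro net change: {gemini_25_delta:+.3f}")
```

The two results, about +23.875 and +17.91, agree with the reported gains of 23.9 and 17.9 points after rounding.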
Structural Characteristics and Stability Concerns
The current landscape shows that code execution has plateaued, while material constraints have become the frequently volatile dimension. Although Claude Opus 4.7's constraint score also dropped 16.5 points, it still leads the constraint dimension at 79.5 points, indicating a higher baseline for its constraint performance. GPT-5.5, at 66 in constraints and carrying a "warn" integrity rating, faces greater risk exposure in an environment where multiple models' constraint scores are declining simultaneously.
文心一言4.5's execution score of 50 sets it apart from every other model, exposing a persistent weakness in code execution tasks rather than a one-day fluctuation.
The sharp fluctuations in material constraints are exposing the true upper limits of models. Full marks in execution are only an entry ticket; constraint stability is the final ticket.
Data Source: YZ Index (Winzheng Index) | Run #166
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article