In the June 19, 2026 YZ Index Smoke lightweight evaluation, Gemini 3.1 Pro ranked first with 99.28 points on the main leaderboard, 100 points in code execution, and 98.4 points in material constraints. The main score is a weighted sum, 0.55 × execution + 0.45 × constraints (here 0.55 × 100 + 0.45 × 98.4 = 99.28), so the lead reflects strength in both dimensions rather than in one alone.
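The weighting can be checked with a few lines of Python. This is a minimal sketch, not the index's own scoring code: the function name `main_score` is made up for illustration, and rounding half-up to two decimals is an assumption, since the article does not state a rounding rule.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights quoted in the article: 0.55 x execution + 0.45 x constraints.
W_EXECUTION = Decimal("0.55")
W_CONSTRAINTS = Decimal("0.45")


def main_score(execution: str, constraints: str) -> Decimal:
    """Combine the two sub-scores into a main leaderboard score.

    Rounding half-up to two decimals is an assumption, not a documented rule.
    Scores are passed as strings so Decimal keeps them exact.
    """
    raw = W_EXECUTION * Decimal(execution) + W_CONSTRAINTS * Decimal(constraints)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(main_score("100", "98.4"))   # Gemini 3.1 Pro -> 99.28
print(main_score("94.1", "92.2"))  # ERNIE 4.5      -> 93.25
print(main_score("100", "84.8"))   # GPT-o3         -> 93.16
```

Under these assumptions the three results match the main leaderboard scores reported in this article for the same models.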
Constraint Differentiation Among Execute Score Leaders
Among the 11 models evaluated today, 10 achieved a perfect 100 in code execution: Gemini 3.1 Pro, Claude Opus 4.7, DeepSeek V4 Pro, Qwen3 Max, Gemini 2.5 Pro, Grok 4, GPT-o3, GPT-5.5, 豆包 Pro, and Claude Sonnet 4.6. With execution tied, the ranking was decided almost entirely by material constraints: Gemini 3.1 Pro’s 98.4 points in constraints created a 2.53-point gap over Claude Opus 4.7 and DeepSeek V4 Pro, which share second place.
ERNIE 4.5 was the only model without a perfect execution score, with 94.1 in code execution, 92.2 in material constraints, and a main leaderboard score of 93.25. Its execution shortfall kept it out of the top six, but its constraint score still beat GPT-o3’s 84.8.
Sharp Fluctuations Compared to Yesterday
Compared to yesterday’s data, Qwen3 Max’s material constraints improved by 23 points, propelling its main leaderboard score from approximately 86.95 to 97.35 and moving its ranking up to fourth. Grok 4’s constraints rose by 19.6 points, lifting its main leaderboard score by 8.8 points to 95.82. Both models maintained 100 in execution, and the single-day improvements on the constraint side directly translated into ranking gains.
Reverse fluctuations were equally pronounced. GPT-o3’s material constraints dropped by 15.2 points, reducing its main leaderboard score by 6.8 points to 93.16; 豆包 Pro’s constraints fell by 15.9 points, lowering its main leaderboard score by 7.2 points to 92.85. Claude Sonnet 4.6’s constraints decreased by 14 points, dropping its main leaderboard score by 6.3 points to 92.53.
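For models with 100 in execution, the constraint score can be recovered from the main score by inverting the weighting. The sketch below is illustrative only: `implied_constraints` is a hypothetical helper, and because published main scores are rounded, the recovered values are approximate.

```python
def implied_constraints(main: float, execution: float = 100.0) -> float:
    """Solve main = 0.55 * execution + 0.45 * constraints for constraints."""
    return (main - 0.55 * execution) / 0.45


# Main leaderboard scores as reported in the article; execution is 100 for all three.
reported = [("GPT-o3", 93.16), ("豆包 Pro", 92.85), ("Claude Sonnet 4.6", 92.53)]

for name, main in reported:
    print(f"{name}: about {implied_constraints(main):.1f} in constraints")
```

The GPT-o3 result reproduces the 84.8 quoted in the article, which is a useful sanity check on the inversion.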
Structural Interpretation of Anomalous Signals
GPT-o3 and 豆包 Pro both kept 100 in execution while their constraint scores collapsed, which localizes the problem to the material constraint stage. With a constraint weight of 0.45, a drop of about 15 points costs roughly 6.8 to 7.2 points on the main leaderboard, consistent with the observed score declines. Both models’ constraint scores were already in the lower-middle tier yesterday; today’s further decline widened their gap with the top five to more than 5 points.
Qwen3 Max and Grok 4 moved in the opposite direction. Both already had perfect execution, so their constraint gains carried straight through to the main leaderboard with no offsetting movement on the execution side, which suggests their execution performance is relatively stable.
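When execution is unchanged, the main score moves by exactly 0.45 times the change in constraints. The following sketch applies that to the day-over-day constraint changes reported in this article; `main_delta` and the `changes` table are illustrative names, not part of any YZ Index tooling.

```python
CONSTRAINT_WEIGHT = 0.45  # constraint weight quoted in the article


def main_delta(constraint_delta: float) -> float:
    """Change in main score caused by a change in constraints, execution held fixed."""
    return CONSTRAINT_WEIGHT * constraint_delta


# Day-over-day constraint changes as reported in the article.
changes = {
    "Qwen3 Max": +23.0,
    "Grok 4": +19.6,
    "GPT-o3": -15.2,
    "豆包 Pro": -15.9,
    "Claude Sonnet 4.6": -14.0,
}

for model, delta in changes.items():
    print(f"{model}: constraints {delta:+.1f} -> main {main_delta(delta):+.2f}")
```

The computed losses for GPT-o3, 豆包 Pro, and Claude Sonnet 4.6 (about 6.84, 7.16, and 6.30) line up with the rounded 6.8, 7.2, and 6.3 quoted in the article.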
Single-day swings of more than 15 points in constraints have become a key signal for distinguishing a model’s real-world usability.
Today, the top six models all scored above 90.7 in constraints, while the bottom five ranged from 83.4 to 92.2. Perfect execution has become the norm, so the stability and ceiling of material constraint scores increasingly decide the daily Smoke ranking.
Data source: YZ Index (赢政指数) | Run #187
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article.