In the June 30, 2026 Smoke Lite evaluation of the YZ Index, Gemini 3.1 Pro ranked first with a main score of 98.47 (Code Execution 100, Material Constraints 96.6).
This evaluation covered 11 models, with core_overall weighted as 0.55 × Code Execution + 0.45 × Material Constraints. Both Gemini 3.1 Pro and Grok 4 achieved a perfect execution score of 100, but Grok 4 scored only 95.5 in constraints, placing it 0.49 points behind in the main ranking.
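The weighting can be checked directly against the reported figures. The snippet below is a minimal sketch: the function and constant names are illustrative rather than part of the YZ Index tooling, and only the two weights and the four sub-scores come from the report.

```python
# Weights as stated in the report; all names here are illustrative.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def core_overall(execution: float, constraints: float) -> float:
    """Weighted main score: 0.55 x Code Execution + 0.45 x Material Constraints."""
    return EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraints


gemini_31_pro = core_overall(100.0, 96.6)
grok_4 = core_overall(100.0, 95.5)

print(f"Gemini 3.1 Pro: {gemini_31_pro:.2f}")
print(f"Grok 4:         {grok_4:.3f}")
print(f"Gap:            {gemini_31_pro - grok_4:.3f}")
```

Unrounded, Grok 4 comes out at 97.975 and the gap at about 0.495; the reported 0.49 follows if Grok 4's score is first rounded to 97.98.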
Structural Differences Between Execution and Constraints
DeepSeek V4 Pro posted a main score of 96.65, with 94.8 for execution and 98.9 for constraints. Its constraint score was 2.3 points higher than Gemini 3.1 Pro's, but it ranked third because its execution score was 5.2 points lower.
GPT-o3 and GPT-5.5 both scored 83.3 in execution, with constraint scores of 98.9 and 94.3 respectively. With execution tied, the 4.6-point constraint gap alone separates them: at a weight of 0.45 it translates into a 2.07-point lead for GPT-o3 in the main score.
Widespread Decline in Execution Scores Across Multiple Models
Compared with the previous day, Claude Opus 4.7's execution score dropped 27.2 points to 72.8, and its main score fell 16 points; Claude Sonnet 4.6's execution score dropped 25 points to 75, with a 15.3-point fall in its main score. Qwen3 Max's execution score dropped 12.7 points to 75, with a 9.1-point fall in its main score. Gemini 2.5 Pro's execution score dropped 21.9 points to 53.1, with a 13.6-point fall in its main score.
ERNIE Bot (文心一言) 4.5's execution score dropped 14.6 points to 75, and its constraint score dropped 20.2 points to 66.3. Its main score fell 17.1 points as a result, the largest decline of any model today.
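Because the main score is a weighted sum, a drop in it splits into an execution part and a constraint part. The sketch below assumes the same 0.55 / 0.45 weights applied on both days, which the report does not state explicitly; the names are illustrative. It reproduces the 17.1-point drop for 文心一言 4.5 and estimates how much of each other reported drop the execution decline accounts for.

```python
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def main_score_drop(execution_drop: float, constraint_drop: float) -> float:
    """Main-score drop implied by the drops in the two sub-scores."""
    return EXECUTION_WEIGHT * execution_drop + CONSTRAINT_WEIGHT * constraint_drop


# 文心一言 4.5 is the only model for which both sub-score drops are reported.
execution_part = EXECUTION_WEIGHT * 14.6
constraint_part = CONSTRAINT_WEIGHT * 20.2
total_drop = main_score_drop(14.6, 20.2)
print(f"execution part:  {execution_part:.2f}")
print(f"constraint part: {constraint_part:.2f}")
print(f"total drop:      {total_drop:.2f}")

# For the other models only the execution drop and the main-score drop are
# reported: (execution drop, reported main-score drop).
reported_drops = {
    "Claude Opus 4.7": (27.2, 16.0),
    "Claude Sonnet 4.6": (25.0, 15.3),
    "Qwen3 Max": (12.7, 9.1),
    "Gemini 2.5 Pro": (21.9, 13.6),
}
execution_share = {
    name: EXECUTION_WEIGHT * exec_drop / main_drop
    for name, (exec_drop, main_drop) in reported_drops.items()
}
for name, share in execution_share.items():
    print(f"{name}: execution explains about {share:.0%} of the drop")
```

Under this assumption, the execution decline accounts for the bulk of each of the four drops, which is in line with the report's reading that execution drove today's movement.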
Anomaly Signal Analysis
Both Claude models saw their execution scores drop by 25 points or more at the same time, while their constraint scores held at 97.7 and 91.7. This suggests that material constraint capability was largely unaffected and that the issue is concentrated in code execution consistency.
Gemini 2.5 Pro's constraint score held at 96.6, on par with Gemini 3.1 Pro, but its execution score was only 53.1. That 46.9-point execution gap leaves its main score 25.79 points below Gemini 3.1 Pro's and exposes a clear weakness in execution.
DeepSeek V4 Pro is the only model with an execution score below 95 that still made it into the top three; its constraint score of 98.9 offset the execution gap.
Today's data shows that the top two places in the main ranking went to models with an execution score of 100, while every model with an execution score below 75 fell out of the top five. GPT-o3 and DeepSeek V4 Pro, both with constraint scores of 98.9, ranked fourth and third respectively, suggesting that a high constraint score can give a model with a moderate execution score a ranking buffer.
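The ordering described above can be reproduced from the sub-scores quoted in this report. This is a sketch under stated assumptions, not the YZ Index's own ranking code: it covers only the nine models whose two sub-scores are both quoted (Qwen3 Max's constraint score and the eleventh model are not given), and it assumes the Claude constraint scores 97.7 and 91.7 belong to Opus 4.7 and Sonnet 4.6 in that order.

```python
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45

# (execution, constraints) as quoted in the report. The pairing of the two
# Claude constraint scores with Opus and Sonnet is an assumption.
reported_scores = {
    "Gemini 3.1 Pro": (100.0, 96.6),
    "Grok 4": (100.0, 95.5),
    "DeepSeek V4 Pro": (94.8, 98.9),
    "GPT-o3": (83.3, 98.9),
    "GPT-5.5": (83.3, 94.3),
    "Claude Opus 4.7": (72.8, 97.7),
    "Claude Sonnet 4.6": (75.0, 91.7),
    "Gemini 2.5 Pro": (53.1, 96.6),
    "文心一言 4.5": (75.0, 66.3),
}

# Compute the weighted main score for each model.
main_scores = {
    name: EXECUTION_WEIGHT * execution + CONSTRAINT_WEIGHT * constraints
    for name, (execution, constraints) in reported_scores.items()
}

# Sort from highest to lowest main score.
ranking = sorted(main_scores.items(), key=lambda item: item[1], reverse=True)

for position, (name, score) in enumerate(ranking, start=1):
    execution, constraints = reported_scores[name]
    print(f"{position}. {name}: {score:.2f} "
          f"(execution {execution}, constraints {constraints})")
```

Positions below fourth should be read with care, since the two models missing from this table could slot in anywhere their unreported scores place them.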
Execution volatility is reshaping the Smoke Lite leaderboard, and models with stable constraint scores are better cushioned against it.
The next Smoke evaluation will verify whether these execution score declines persist.
Data source: YZ Index | Run #205
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article.