9 Models Tie at 77.5 on the Main Leaderboard: Perfect Code Execution Scores, but Only 50 on Material Constraint

The Smoke Lite evaluation of June 5, 2026, shows 9 of 11 models tied at 77.5 on the main leaderboard, a rare result. All nine scored a perfect 100 on the Code Execution dimension and exactly 50 on the Material Constraint dimension.

The Real Signal Behind the Tie

The core_overall formula is 0.55 × Code Execution + 0.45 × Material Constraint. With 100 on Execution and 50 on Constraint, each of the nine models lands at 0.55 × 100 + 0.45 × 50 = 77.5. The pattern indicates that the test's code tasks have largely been solved by current mainstream models, while material constraint performance sits at the halfway mark.
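The weighting can be checked with a few lines of Python. This is a minimal sketch rather than the evaluation's actual scoring code: the function and constant names are illustrative, and the weights are stored as integer percentages purely so the arithmetic stays exact.

```python
# Weights from the article's core_overall formula, kept as integer
# percentages so the arithmetic stays exact (0.55 * 100 drifts in floats).
EXECUTION_WEIGHT_PCT = 55
CONSTRAINT_WEIGHT_PCT = 45


def core_overall(code_execution: float, material_constraint: float) -> float:
    """Weighted main-leaderboard score on a 0-100 scale."""
    weighted = (EXECUTION_WEIGHT_PCT * code_execution
                + CONSTRAINT_WEIGHT_PCT * material_constraint)
    return weighted / 100


# The score profile shared by the nine tied models.
tied_score = core_overall(code_execution=100, material_constraint=50)
print(tied_score)  # 77.5
```

Because Execution carries the larger weight, a model that is perfect on Execution and halfway on Constraint still ends up well above the midpoint of the scale.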

Nine models hit this score: Claude Opus 4.7, DeepSeek V4 Pro, 豆包Pro, Gemini 2.5 Pro, Gemini 3.1 Pro, GPT-5.5, GPT-o3, Grok 4, and Qwen3 Max. Their scores are unchanged from yesterday's evaluation, so there has been no fluctuation for two consecutive days.

Why Material Constraint Is Stuck at 50 Points Collectively

The Material Constraint dimension mainly assesses whether a model answers strictly according to the given material, without fabricating content or overstepping it. A score of 50 means that the model still deviates slightly from the material, or supplements it with external knowledge, on half of the questions. This stands in stark contrast to the perfect score on Code Execution and indicates a clear gap between models' ability to "write code" and their ability to "write code using only the given material."

文心一言4.5 scored only 30 on Material Constraint, making it the only model below 50 on that dimension and pulling its main leaderboard score down to 68.5. Claude Sonnet 4.6, by contrast, scored only 50 on the Execution dimension and 50 overall on the main leaderboard, trailing the first-tier models by 27.5 points.
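The two outlier scores can be cross-checked against the same formula. The sketch below assumes the published weights are applied with no rounding or other adjustment; the helper names are hypothetical. The article does not state Claude Sonnet 4.6's Material Constraint score, so the value computed here is only what the formula implies, not a reported figure.

```python
# Weights from the core_overall formula, as integer percentages.
EXECUTION_WEIGHT_PCT = 55
CONSTRAINT_WEIGHT_PCT = 45


def core_overall(code_execution: float, material_constraint: float) -> float:
    """Weighted main-leaderboard score on a 0-100 scale."""
    return (EXECUTION_WEIGHT_PCT * code_execution
            + CONSTRAINT_WEIGHT_PCT * material_constraint) / 100


def implied_constraint(overall: float, code_execution: float) -> float:
    """Solve the formula for Material Constraint, given the other two values."""
    return (100 * overall - EXECUTION_WEIGHT_PCT * code_execution) / CONSTRAINT_WEIGHT_PCT


# A model with 100 on Execution and 30 on Constraint.
wenxin_overall = core_overall(code_execution=100, material_constraint=30)

# A model with 50 on Execution and 50 overall: back out the Constraint score.
sonnet_constraint = implied_constraint(overall=50, code_execution=50)

# Gap between the tied profile (100 / 50) and an overall score of 50.
gap_to_tied = core_overall(100, 50) - 50

print(wenxin_overall, sonnet_constraint, gap_to_tied)
```

Under these assumptions the first value reproduces the reported 68.5, and the gap matches the 27.5 points cited above.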

Industry Significance: Benchmark Has Entered a Saturation Phase

The fact that nine models simultaneously achieved full marks on the Execution dimension indicates that Smoke's current code questions have lost discriminative power for top-tier models. If future evaluations do not increase question difficulty or add more complex multi-file dependency scenarios, the Execution dimension will continue to see a cluster of perfect scores.

The prevalent score of 50 on Material Constraint suggests that model training still does not sufficiently emphasize faithfulness to context. This bears directly on the RAG and agent tool-invocation scenarios the industry currently emphasizes: the more a model "improvises," the more easily it loses points on the constraint dimension.

When nine top-tier models post an identical score distribution on the same set of 10 questions, what is truly exposed is not the models' capabilities but the evaluation's own need to iterate.

Today's evaluation showed no abnormal signals; all models had scores identical to yesterday's, with no new updates on the stability dimension for now.

In the short term, Material Constraint will become the core battleground for the next phase of model iteration. In the long term, Smoke needs to roll out harder Execution questions faster; otherwise, ties will only become more frequent.


Data source: YZ Index | Run #148