9 Models Tie at 77.5 on Main Leaderboard: Perfect Code Execution, but Only 50 on Material Constraint

Jun 5, 2026 | Winzheng Index

Tags: Code Execution, Material Constraint, Claude Opus 4.7, Smoke Test, Model Saturation

The Smoke Lite evaluation on June 5, 2026 shows 9 of 11 models tied at 77.5 on the main leaderboard. All nine scored a perfect 100 on the Code Execution dimension and exactly 50 on the Material Constraint dimension.

The Real Signal Behind the Tie

The core_overall formula is 0.55 × Code Execution + 0.45 × Material Constraint, so a 100 and a 50 yield 55 + 22.5 = 77.5. The nine models maxed out the Execution dimension but earned only half the available points on the Constraint dimension. This indicates that the code tasks in the test are largely solved by current mainstream models, while material constraint performance remains at the halfway mark.
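The weighting above can be checked with a few lines of Python. This is a minimal sketch: the weights 0.55 and 0.45 come from the article, while the function name and constant names are illustrative and not taken from the Smoke codebase.

```python
# Weights as stated in the article; all names here are illustrative.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def core_overall(code_execution: float, material_constraint: float) -> float:
    """Weighted main-leaderboard score from two dimension scores (0-100)."""
    return (EXECUTION_WEIGHT * code_execution
            + CONSTRAINT_WEIGHT * material_constraint)


# The nine tied models: 100 on Code Execution, 50 on Material Constraint.
tied_score = round(core_overall(100, 50), 2)
print(tied_score)  # 77.5
```

Because Code Execution carries the larger weight, a model that is perfect on Execution cannot fall below 55 overall, which is part of why the top of the leaderboard is so compressed.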

Claude Opus 4.7, DeepSeek V4 Pro, Doubao Pro, Gemini 2.5 Pro, Gemini 3.1 Pro, GPT-5.5, GPT-o3, Grok 4, and Qwen3 Max all landed on this score. Each posted the same result in yesterday's evaluation, so there has been no fluctuation for two consecutive days.

Why Material Constraint Is Collectively Stuck at 50

The Material Constraint dimension mainly assesses whether a model answers strictly according to the given material, without fabricating content or going beyond it. A score of 50 means that on half of the questions the model still deviates slightly or supplements its answer with external knowledge. This stands in stark contrast to the perfect score on Code Execution and points to a clear gap between a model's ability to "write code" and its ability to "write code using only the given material."

ERNIE Bot 4.5 scored only 30 on Material Constraint, the only model below 50 on that dimension, which pulled its main leaderboard score down to 68.5. Claude Sonnet 4.6, by contrast, scored only 50 on the Execution dimension and 50 overall on the main leaderboard, trailing the first-tier models by 27.5 points.
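The article gives only one dimension score for each of the two outliers. Assuming the same core_overall formula applies to them with no other terms, the missing dimension can be recovered by inverting the formula. The helper names below are hypothetical, and the recovered values are implied by the arithmetic rather than reported in the source.

```python
# Weights as stated in the article; helper names are hypothetical.
EXECUTION_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def implied_execution(overall: float, material_constraint: float) -> float:
    """Solve the core_overall formula for the Code Execution score."""
    return (overall - CONSTRAINT_WEIGHT * material_constraint) / EXECUTION_WEIGHT


def implied_constraint(overall: float, code_execution: float) -> float:
    """Solve the core_overall formula for the Material Constraint score."""
    return (overall - EXECUTION_WEIGHT * code_execution) / CONSTRAINT_WEIGHT


# ERNIE Bot 4.5: overall 68.5 with Material Constraint 30.
ernie_execution = round(implied_execution(68.5, 30), 1)
# Claude Sonnet 4.6: overall 50 with Code Execution 50.
sonnet_constraint = round(implied_constraint(50, 50), 1)

print(ernie_execution)        # 100.0
print(sonnet_constraint)      # 50.0
print(round(77.5 - 50.0, 1))  # 27.5, gap between Sonnet 4.6 and the tied group
print(round(77.5 - 68.5, 1))  # 9.0, gap between ERNIE Bot 4.5 and the tied group
```

Under this assumption ERNIE Bot 4.5 also maxed out Code Execution, and Claude Sonnet 4.6 matched the tied group on Material Constraint, so each outlier differs from the nine-way tie on exactly one dimension.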

Industry Significance: Benchmark Has Entered a Saturation Phase

The fact that nine models simultaneously achieved full marks on the Execution dimension indicates that Smoke's current code questions have lost discriminative power for top-tier models. If future evaluations do not increase question difficulty or add more complex multi-file dependency scenarios, the Execution dimension will continue to see a cluster of perfect scores.

The widespread score of 50 on Material Constraint suggests that training for "faithfulness to context" is still insufficient. This is closely tied to the RAG and Agent tool-invocation scenarios the industry currently emphasizes: the more a model dares to "improvise," the more easily it loses points on the constraint dimension.

When nine top-tier models produce an identical score distribution on the same set of 10 questions, what is exposed is less the models' capabilities than the evaluation's own need for iteration.
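One rough way to quantify how little the leaderboard currently separates models is to count distinct scores and the share of model pairs that are tied. The sketch below uses the eleven main leaderboard scores reported in this article. The tied-pair rate is an illustrative measure chosen for this example, not a metric the Smoke evaluation is known to use.

```python
from collections import Counter
from math import comb

# Main leaderboard scores as reported in the article.
main_scores = {
    "Claude Opus 4.7": 77.5,
    "DeepSeek V4 Pro": 77.5,
    "Doubao Pro": 77.5,
    "Gemini 2.5 Pro": 77.5,
    "Gemini 3.1 Pro": 77.5,
    "GPT-5.5": 77.5,
    "GPT-o3": 77.5,
    "Grok 4": 77.5,
    "Qwen3 Max": 77.5,
    "ERNIE Bot 4.5": 68.5,
    "Claude Sonnet 4.6": 50.0,
}


def tie_summary(scores: dict[str, float]) -> dict[str, float]:
    """Summarize how many models share a score (illustrative measure)."""
    counts = Counter(scores.values())
    total_pairs = comb(len(scores), 2)
    # Each group of c models with the same score contributes C(c, 2) tied pairs.
    tied_pairs = sum(comb(c, 2) for c in counts.values())
    return {
        "distinct_scores": len(counts),
        "largest_tie": max(counts.values()),
        "tied_pair_rate": tied_pairs / total_pairs if total_pairs else 0.0,
    }


summary = tie_summary(main_scores)
print(summary["distinct_scores"])           # 3
print(summary["largest_tie"])               # 9
print(round(summary["tied_pair_rate"], 3))  # 0.655
```

With 36 of 55 model pairs tied, roughly two thirds of head-to-head comparisons on this run cannot rank one model above the other.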

Today's evaluation showed no abnormal signals. Every model's scores were identical to yesterday's, so there is nothing new to report on the stability dimension for now.

In the short term, Material Constraint will become the core battleground for the next phase of model iteration. In the long term, Smoke needs to roll out harder Execution questions more quickly; otherwise ties will only become more frequent.

Data source: YZ Index | Run #148
