Claude Sonnet 4.6 Smoke Review Main Score Plummets 25.9 Points, Code Execution Drops from 100 to 50

Jun 28, 2026 | Winzheng Index

Claude Sonnet 4.6 Code Execution Smoke Test: Single-Day Fluctuation Performance Analysis

In the June 2026 Smoke review of the YZ Index, Claude Sonnet 4.6's main score dropped from 96.45 to 70.52, code execution fell from 100.00 to 50.00, and material constraint rose from 92.10 to 95.60.

Sharp Fluctuation Driven by a Single Dimension

The 25.9-point decline in the main score was driven almost entirely by the code execution dimension, which fell from 100.00 the previous day to 50.00, a drop of 50 points. By contrast, material constraint rose 3.5 points from 92.10 to 95.60, engineering judgment held at 100.00, and task expression rose 3.3 points from 84.20 to 87.50. Of the two core main-score dimensions, only code execution fell off a cliff.
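The arithmetic above can be checked with a short sketch. The dimension scores and main scores are the ones reported in this review; the 0.55 / 0.45 weights are an assumption, inferred because they reproduce both published main scores to within rounding, and are not a formula stated by the index.

```python
# Dimension scores reported in the review: (previous day, current day).
scores = {
    "code_execution": (100.00, 50.00),
    "material_constraint": (92.10, 95.60),
    "engineering_judgment": (100.00, 100.00),
    "task_expression": (84.20, 87.50),
}
main_score = (96.45, 70.52)

# Day-over-day change for each dimension.
deltas = {name: round(today - prev, 2) for name, (prev, today) in scores.items()}

# ASSUMPTION: weights inferred from the published numbers, not documented by the index.
WEIGHTS = {"code_execution": 0.55, "material_constraint": 0.45}


def weighted_main(day: int) -> float:
    """Rebuild the main score for day 0 (previous) or day 1 (current)."""
    return sum(weight * scores[name][day] for name, weight in WEIGHTS.items())


reconstructed = (weighted_main(0), weighted_main(1))

# How much each core dimension moved the main score under the assumed weights.
contributions = {name: round(weight * deltas[name], 3) for name, weight in WEIGHTS.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
print("reconstructed main scores:", [round(value, 2) for value in reconstructed])
print("contributions to main score change:", contributions)
```

Under these inferred weights, code execution accounts for about -27.5 points of the change and material constraint for about +1.6, which nets out to the reported drop of roughly 25.9 points.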

Characteristics of the Smoke Review and the Impact of the Question Lottery

The Smoke review uses only 10 questions per day, 2 per dimension, so daily scores naturally have a large standard deviation. With so few questions, a single failed question can shift a dimension's score sharply. This time, the code execution draw may have included questions sensitive to specific programming scenarios, costing the model 50 points in a single day. The material constraint dimension rose slightly over the same period, suggesting no systematic issue in the model's fundamental ability to follow constraints.
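To make the sampling effect concrete, here is a minimal sketch of how coarse a 2-question dimension score is. It assumes each question is graded pass/fail, carries equal weight, and is passed independently with the same probability; none of this is confirmed by the index (the non-round scores in other dimensions suggest partial credit exists), and the 0.9 pass rate is purely illustrative.

```python
from math import comb

QUESTIONS_PER_DIMENSION = 2  # stated in the review


def score_distribution(p_pass: float, n: int = QUESTIONS_PER_DIMENSION) -> dict[float, float]:
    """Map each possible dimension score to its probability.

    ASSUMPTION: binary grading, equal weights, independent questions.
    """
    return {
        100.0 * k / n: comb(n, k) * p_pass**k * (1 - p_pass) ** (n - k)
        for k in range(n + 1)
    }


# Hypothetical model that passes 90% of code execution questions.
dist = score_distribution(0.9)
p_at_most_50 = dist[0.0] + dist[50.0]

for score, probability in sorted(dist.items()):
    print(f"score {score:5.1f}: probability {probability:.2f}")
print(f"P(score <= 50) = {p_at_most_50:.2f}")
```

Under these assumptions, even a model that passes 90% of questions would land at 50 or below on roughly 19% of days, so an isolated 50-point reading carries little information by itself.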

Real Degradation or Random Fluctuation?

With only a single day of data, this is more likely a random fluctuation caused by the question lottery. The engineering judgment dimension held at 100.00 for two consecutive days, the task expression dimension rose slightly, and the integrity rating remained at pass, so there was no synchronized decline across dimensions. A real model degradation typically shows up as simultaneous deterioration in multiple dimensions, not as an isolated 50-point drop in one.
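The reasoning in this section can be written as a simple heuristic: count how many dimensions declined day over day. The sketch below is illustrative only; the function name, the labels it returns, and the cutoff of two declining dimensions are assumptions standing in for "multiple dimensions", not a rule published by the index.

```python
def classify_change(
    prev: dict[str, float],
    today: dict[str, float],
    min_declining: int = 2,  # ASSUMPTION: "multiple dimensions" read as two or more
) -> str:
    """Label a day-over-day change by how many dimensions declined."""
    declining = [name for name in prev if today[name] < prev[name]]
    if len(declining) >= min_declining:
        return "possible degradation"
    if declining:
        return "likely sampling noise"
    return "no decline"


# Scores reported in the review.
previous_day = {
    "code_execution": 100.00,
    "material_constraint": 92.10,
    "engineering_judgment": 100.00,
    "task_expression": 84.20,
}
current_day = {
    "code_execution": 50.00,
    "material_constraint": 95.60,
    "engineering_judgment": 100.00,
    "task_expression": 87.50,
}

verdict = classify_change(previous_day, current_day)
print(verdict)
```

Only code execution declined in the reported data, so this heuristic places the day in the "likely sampling noise" bucket. It ignores the size of each decline, which a more careful check would also weigh.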

Should We Continue to Monitor?

We recommend placing Claude Sonnet 4.6 on the watchlist for tomorrow's Smoke review. If the code execution dimension stays below 70 points for two consecutive days, the result should be combined with formal evaluation data to determine whether a version-level change has occurred. For now, a single-day 50-point drop alone is not enough to conclude that the model's capabilities have systematically degraded.
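The watchlist rule can be sketched as a small check over the recent code execution scores. The 70-point threshold and the two-day window come from the recommendation above; the function name and the example histories are hypothetical, and the sketch reads "two consecutive days" as the two most recent runs both falling below the threshold.

```python
def needs_formal_review(
    code_exec_history: list[float],
    threshold: float = 70.0,  # from the recommendation
    days: int = 2,            # from the recommendation
) -> bool:
    """Return True when the last `days` scores are all below `threshold`."""
    recent = code_exec_history[-days:]
    return len(recent) == days and all(score < threshold for score in recent)


# Reported so far: 100.00 on the previous day, 50.00 on the current day.
print(needs_formal_review([100.00, 50.00]))          # one low day only
# Hypothetical follow-ups for tomorrow's Smoke review.
print(needs_formal_review([100.00, 50.00, 60.00]))   # second low day in a row
print(needs_formal_review([100.00, 50.00, 100.00]))  # score recovers
```

With the data reported so far the check does not trigger, which matches the conclusion that one low day is not enough to escalate.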

The 50-point halving of the code execution score is more likely a result of the question lottery than of the model suddenly failing.

Data source: YZ Index | Run #201

Related Reviews

Winzheng Index Claude Sonnet 4.6 Code Execution Plunges from 100 to 50, Main Score Drops 6.9 Points

Winzheng Index Gemini 2.5 Pro Plunges 28 Points on Main Leaderboard, Code Execution Halved from 100

Winzheng Index 文心一言4.5 Smoke Main Ranking Plunges 22.2 Points, Code Execution Halved to 50 Points

Winzheng Index Gemini 2.5 Pro Code Execution Plunges 45 Points, Smoke Main Score Drops 19.3 in One Day