YZ Index Smoke Weekly: ERNIE Bot 4.5 Drops 37.2 Points, Multiple Models Fluctuate Over 28

Jun 28, 2026 - Winzheng Index

ERNIE Bot 4.5, Claude Sonnet 4.6, Smoke Test, Stability Analysis, Integrity Rating

In the YZ Index Smoke tests of 11 models from June 23 to 28, 2026, ERNIE Bot 4.5 fell from 98.74 on Day 1 to 61.52 on Day 7. Its trend of -37.2 and its fluctuation of 37.2 were the largest of any model tested, and its average was only 82.1.

Most Models Saw Collective Decline on the Final Day

Claude Sonnet 4.6 dropped from 94.87 on Day 1 to 70.52 on Day 7 (trend -24.4, average 87.5, fluctuation 28.4). Claude Opus 4.7 dropped from 100 to 71.47 (trend -28.5, average 89.8, fluctuation 28.5). Gemini 2.5 Pro dropped from 96.18 to 81.41 (trend -14.8, average 90.9, fluctuation 22.6). Gemini 3.1 Pro dropped from 100 to 91.21 (trend -8.8, average 90.5, fluctuation 30.7). GPT-5.5 dropped from 96.18 to 84.18 (trend -12.0, average 92.8, fluctuation 14.7). GPT-o3 dropped from 96.81 to 82.53 (trend -14.3, average 91.6, fluctuation 17.0). Grok 4 dropped from 100 to 82.97 (trend -17.0, average 93.3, fluctuation 18.5). DeepSeek V4 Pro dropped from 99.37 to 87.35 (trend -12.0, average 94.2, fluctuation 17.8). Qwen3 Max dropped from 74 to 69.94 (trend -4.1, average 81.0, fluctuation 28.1). Only Doubao Pro rose, edging up from 98.07 to 98.61 (trend 0.5, average 95.8, fluctuation 16.6).
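The trend figures for all 11 models, including ERNIE Bot 4.5 from the opening paragraph, can be rechecked from the Day 1 and Day 7 scores alone. The sketch below assumes that trend means the Day 7 score minus the Day 1 score; YZ Index does not state the formula here, but this definition agrees with the quoted figures to within rounding. The names `SCORES`, `trend`, and `trends` are illustrative.

```python
# (Day 1, Day 7) scores exactly as quoted in this article.
SCORES = {
    "ERNIE Bot 4.5": (98.74, 61.52),
    "Claude Sonnet 4.6": (94.87, 70.52),
    "Claude Opus 4.7": (100.0, 71.47),
    "Gemini 2.5 Pro": (96.18, 81.41),
    "Gemini 3.1 Pro": (100.0, 91.21),
    "GPT-5.5": (96.18, 84.18),
    "GPT-o3": (96.81, 82.53),
    "Grok 4": (100.0, 82.97),
    "DeepSeek V4 Pro": (99.37, 87.35),
    "Qwen3 Max": (74.0, 69.94),
    "Doubao Pro": (98.07, 98.61),
}


def trend(day1, day7):
    """Assumed definition: final-day score minus first-day score."""
    return day7 - day1


def trends(scores):
    """Return {model: trend} rounded to two decimals."""
    return {name: round(trend(d1, d7), 2) for name, (d1, d7) in scores.items()}


if __name__ == "__main__":
    # Print models from largest decline to largest gain.
    for name, t in sorted(trends(SCORES).items(), key=lambda kv: kv[1]):
        print(f"{name:<18} {t:+.2f}")
```

Sorting by trend puts ERNIE Bot 4.5 first and Doubao Pro last, the only model with a positive value.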

High-Fluctuation Models Require Close Attention

Five models exceeded 28 points of fluctuation: ERNIE Bot 4.5 (37.2), Gemini 3.1 Pro (30.7), Claude Opus 4.7 (28.5), Claude Sonnet 4.6 (28.4), and Qwen3 Max (28.1). Score swings this large over 7 days indicate that these models answer similar questions very differently from one run to the next. Smoke uses only 10 questions per day, so the sample is small, but across the 7 days the final-day score falls below the model's own average for 9 of the 11 models, which suggests that consistency declined during continuous testing.
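Both observations in this section, which models exceed 28 points of fluctuation and how many finish below their own average, can be checked directly against the published figures. The sketch below is a minimal check using only numbers quoted in this article; the 28-point cutoff comes from the text, while the names `STATS`, `high_fluctuation`, and `below_average_on_final_day` are illustrative. How YZ Index computes fluctuation is not stated, so the values are taken as given.

```python
# (average, fluctuation, Day 7 score) exactly as quoted in this article.
STATS = {
    "ERNIE Bot 4.5": (82.1, 37.2, 61.52),
    "Claude Sonnet 4.6": (87.5, 28.4, 70.52),
    "Claude Opus 4.7": (89.8, 28.5, 71.47),
    "Gemini 2.5 Pro": (90.9, 22.6, 81.41),
    "Gemini 3.1 Pro": (90.5, 30.7, 91.21),
    "GPT-5.5": (92.8, 14.7, 84.18),
    "GPT-o3": (91.6, 17.0, 82.53),
    "Grok 4": (93.3, 18.5, 82.97),
    "DeepSeek V4 Pro": (94.2, 17.8, 87.35),
    "Qwen3 Max": (81.0, 28.1, 69.94),
    "Doubao Pro": (95.8, 16.6, 98.61),
}

THRESHOLD = 28.0  # cutoff used in the article


def high_fluctuation(stats, threshold=THRESHOLD):
    """Models whose fluctuation exceeds the threshold, largest first."""
    flagged = [name for name, (_, fluct, _) in stats.items() if fluct > threshold]
    return sorted(flagged, key=lambda name: stats[name][1], reverse=True)


def below_average_on_final_day(stats):
    """Models whose Day 7 score is lower than their 7-day average."""
    return [name for name, (avg, _, day7) in stats.items() if day7 < avg]


if __name__ == "__main__":
    print("High fluctuation:", high_fluctuation(STATS))
    late = below_average_on_final_day(STATS)
    print(f"Final day below average: {len(late)} of {len(STATS)}")
```

The two exceptions to the final-day pattern are Gemini 3.1 Pro and Doubao Pro, whose Day 7 scores sit above their averages.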

Integrity Rating Changes Send Signals

ERNIE Bot 4.5's integrity rating changed from warn to pass, and Qwen3 Max's changed from fail to warn and then to pass. Both ratings improved, yet both score trends remained negative. The integrity rating serves only as a threshold and does not add points directly. Scores still dropped significantly after the ratings improved, which suggests that in this run capability fluctuations moved independently of the integrity dimension.
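One way to picture a rating that acts as a threshold without adding points is a gate that decides whether a result is listed, while the score itself passes through unchanged. The sketch below is hypothetical and not YZ Index's implementation: which ratings clear the gate is not stated in the article, so `ACCEPTED_RATINGS` is an assumption, as are the names `SmokeResult`, `clears_threshold`, and `leaderboard`. Pairing each model's Day 7 score with its last reported rating is also an assumption made for illustration.

```python
from dataclasses import dataclass

# Assumption: the article does not say which ratings clear the threshold.
ACCEPTED_RATINGS = frozenset({"pass", "warn"})


@dataclass(frozen=True)
class SmokeResult:
    model: str
    score: float     # capability score, never adjusted by the rating
    integrity: str   # "pass", "warn", or "fail"


def clears_threshold(result, accepted=ACCEPTED_RATINGS):
    """The rating only answers yes or no; it carries no points."""
    return result.integrity in accepted


def leaderboard(results, accepted=ACCEPTED_RATINGS):
    """Drop results that fail the gate, then rank the rest by score alone."""
    eligible = [r for r in results if clears_threshold(r, accepted)]
    return sorted(eligible, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    results = [
        SmokeResult("ERNIE Bot 4.5", 61.52, "pass"),
        SmokeResult("Qwen3 Max", 69.94, "pass"),
    ]
    for r in leaderboard(results):
        print(r.model, r.score)
```

Under this design a move from warn to pass changes nothing about the ranking, which is consistent with ratings improving while scores fall.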

Possible Trends in Next Week's Full Evaluation

The 7-day Smoke trend suggests that models with fluctuation above 28 points are likely to remain volatile in the Full evaluation. Doubao Pro, with an average of 95.8 and a trend of 0.5, may be the only model able to hold a high level. Averages for the Claude, Gemini, and GPT series range from 87.5 to 92.8, but final-day scores have already dragged down overall performance. Model iterations are typically accompanied by short-term declines in stability, and the current data is consistent with that pattern.

Seven consecutive days of Smoke data indicate that AI model stability has shifted from an auxiliary indicator to a decisive constraint.

If next week's Full evaluation uses the same question distribution, the final rankings of high-fluctuation models may shift significantly, while Doubao Pro, with the highest average (95.8) and comparatively low fluctuation (16.6), has a chance to keep its lead.

Data source: YZ Index | Run #201
