The YZ Index Smoke quick test of 11 models, run from June 17 to 21, 2026, shows that Grok 4 rose from 80.2 points on the first day to 100 points on the last day. That trend increase of 19.8 points is the largest of any model this week.
Steady Risers Are Concentrated Among Mid-to-Low-Base Models
DeepSeek V4 Pro averaged 98.7 points for the week, starting at 97.3 on the first day and ending at 100 on the last day. Its trend increase of 2.7 points and volatility of only 2.7 points make it the most balanced performer. GPT-o3 averaged 97.9 points with a trend increase of 2.3 points, and it also reached 100 points on the last day. 豆包 Pro (Doubao Pro) averaged 96.7 points with a trend increase of 1.2 points and a final score of 96.63 points. Qwen3 Max rose from 73.25 points to 80.82 points, a trend increase of 7.6 points, with an average of 87.7 points. 文心一言 4.5 (ERNIE Bot 4.5) rose from 71.33 points to 88.28 points, a trend increase of 17 points, with an average of 84.3 points. All of these models kept a positive trend in the 7-day 10-question quick test, with no significant decline.
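The trend figures quoted so far are consistent with a simple last-day-minus-first-day difference, rounded to one decimal place. The Python sketch below reproduces that arithmetic. The function name `trend` and the half-up rounding are assumptions for illustration, not the YZ Index's published method, and only the first-day and last-day scores stated in the text are used.

```python
from decimal import Decimal, ROUND_HALF_UP


def trend(first_day: str, last_day: str) -> Decimal:
    """Last-day score minus first-day score, rounded half-up to 0.1 (assumed rounding)."""
    diff = Decimal(last_day) - Decimal(first_day)
    return diff.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# First-day and last-day scores as quoted in the article.
scores = {
    "Grok 4": ("80.2", "100"),
    "DeepSeek V4 Pro": ("97.3", "100"),
    "Qwen3 Max": ("73.25", "80.82"),
    "文心一言 4.5": ("71.33", "88.28"),
}

trends = {model: trend(first, last) for model, (first, last) in scores.items()}
for model, delta in trends.items():
    print(f"{model}: {delta:+} points")
```

Scores are passed as decimal strings rather than floats so that a difference such as 16.95 rounds predictably instead of depending on binary floating-point representation.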
Flat Models Are Mostly High-Ranking Claude Models
Claude Opus 4.7 averaged 99.4 points, starting at 100 on the first day and ending at 99.28 on the last day. With a slight trend decrease of 0.7 points and volatility of 2.3 points, it held the most stable high level. Claude Sonnet 4.6 averaged 96.7 points with a slight trend decrease of 0.8 points. Gemini 2.5 Pro averaged 92.3 points with a slight trend increase of 0.5 points. GPT-5.5 averaged 92 points with a slight trend decrease of 0.8 points. The score ranges of these models have narrowed, and none has yet shown a sustained breakout.
Risk Is Concentrated in High-Volatility Models
Gemini 2.5 Pro has a volatility of 28.3 points, Gemini 3.1 Pro 29 points, GPT-5.5 26.3 points, Qwen3 Max 26.8 points, and 文心一言 4.5 26.4 points. The YZ Index stability dimension is computed as max(0, 100 - stddev × 2), so a high standard deviation translates directly into a low stability score and indicates inconsistent scores on similar tasks. Grok 4 has a volatility of 19.8 points: its trend is strong, but its day-to-day score jumps are also significant.
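To make the stability formula concrete, here is a minimal Python sketch of max(0, 100 - stddev × 2). The function name `stability_score` is hypothetical, the use of population standard deviation is an assumption (the source does not say which variant the YZ Index uses, or whether the volatility figures above are standard deviations), and the daily scores are invented for illustration rather than taken from the Smoke data.

```python
import statistics


def stability_score(daily_scores: list[float]) -> float:
    """Stability dimension: max(0, 100 - stddev * 2), per the formula quoted in the article."""
    # Assumption: population standard deviation; the source does not specify the variant.
    stddev = statistics.pstdev(daily_scores)
    return max(0.0, 100.0 - stddev * 2)


# Hypothetical daily scores, for illustration only (not YZ Index data).
steady = [98.0, 99.0, 100.0, 99.0, 98.0]
jumpy = [70.0, 95.0, 72.0, 98.0, 75.0]

print(f"steady: {stability_score(steady):.1f}")
print(f"jumpy:  {stability_score(jumpy):.1f}")
```

Because the standard deviation is doubled, every extra point of spread costs two stability points, and a standard deviation of 50 or more floors the score at 0.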
Integrity Rating and Availability Signals
This week's Smoke data recorded no changes in integrity ratings, and all models remained operational. Stability and availability serve only as operational signals and are not included in the main ranking's code execution and material constraint dimensions.
Prediction for Next Week's Full Evaluation
DeepSeek V4 Pro and Claude Opus 4.7 combine high averages with low volatility, so they are likely to hold top positions in next week's Full evaluation. If Grok 4 sustains the upward trend behind this week's 19.8-point gain, it could enter the top three, but it remains to be seen whether its 19.8-point volatility narrows. The high volatility of the Gemini series and GPT-5.5 may continue to drag down their stability scores and hurt their performance in the engineering judgment sub-ranking. Qwen3 Max and 文心一言 4.5 still have room to rise, but because both start from low bases, the sustainability of their gains needs to be verified on a larger sample.
High-volatility models have already exposed consistency weaknesses in the Smoke phase, and next week's Full evaluation is likely to amplify this gap.
Data source: YZ Index | Run #190
© 2026 Winzheng.com 赢政天下 | Please credit the source and link to the original article when reposting.