Data from this morning's Smoke lightweight evaluation exposed weaknesses in Gemini 2.5 Pro, whose main score dropped to just 61.03. Its execution dimension fell from 100 to 50, material compliance dropped 18 points, and its integrity rating changed from pass to warn. This is not a minor fluctuation but a systematic failure in execution capability.
Top Two Models Separated by Only 0.23 Points
DeepSeek V4 Pro takes first place with 95.28 points, scoring a perfect 100 on code execution and 89.5 on material compliance (warn). GPT-o3 follows closely at 95.05 points, also with full marks on execution and 89 on material compliance (warn). The overall gap between them is 0.23 points, and the only difference at the dimension level is 0.5 points in material compliance. This suggests that top models have largely hit a ceiling on code execution, and that the real differentiator is how strictly a model adheres to the materials it is given.
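As a quick check on the arithmetic above, the two gaps can be recomputed from the quoted figures. This is only a minimal sketch: the dictionary layout and the `gap` helper are illustrative assumptions, not part of the Smoke evaluation or YZ Index tooling, and it does not attempt to reproduce how the main score is weighted.

```python
from decimal import Decimal

# Scores quoted in the paragraph above. The data layout is an assumption
# made for illustration; it is not the evaluation's own format.
scores = {
    "DeepSeek V4 Pro": {"main": Decimal("95.28"), "compliance": Decimal("89.5")},
    "GPT-o3": {"main": Decimal("95.05"), "compliance": Decimal("89")},
}


def gap(metric):
    """Return the DeepSeek V4 Pro minus GPT-o3 difference for one metric."""
    return scores["DeepSeek V4 Pro"][metric] - scores["GPT-o3"][metric]


# Decimal avoids binary floating-point noise in the subtraction.
main_gap = gap("main")
compliance_gap = gap("compliance")

print(f"main score gap: {main_gap}")
print(f"material compliance gap: {compliance_gap}")
```

Decimal arithmetic is used here so that the differences come out exactly as reported (0.23 and 0.5) rather than as floating-point approximations.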
Collective Decline in Material Compliance Becomes the Theme
The most notable anomaly today is a simultaneous sharp decline in material compliance scores across multiple models. Claude Sonnet 4.6's compliance score fell 22 points, GPT-5.5 dropped 15 points, and Grok 4 fell 15.8 points. Doubao Pro declined 13.3 points from yesterday's high. All of these models kept perfect execution scores of 100, so the losses came from material compliance, which suggests that newly added "strict material citation" questions in the test set weighed heavily on model performance.
In contrast, Wenxin Yiyan 4.5 bucked the trend, rising 27.3 points as its execution score recovered from 50 to 100, which may reflect targeted optimization for code tasks. However, its integrity rating changed from pass to warn, pointing to new consistency issues.
Possible Drivers Behind the Anomalies
Gemini 2.5 Pro's steep drop is highly unusual. Its execution score halved while its material compliance score fell at the same time, which suggests that a model version update early this morning may have introduced a new alignment strategy, making the model overly conservative or prone to outright refusals in scenarios that require strict adherence to given materials. Similar regressions have been seen in the industry after previous Claude series updates, typically taking 2-3 days to recover.
The concurrent decline in material compliance across multiple models may also be related to the addition of more "long-context + precise citation" mixed questions to Smoke's question bank today. Such questions impose higher requirements on models' grounding capabilities, exposing the real shortcomings of previously high-scoring models.
Perfect execution has become standard; material compliance is the true battlefield of the next stage.
Today's rankings show that DeepSeek V4 Pro and GPT-o3 have achieved material compliance scores in the 89-point range, while other models remain stuck between 74 and 79 points. The gap continues to widen.
If Gemini 2.5 Pro does not recover within the next 48 hours, its credibility in the developer community is likely to erode further. Meanwhile, DeepSeek V4 Pro, with its consistently perfect execution scores, has already established a clear advantage in engineering deployment scenarios.
Data source: YZ Index | Run #132
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.