Gemini 2.5 Pro Plunges 35.6 Points on Main Leaderboard, DeepSeek V4 Pro Tops Smoke Benchmark

May 26, 2026 | Winzheng Index

Tags: DeepSeek V4 Pro, Material Constraints, Gemini 2.5 Pro, Smoke Lightweight Evaluation, Model Fluctuations

Smoke lightweight evaluation data from early this morning exposed weaknesses in Gemini 2.5 Pro: its main score dropped 35.6 points to 61.03. Its execution score fell from 100 to 50, its material compliance score dropped 18 points, and its integrity rating moved from pass to warn. A drop of this size points to a systematic failure in execution capability rather than a minor fluctuation.

Top Two Models Separated by Only 0.23 Points

DeepSeek V4 Pro takes first place with 95.28 points, scoring a perfect 100 on code execution and 89.5 on material compliance (warn). GPT-o3 follows at 95.05 points, also with full marks on execution and 89 on compliance (warn). The overall gap is 0.23 points, and the only difference in the dimensions reported here is 0.5 points of material compliance. This suggests that top models have largely hit the ceiling on code execution, and that the real differentiator is how strictly a model adheres to the given materials.
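The gap figures follow directly from the published scores. As a minimal sketch, the Python snippet below recomputes them; the dictionary layout and the `gap` helper are illustrative assumptions, not part of the Smoke benchmark tooling.

```python
# Scores as reported in this article for the top two models.
# The data layout and helper name are hypothetical, chosen for illustration.
SCORES = {
    "DeepSeek V4 Pro": {"main": 95.28, "execution": 100.0, "compliance": 89.5},
    "GPT-o3": {"main": 95.05, "execution": 100.0, "compliance": 89.0},
}


def gap(metric: str, leader: str = "DeepSeek V4 Pro", runner_up: str = "GPT-o3") -> float:
    """Return leader minus runner-up on one metric, rounded to 2 decimals."""
    return round(SCORES[leader][metric] - SCORES[runner_up][metric], 2)


if __name__ == "__main__":
    for metric in ("main", "execution", "compliance"):
        print(f"{metric}: {gap(metric):+.2f}")
```

Rounding to two decimals matches the precision of the published scores and avoids floating-point noise in the subtraction.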

Collective Decline in Material Compliance Becomes the Theme

The most notable anomaly today is the simultaneous sharp decline in material compliance scores across multiple models. Claude Sonnet 4.6's compliance score plummeted 22 points, GPT-5.5 dropped 15 points, and Grok 4 fell 15.8 points. Doubao Pro declined 13.3 points from yesterday's high. All of these models kept perfect execution scores of 100 and lost points only on material compliance, which suggests that newly added "strict material citation" questions in the test set significantly hurt model performance.

In contrast, Wenxin Yiyan 4.5 rose against the trend by 27.3 points, with its execution score recovering from 50 to 100, which may indicate targeted optimizations for code tasks. However, its integrity rating changed from pass to warn, suggesting new consistency issues.
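To put the compliance declines quoted in this article side by side, the sketch below ranks them and computes their average. The deltas are the figures reported here (including Gemini 2.5 Pro's 18-point drop); the 10-point alert threshold and all names are assumptions made for illustration, not something Smoke publishes.

```python
from statistics import mean

# Day-over-day changes in material compliance, as quoted in this article.
COMPLIANCE_DELTAS = {
    "Claude Sonnet 4.6": -22.0,
    "Gemini 2.5 Pro": -18.0,
    "Grok 4": -15.8,
    "GPT-5.5": -15.0,
    "Doubao Pro": -13.3,
}

# Hypothetical cutoff for calling a one-day move "sharp"; not defined by the benchmark.
ALERT_THRESHOLD = 10.0


def sharp_drops(deltas, threshold=ALERT_THRESHOLD):
    """Return (model, delta) pairs that fell by more than `threshold`, largest drop first."""
    flagged = [(model, delta) for model, delta in deltas.items() if delta < -threshold]
    return sorted(flagged, key=lambda pair: pair[1])


if __name__ == "__main__":
    for model, delta in sharp_drops(COMPLIANCE_DELTAS):
        print(f"{model}: {delta:+.1f}")
    print(f"average change: {mean(COMPLIANCE_DELTAS.values()):+.2f}")
```

With the assumed threshold, all five models are flagged, which is consistent with the article's reading of a collective decline rather than a single-model problem.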

Possible Drivers Behind the Anomalies

Gemini 2.5 Pro's cliff-like drop is highly unusual. The halving of its execution score, combined with the simultaneous decline in material compliance, suggests that a model version update early this morning may have introduced a new alignment strategy, making the model overly conservative or causing it to refuse outright in scenarios that require strict adherence to given materials. Similar situations have occurred in the industry during previous Claude series updates, typically taking 2-3 days to recover.

The concurrent decline in material compliance across multiple models may also be related to the addition of more "long-context + precise citation" mixed questions to Smoke's question bank today. Such questions impose higher requirements on models' grounding capabilities, exposing the real shortcomings of previously high-scoring models.

Perfect execution has become standard; material compliance is the true battlefield of the next stage.

Today's rankings show that DeepSeek V4 Pro and GPT-o3 have achieved material compliance scores in the 89-point range, while other models remain stuck between 74 and 79 points. The gap continues to widen.

If Gemini 2.5 Pro does not recover within the next 48 hours, its credibility in the developer community is likely to erode further. Meanwhile, DeepSeek V4 Pro, with its consistently perfect execution scores, has already established a clear advantage in engineering deployment scenarios.

Data source: YZ Index | Run #132
