Doubao Pro and Gemini 3.1 Pro tied at 88.54: 2026-07-05 Smoke Quick Test Data Brief

Jul 5, 2026 - Winzheng Index

YZ Index Smoke Quick Test AI Evaluation Model Ranking Data Brief

On July 5, 2026, the YZ Index Smoke Quick Test covered 11 models, with Doubao Pro and Gemini 3.1 Pro tying for first place at 88.54 points. Smoke is a daily 10-question quick test designed for observing short-term signals; its results are not equivalent to the conclusions of the Full weekly ranking.

This Smoke evaluation only covers two main ranking dimensions: code execution and material constraints. The main ranking formula is 0.55 × Code Execution + 0.45 × Material Constraints. Due to the small daily sample size, single-day scores are better used as monitoring signals rather than long-term conclusions about model capabilities.
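As a minimal sketch, the main ranking formula can be reproduced from the two sub-scores in the Daily Ranking table. The function name `main_score` and the half-up rounding to two decimals are assumptions made for illustration; the brief states only the weights, not the rounding rule.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the brief's main ranking formula.
CODE_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def main_score(code_execution: str, material_constraints: str) -> Decimal:
    """Weighted main score, rounded half-up to two decimals (rounding rule assumed)."""
    raw = (CODE_WEIGHT * Decimal(code_execution)
           + CONSTRAINT_WEIGHT * Decimal(material_constraints))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(main_score("97", "78.2"))    # Doubao Pro / Gemini 3.1 Pro -> 88.54
print(main_score("88.7", "64.3"))  # DeepSeek V4 Pro -> 77.72
```

Decimal arithmetic is used here so that borderline values such as 83.315 (Gemini 2.5 Pro) round the same way as the published 83.32, which binary floats do not guarantee.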

Daily Ranking

Rank	Model	Main Score	Code Execution	Material Constraints	Integrity
#1	Doubao Pro	88.54	97	78.2	pass
#2	Gemini 3.1 Pro	88.54	97	78.2	pass
#3	Gemini 2.5 Pro	83.32	87.5	78.2	pass
#4	Grok 4	81.44	75	89.3	warn
#5	Claude Sonnet 4.6	79.79	72	89.3	pass
#6	GPT-o3	79.79	72	89.3	pass
#7	DeepSeek V4 Pro	77.72	88.7	64.3	pass
#8	GPT-5.5	74.79	72	78.2	pass
#9	Claude Opus 4.7	70.6	55.3	89.3	pass
#10	Qwen3 Max	63.73	42.8	89.3	pass
#11	GLM-4.6	60.04	88.7	25	fail

Data Interpretation

In today's YZ Index Smoke Quick Test, Doubao Pro and Gemini 3.1 Pro tied for the top main score at 88.54, each with code execution at 97 and material constraints at 78.2: high code execution paired with moderate material constraints. Gemini 2.5 Pro scored 83.32, with code execution at 87.5 and material constraints at 78.2, also leaning toward code execution. Grok 4 scored 81.44, with code execution at 75 and material constraints at 89.3, a profile stronger on material constraints than on code execution.

Compared with the previous test, Claude Opus 4.7's main score dropped 24.7 points, with code execution down 41.7 points. Gemini 3.1 Pro's main score rose 18.1 points, with code execution up 25 points and material constraints up 9.7 points. Grok 4's main score fell 15.1 points, with code execution down 24.2 points and integrity changing from pass to warn. Gemini 2.5 Pro's main score dropped 13.7 points, with code execution down 12.5 points and material constraints down 15.1 points. GPT-o3's main score fell 12.6 points, with code execution down 25 points. With a small single-day sample, these fluctuations may stem from question sampling variance or may reflect real performance changes; subsequent runs under the same conditions are needed to tell the two apart.

DeepSeek V4 Pro's material constraints score dropped sharply, by 15.8 points, to 64.3, in contrast with its code execution score of 88.7. This signal also needs multiple retests to determine whether it is a random fluctuation.

Key Changes

Claude Opus 4.7: Main score down 24.7 points, code execution -41.7 points
Gemini 3.1 Pro: Main score up 18.1 points, code execution +25 points, material constraints +9.7 points
Grok 4: Main score down 15.1 points, code execution -24.2 points, integrity pass → warn
Gemini 2.5 Pro: Main score down 13.7 points, code execution -12.5 points, material constraints -15.1 points
GPT-o3: Main score down 12.6 points, code execution -25 points
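Because the main score is a fixed weighted sum, a change in the main score should decompose as 0.55 × (code execution change) + 0.45 × (material constraints change). The sketch below uses that relation to back out the material constraints change implied by a reported main-score and code-execution change. It assumes the previous run used the same weights; `implied_constraint_delta` is a hypothetical helper name, and since the published changes are rounded to one decimal, the result is only accurate to a couple of tenths of a point.

```python
CODE_WEIGHT = 0.55
CONSTRAINT_WEIGHT = 0.45


def implied_constraint_delta(main_delta: float, code_delta: float) -> float:
    """Solve main_delta = 0.55 * code_delta + 0.45 * x for x."""
    return (main_delta - CODE_WEIGHT * code_delta) / CONSTRAINT_WEIGHT


# Sanity check against a model whose material constraints change is reported:
# Gemini 3.1 Pro, main +18.1, code execution +25, reported constraints +9.7.
print(round(implied_constraint_delta(18.1, 25.0), 1))  # -> 9.7

# Claude Opus 4.7 reports no constraints change; the implied value is an
# estimate under the assumptions above, not a figure from the brief.
print(round(implied_constraint_delta(-24.7, -41.7), 1))
```

An implied value far from zero for a model where only code execution is listed would suggest that material constraints moved as well, even though the brief does not call it out.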

Signals to Monitor

DeepSeek V4 Pro: Material constraints plunged 15.8 points
GLM-4.6: Today's integrity rating is fail (based on today's Smoke data).

When reading Smoke briefs like this one, focus on two questions: first, whether a model shows the same type of weakness on multiple consecutive days; second, whether its integrity rating changes from pass to warn or fail. Large single-day fluctuations in code execution or material constraints scores may stem from question sampling, or they could be early signals of genuine degradation; subsequent runs are needed to verify which.
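The two monitoring questions above can be sketched as simple checks. Everything here is illustrative: the function names, the severity ordering of the integrity labels, and the threshold are assumptions, and the score history in the usage lines is placeholder data, not values from any YZ Index run.

```python
# Assumed severity ordering of the integrity labels used in the ranking table.
INTEGRITY_ORDER = {"pass": 0, "warn": 1, "fail": 2}


def integrity_downgraded(previous: str, current: str) -> bool:
    """True if the integrity rating became more severe (e.g. pass -> warn)."""
    return INTEGRITY_ORDER[current] > INTEGRITY_ORDER[previous]


def weak_streak(daily_scores: list[float], threshold: float) -> int:
    """Count the most recent consecutive days with a score below the threshold."""
    streak = 0
    for score in reversed(daily_scores):
        if score >= threshold:
            break
        streak += 1
    return streak


# Grok 4's integrity moved from pass to warn in this brief.
print(integrity_downgraded("pass", "warn"))  # -> True

# Placeholder history (oldest first), NOT real data: two weak days in a row.
print(weak_streak([90.0, 85.0, 60.0, 55.0], threshold=70.0))  # -> 2
```

A streak of one is consistent with sampling noise on a 10-question test; a longer streak on the same dimension is the kind of repeated weakness the brief suggests watching for.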

Data source: YZ Index | Run #214

This article is from the Winzheng Index blog, translated in full by Winzheng (winzheng.com). When republishing the translation, please credit the source.
