Doubao Pro and Gemini 3.1 Pro tied at 88.54: 2026-07-05 Smoke Quick Test Data Brief

On July 5, 2026, the YZ Index Smoke Quick Test covered 11 models, with Doubao Pro and Gemini 3.1 Pro tying for first place at 88.54 points. Smoke is a daily 10-question quick test designed to surface short-term signals; its results are not equivalent to the conclusions of the Full weekly ranking.

This Smoke evaluation covers only the two main ranking dimensions: code execution and material constraints. The main ranking formula is 0.55 × Code Execution + 0.45 × Material Constraints. Because the daily sample is small, single-day scores are better treated as monitoring signals than as long-term conclusions about model capabilities.
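The main ranking formula can be sketched in a few lines of Python. This is a minimal illustration, not the YZ Index's own code: the function name `main_score` is hypothetical, and rounding half-up to two decimals is an assumption chosen because it reproduces the published scores.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the main ranking formula stated in the brief.
CODE_WEIGHT = Decimal("0.55")
MATERIAL_WEIGHT = Decimal("0.45")


def main_score(code_execution: str, material_constraints: str) -> Decimal:
    """Weighted main score; half-up rounding to 2 decimals is an assumption."""
    raw = (CODE_WEIGHT * Decimal(code_execution)
           + MATERIAL_WEIGHT * Decimal(material_constraints))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Doubao Pro and Gemini 3.1 Pro: code execution 97, material constraints 78.2
print(main_score("97", "78.2"))  # prints 88.54
```

Decimal arithmetic is used here instead of floats so that borderline values such as 83.315 round predictably.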

Daily Ranking

Rank  Model              Main Score  Code Execution  Material Constraints  Integrity
#1    Doubao Pro         88.54       97              78.2                  pass
#2    Gemini 3.1 Pro     88.54       97              78.2                  pass
#3    Gemini 2.5 Pro     83.32       87.5            78.2                  pass
#4    Grok 4             81.44       75              89.3                  warn
#5    Claude Sonnet 4.6  79.79       72              89.3                  pass
#6    GPT-o3             79.79       72              89.3                  pass
#7    DeepSeek V4 Pro    77.72       88.7            64.3                  pass
#8    GPT-5.5            74.79       72              78.2                  pass
#9    Claude Opus 4.7    70.6        55.3            89.3                  pass
#10   Qwen3 Max          63.73       42.8            89.3                  pass
#11   GLM-4.6            60.04       88.7            25                    fail

Data Interpretation

In today's YZ Index Smoke Quick Test, Doubao Pro and Gemini 3.1 Pro tied for the top main score at 88.54. Both scored 97 on code execution and 78.2 on material constraints, a profile of high code execution paired with moderate material constraints. Gemini 2.5 Pro scored 83.32 on the main ranking, with code execution at 87.5 and material constraints at 78.2, also leaning toward code execution. Grok 4 scored 81.44, with code execution at 75 and material constraints at 89.3, a profile weighted toward material constraints.

Claude Opus 4.7's main score dropped 24.7 points from the previous test, with code execution down 41.7 points. Gemini 3.1 Pro's main score rose 18.1 points, with code execution up 25 points and material constraints up 9.7 points. Grok 4's main score fell 15.1 points, with code execution down 24.2 points and integrity changing from pass to warn. Gemini 2.5 Pro's main score dropped 13.7 points, with code execution down 12.5 points and material constraints down 15.1 points. GPT-o3's main score fell 12.6 points, with code execution down 25 points. With a single day's small sample, these swings may stem from question sampling variance or reflect real performance changes; subsequent runs under the same conditions are needed to tell which.

DeepSeek V4 Pro's material constraints score dropped 15.8 points to 64.3, in sharp contrast with its code execution score of 88.7. This signal also needs multiple retests to determine whether it is a random fluctuation.

Key Changes

  • Claude Opus 4.7: Main score down 24.7 points, code execution -41.7 points
  • Gemini 3.1 Pro: Main score up 18.1 points, code execution +25 points, material constraints +9.7 points
  • Grok 4: Main score down 15.1 points, code execution -24.2 points, integrity pass → warn
  • Gemini 2.5 Pro: Main score down 13.7 points, code execution -12.5 points, material constraints -15.1 points
  • GPT-o3: Main score down 12.6 points, code execution -25 points
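Because the main score is a weighted sum, a change in the main score should equal the same weighted sum of the per-dimension changes. The sketch below checks this for the two models whose code execution and material constraints changes are both reported above. It assumes the previous run used the same 0.55 / 0.45 weights, and the function name `expected_main_delta` is hypothetical. Small gaps are expected, since the reported changes are rounded to one decimal.

```python
# Weights from the main ranking formula; assumed unchanged since the previous run.
CODE_WEIGHT = 0.55
MATERIAL_WEIGHT = 0.45


def expected_main_delta(code_delta: float, material_delta: float) -> float:
    """Main-score change implied by the per-dimension changes."""
    return CODE_WEIGHT * code_delta + MATERIAL_WEIGHT * material_delta


# (code execution change, material constraints change, reported main score change)
reported_changes = {
    "Gemini 3.1 Pro": (25.0, 9.7, 18.1),
    "Gemini 2.5 Pro": (-12.5, -15.1, -13.7),
}

for model, (code_delta, material_delta, reported) in reported_changes.items():
    implied = expected_main_delta(code_delta, material_delta)
    gap = abs(implied - reported)
    print(f"{model}: implied {implied:+.2f}, reported {reported:+.1f}, gap {gap:.2f}")
```

For both models the implied change agrees with the reported change to within rounding, which suggests the reported figures are internally consistent.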

Signals to Monitor

  • DeepSeek V4 Pro: Material constraints down 15.8 points
  • GLM-4.6: Integrity rating is fail in today's Smoke data

When reading a Smoke brief, focus on two questions: first, whether a model shows the same type of weakness on multiple consecutive days; second, whether its integrity rating changes from pass to warn or fail. Large single-day swings in code execution or material constraints scores may stem from question sampling or may be early signs of genuine degradation; only subsequent runs can verify which.
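The two reading questions can be turned into simple checks. The following is a rough sketch under stated assumptions: the function names, the ordering pass < warn < fail, the weakness threshold, and the number of consecutive days are all hypothetical choices, since the brief does not define them.

```python
# Assumed severity ordering of integrity ratings (not defined in the brief).
INTEGRITY_ORDER = {"pass": 0, "warn": 1, "fail": 2}


def integrity_worsened(previous: str, current: str) -> bool:
    """True if the rating moved toward fail, e.g. pass -> warn or pass -> fail."""
    return INTEGRITY_ORDER[current] > INTEGRITY_ORDER[previous]


def repeated_weakness(scores: list[float], threshold: float, min_days: int) -> bool:
    """True if the most recent `min_days` scores are all below `threshold`.

    `scores` is ordered oldest to newest and holds one dimension only,
    such as a model's daily code execution scores.
    """
    if len(scores) < min_days:
        return False
    return all(score < threshold for score in scores[-min_days:])


# Grok 4's integrity moved from pass to warn in this run.
print(integrity_worsened("pass", "warn"))  # prints True

# Hypothetical daily scores, for illustration only (not YZ Index data).
print(repeated_weakness([80.0, 60.0, 55.0, 50.0], threshold=70.0, min_days=3))  # prints True
```

A single low day does not trigger `repeated_weakness`, which matches the brief's advice to wait for repeated runs before drawing conclusions.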


Data source: YZ Index | Run #214

This article is from the Winzheng Index blog, translated in full by Winzheng (winzheng.com). When republishing the translation, please credit the source.