On July 5, 2026, the YZ Index Smoke Quick Test covered 11 models, with Doubao Pro and Gemini 3.1 Pro tying for first place at 88.54 points. Smoke is a daily 10-question quick test designed for observing short-term signals; its results should not be read as equivalent to the conclusions of the Full weekly ranking.
This Smoke evaluation covers only two main ranking dimensions: code execution and material constraints. The main ranking formula is 0.55 × Code Execution + 0.45 × Material Constraints. Because the daily sample is small, single-day scores are better used as monitoring signals than as long-term conclusions about model capabilities.
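As a minimal sketch of the formula above, the snippet below recomputes a main score from the two dimension scores. The function name `main_score` is hypothetical, and rounding to two decimals with half-up rounding is an assumption that happens to reproduce the published figures; the source does not state how YZ Index rounds.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights taken from the stated main ranking formula.
W_CODE = Decimal("0.55")
W_MATERIAL = Decimal("0.45")


def main_score(code_execution: str, material_constraints: str) -> Decimal:
    """Weighted main score, rounded to two decimals.

    Half-up rounding is an assumption, not a documented YZ Index rule.
    Scores are passed as strings so Decimal parses them exactly.
    """
    raw = W_CODE * Decimal(code_execution) + W_MATERIAL * Decimal(material_constraints)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Code execution 97 and material constraints 78.2 (the two top-ranked models).
print(main_score("97", "78.2"))  # prints 88.54
```

Decimal arithmetic is used here rather than floats so that values landing exactly on a rounding boundary, such as 83.315, are not distorted by binary floating-point representation.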
Daily Ranking
| Rank | Model | Main Score | Code Execution | Material Constraints | Integrity |
|---|---|---|---|---|---|
| #1 | Doubao Pro | 88.54 | 97 | 78.2 | pass |
| #2 | Gemini 3.1 Pro | 88.54 | 97 | 78.2 | pass |
| #3 | Gemini 2.5 Pro | 83.32 | 87.5 | 78.2 | pass |
| #4 | Grok 4 | 81.44 | 75 | 89.3 | warn |
| #5 | Claude Sonnet 4.6 | 79.79 | 72 | 89.3 | pass |
| #6 | GPT-o3 | 79.79 | 72 | 89.3 | pass |
| #7 | DeepSeek V4 Pro | 77.72 | 88.7 | 64.3 | pass |
| #8 | GPT-5.5 | 74.79 | 72 | 78.2 | pass |
| #9 | Claude Opus 4.7 | 70.6 | 55.3 | 89.3 | pass |
| #10 | Qwen3 Max | 63.73 | 42.8 | 89.3 | pass |
| #11 | GLM-4.6 | 60.04 | 88.7 | 25 | fail |
Data Interpretation
In today's YZ Index Smoke Quick Test, Doubao Pro and Gemini 3.1 Pro tied for the top main score at 88.54, both with code execution at 97 and material constraints at 78.2: a profile of high code execution paired with moderate material constraints. Gemini 2.5 Pro scored 83.32 on the main ranking, with code execution at 87.5 and material constraints at 78.2, also leaning toward code execution. Grok 4 scored 81.44 on the main ranking, with code execution at 75 and material constraints at 89.3, a profile weighted toward material constraints.
Claude Opus 4.7's main score dropped 24.7 points from the previous test, with code execution down 41.7 points. Gemini 3.1 Pro's main score rose 18.1 points, with code execution up 25 points and material constraints up 9.7 points. Grok 4's main score fell 15.1 points, with code execution down 24.2 points and integrity changing from pass to warn. Gemini 2.5 Pro's main score dropped 13.7 points, with code execution down 12.5 points and material constraints down 15.1 points. GPT-o3's main score fell 12.6 points, with code execution down 25 points. In small-sample single-day data, these fluctuations may stem from question sampling variance or may reflect real performance changes; subsequent runs under the same conditions are needed to tell the two apart.
DeepSeek V4 Pro's material constraints score dropped sharply, by 15.8 points, to 64.3, in marked contrast to its code execution score of 88.7. This signal also needs multiple retests to determine whether it is a random fluctuation.
Key Changes
- Claude Opus 4.7: Main score down 24.7 points, code execution -41.7 points
- Gemini 3.1 Pro: Main score up 18.1 points, code execution +25 points, material constraints +9.7 points
- Grok 4: Main score down 15.1 points, code execution -24.2 points, integrity pass → warn
- Gemini 2.5 Pro: Main score down 13.7 points, code execution -12.5 points, material constraints -15.1 points
- GPT-o3: Main score down 12.6 points, code execution -25 points
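Because the main score is a fixed weighted sum, a change in main score should equal the same weighted sum of the per-dimension changes. The sketch below checks this for the two models whose code execution and material constraints changes are both reported above. The function name `main_delta` is hypothetical, and small mismatches are possible since the published changes are themselves rounded.

```python
# Weights from the main ranking formula.
W_CODE = 0.55
W_MATERIAL = 0.45


def main_delta(d_code: float, d_material: float) -> float:
    """Main score change implied by the per-dimension changes."""
    return W_CODE * d_code + W_MATERIAL * d_material


# (model, code execution change, material constraints change, reported main change)
reported = [
    ("Gemini 3.1 Pro", 25.0, 9.7, 18.1),
    ("Gemini 2.5 Pro", -12.5, -15.1, -13.7),
]

for model, d_code, d_material, d_main in reported:
    implied = round(main_delta(d_code, d_material), 1)
    print(f"{model}: implied {implied:+.1f}, reported {d_main:+.1f}")
```

For both models the implied change matches the reported one to one decimal place, which suggests the listed changes are internally consistent with the formula.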
Signals to Monitor
- DeepSeek V4 Pro: Material constraints down 15.8 points
- GLM-4.6: Integrity rating is fail in today's Smoke data
When reading Smoke briefs like this one, focus on two questions: first, whether a model exposes the same type of weakness for multiple consecutive days; second, whether its integrity rating changes from pass to warn or fail. Large single-day fluctuations in code execution or material constraints scores may stem from question sampling or may be early signals of genuine degradation, and only subsequent runs can verify which.
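The two reading questions above can be sketched as simple checks over a model's recent daily results. Everything in this snippet is an illustrative assumption: the `DailyResult` structure, the function names, the threshold of 50, and the three-day window are hypothetical and not part of YZ Index, and the sample history is made-up data rather than real results.

```python
from dataclasses import dataclass

# Ordering of integrity ratings from best to worst.
SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


@dataclass
class DailyResult:
    code_execution: float
    material_constraints: float
    integrity: str  # "pass", "warn", or "fail"


def integrity_worsened(previous: DailyResult, current: DailyResult) -> bool:
    """True if the integrity rating moved toward warn or fail."""
    return SEVERITY[current.integrity] > SEVERITY[previous.integrity]


def repeated_weakness(history: list[DailyResult], dimension: str,
                      threshold: float, days: int) -> bool:
    """True if `dimension` stayed below `threshold` on each of the last `days` runs."""
    if len(history) < days:
        return False
    return all(getattr(r, dimension) < threshold for r in history[-days:])


# Made-up three-day history for a hypothetical model, oldest first.
history = [
    DailyResult(80.0, 40.0, "pass"),
    DailyResult(85.0, 45.0, "pass"),
    DailyResult(82.0, 30.0, "warn"),
]

print(repeated_weakness(history, "material_constraints", threshold=50.0, days=3))
print(integrity_worsened(history[-2], history[-1]))
```

Requiring the weakness to persist across several runs is what separates a monitoring signal from single-day noise; the window length and threshold would need tuning against real Smoke data.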
Data source: YZ Index | Run #214
© 2026 Winzheng.com 赢政天下 | When reprinting, please credit the source and include a link to the original article