Winzheng

Doubao Pro Material Constraint Plunges 15.9 Points: Causes of the Single-Day Smoke Test Anomaly

During actual testing of 11 models in the YZ Index in June 2026, Doubao Pro's material constraint score in the Smoke evaluation dropped from 100.00 to 84.10, a decline of 15.9 points, causing its main ranking total score to fall from 100.00 to 92.85.

Doubao Pro Material Constraints Smoke Test
69 06-19

GPT-o3 Material Constraint Plunges 15.2 Points in a Single Day, Smoke Main Board Drops from 100 to 93.16

In the June 2026 YZ Index real-world test of 11 models, GPT-o3's Smoke evaluation material constraint score dropped from 100.00 to 84.80 in a single day, pulling the main board from 100.00 to 93.16. The decline is likely due to small-sample volatility, but a potential model degradation cannot be ruled out yet.

GPT-o3 Material Constraints Smoke Test
77 06-19

Smoke Evaluation: Qwen3 Max Constraints Surge +23 Points, GPT-o3 Material Constraints Plunge 15.2 Points

In the YZ Index Smoke lightweight evaluation on June 19, 2026, Gemini 3.1 Pro topped the main leaderboard with 99.28 points, scoring 100 in code execution and 98.4 in material constraints. Under the weighting of 0.55 × execution + 0.45 × constraints, its balance across both dimensions is what puts it in first place.

Qwen3 Max Material Constraints Gemini 3.1 Pro
69 06-19
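
The 0.55 × execution + 0.45 × constraints weighting quoted in the June 19 summary can be checked against the figures reported in this listing. The following is a minimal sketch, not the YZ Index's actual scoring code: the function name `smoke_main_score`, the use of `Decimal`, and the half-up rounding to two decimals are all assumptions made for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# Weights as quoted in the June 19 summary: 0.55 x execution + 0.45 x constraints.
EXECUTION_WEIGHT = Decimal("0.55")
CONSTRAINT_WEIGHT = Decimal("0.45")


def smoke_main_score(code_execution: str, material_constraint: str) -> Decimal:
    """Hypothetical helper: combine two dimension scores into a main-board score.

    Scores are passed as strings so Decimal keeps them exact.
    Rounding half-up to two decimals is an assumption, not a documented rule.
    """
    raw = (EXECUTION_WEIGHT * Decimal(code_execution)
           + CONSTRAINT_WEIGHT * Decimal(material_constraint))
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # Dimension scores taken from the summaries in this listing.
    print(smoke_main_score("100", "98.4"))     # Gemini 3.1 Pro, June 19 -> 99.28
    print(smoke_main_score("100", "84.10"))    # Doubao Pro, June 19 -> 92.85
    print(smoke_main_score("66.70", "96.70"))  # Grok 4 before, June 18 -> 80.20
    print(smoke_main_score("100", "71.10"))    # Grok 4 after, June 18 -> 87.00
```

The Grok 4 pair shows why a 25.6-point drop in material constraint can coincide with a higher main score: the 33.3-point gain in code execution carries the larger weight.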

Grok 4 Material Constraint Plunges 25.6 Points, Yet Leaderboard Rises to 87 Points

In today's YZ Index Smoke evaluation, Grok 4's Material Constraint score dropped from 96.70 to 71.10, a decrease of 25.6 points, but Code Execution rose from 66.70 to 100 points, lifting the overall leaderboard from 80.20 to 87 points.

Grok 4 Material Constraints Smoke Test
88 06-18

Grok 4 Material Constraint Plummets 25.6 Points; Four Models Tie for Perfect Score on Main Leaderboard

Four models achieve perfect scores on both code execution and material constraint dimensions, while Grok 4 suffers a sharp 25.6-point drop in constraint, offsetting its execution gains and falling to the bottom.

Grok 4 Material Constraints Smoke Test
76 06-18

WDCD Three-Round Attenuation Test: GPT-o3 R3 Collapse Rate 50%, Qwen3 Max Zero Collapse

In the WDCD three-round test, GPT-o3's collapse rate in the R3 phase reached 50%, while Qwen3 Max had zero collapses in R3. Both models scored 1.00 in R1 confirmation rate but showed vastly different integrity trajectories under sustained pressure.

WDCD Compliance Test Model Attenuation
121 06-17

Qwen3 Max Scores 92.50 to Top WDCD Commitment Ranking; Doubao Pro 62.50 Ranks Last with 30-Point Gap

Qwen3 Max scored 92.50 to top the WDCD Commitment Ranking, leading second-place Claude Sonnet 4.6 by 2.5 points, while Doubao Pro scored 62.50 to rank last among 11 models, trailing the champion by 30 points.

WDCD Compliance Test Qwen3 Max
105 06-17

ERNIE Bot 4.5 Main Leaderboard Plunges 10.4 Points, Task Expression Dimension Halved from 90 to 46.3

ERNIE Bot 4.5's main leaderboard score dropped 10.4 points in a single day in the Smoke evaluation, and its Task Expression dimension plummeted from 90 to 46.3. The decline is likely due to random sampling fluctuations rather than systematic capability degradation.

ERNIE Bot 4.5 Main Leaderboard Smoke Test
97 06-17

Qwen3 Max Material Constraint Plunged 28.9 Points, but Main Leaderboard Rose Slightly by 0.8

In the YZ Index Smoke evaluation, Qwen3 Max's Material Constraint score dropped from 100.00 to 71.10, a decline of 28.9 points. However, its main leaderboard score rose slightly by 0.8 points, indicating the fluctuation is likely due to test question sampling randomness rather than systematic capability degradation.

Qwen3 Max Material Constraints Smoke Test
89 06-17

Qwen3 Max Material Constraint Plummets 28.9 Points, Today's Smoke 11 Model Main Leaderboard Reshuffles

During the June 17, 2026 test of 11 models by YZ Index, Qwen3 Max's material constraint score dropped sharply from 100 points yesterday to 71.1 points, and its main leaderboard score was only 73.25 points, making it the most prominent anomaly of the day.

Qwen3 Max Material Constraints Smoke Light Test
87 06-17

Doubao Pro Smoke Evaluation Main Ranking Plunges 9.9 Points, Code Execution Halved from 100 to 50

In the YZ Index June 2026 test of 11 models, Doubao Pro's main ranking score dropped 9.9 points to 72.50 as its code execution score halved from 100.00 to 50.00. This single-day fluctuation may stem from the random question draw; further observation is needed to determine whether the model has degraded.

Doubao Pro Code Execution Smoke Test
155 06-16

Claude Sonnet 4.6 Code Execution Plunges from 100 to 50, Main Score Drops 6.9 Points

In the YZ Index June 2026 Smoke evaluation of 11 models, Claude Sonnet 4.6's code execution score dropped sharply from 100.00 to 50.00, causing its main score to fall from 79.44 to 72.50.

Claude Sonnet 4.6 Code Execution Smoke Test
157 06-16

Claude Opus 4.7 Scores 100 to Claim Crown, 9 Models See Code Execution Plummet by 50 Points

Claude Opus 4.7 scored 100 on the main leaderboard with perfect scores in code execution and material constraint, while nine models experienced a 50-point plunge in code execution.

Claude Opus 4.7 Code Execution Smoke Test
137 06-16

Doubao Pro Material Constraint Plunges 24 Points, Code Execution Soars from 38.4 to 100

In today's Smoke evaluation, Doubao Pro's Material Constraint score dropped from 84.80 to 60.80, while Code Execution surged from 38.40 to 100.00 and the main ranking score rose from 59.28 to 82.36. Swings this extreme are more likely due to question sampling randomness than to model capability degradation.

Doubao Pro Material Constraints Smoke Test
261 06-15

Grok 4 Material Constraint Plummets 21.7 Points, Code Execution Rises to 100

In today's Smoke evaluation on the YZ Index, Grok 4's material constraint score dropped from 83.00 to 61.30, a decline of 21.7 points, while code execution score rose from 80.90 to 100.00.

Grok 4 Material Constraints Smoke Test
239 06-15

Material Constraint Plunged by 39 Points, All 11 Models on YZ Index Main Leaderboard Decline

On June 15, 2026, the YZ Index main leaderboard for 11 models dropped collectively due to a sharp decline in Material Constraint scores, with a maximum drop of 39 points. Grok 4 remained first but saw its constraint fall to 61.3, close to the pass line.

Material Constraints Grok 4 Smoke Light Test
169 06-15

Qwen3 Max tops WDCD Compliance Leaderboard with 84.38 points, GPT-o3 at bottom with 67.19 points, a gap of 17 points

Qwen3 Max leads the WDCD Compliance Leaderboard with 84.38 points. GPT-o3 ranks last with 67.19 points, trailing by 17.19 points.

WDCD Compliance Test Qwen3 Max
302 06-14

Gemini 2.5 Pro Code Execution Plunges 45 Points, Smoke Main Score Drops 19.3 in One Day

Gemini 2.5 Pro's Smoke evaluation main score fell from 89.79 yesterday to 70.53 today, a drop of 19.3 points. The code execution dimension tumbled from 100.00 to 55.00, while the material constraint dimension rose from 77.30 to 89.50.

Gemini 2.5 Pro Code Execution Smoke Test
223 06-14

Grok 4 Code Execution Plunges 19.1 Points, Main Ranking Drops 7.7 – Sampling or Degradation?

In the June 2026 YZ Index test of 11 models, Grok 4's Smoke evaluation code execution score dropped from 100.00 yesterday to 80.90, and its main ranking overall fell from 89.56 to 81.85.

Grok 4 Code Execution Smoke Test
205 06-14

Claude Opus 4.7 Drops 26.9 Points, GPT-5.5 Rises 3.1 Points Against the Trend: Three-Day Smoke Trend

In the three-day Smoke quick test from June 12 to June 14, 2026, Claude Opus 4.7 dropped 26.9 points from 96.83 to 69.91, making it the model with the largest decline. In contrast, GPT-5.5 was the only model showing an upward trend, with a trend value of +3.1.

Claude Opus 4.7 GPT-5.5 Smoke Quick Test
219 06-14

© 1998-2026 Winzheng All rights reserved.

Founded in 1998, relaunched in 2025. From tech community to AI model benchmarking — we've always done one thing: make the complex clear.


This benchmark operates independently and accepts no sponsorship from AI model vendors. Every score in the YZ Index is produced by automated evaluation.

Citation format: YZ Index (2026). AI Model Comprehensive Rankings. https://www.winzheng.com/yz-index/

Data License: CC BY-NC 4.0