Winzheng

This Week's 11-Model Overhaul: Newcomer Qwen3 Max Enters with 68.5, Veterans at 75 Exit En Masse

This week’s YZ Index v6 main leaderboard saw six legacy models removed and five new ones added simultaneously, reshuffling the top ten within a single week.

Qwen3 Max Code Execution Model Iteration
243 05-18

Gemini 3.1 Pro Main Score Plunges 11.1 Points as Code Execution Falls from 100 to 75

In today's Smoke quick test, Gemini 3.1 Pro's main score dropped 11.1 points, primarily due to code execution falling from 100 to 75, while material constraint rose slightly to 75.

Gemini 3.1 Pro Code Execution Smoke Test
245 05-18

Qwen3 Max Main Index Plummets 10.9 Points, Code Execution Drops 25 Points in a Single Day

Qwen3 Max's main index dropped 10.9 points in today's Smoke test, with the code execution dimension falling from a perfect 100 to 75. This one-day fluctuation exceeds the normal random variance and requires serious attention.

Qwen3 Max Code Execution Model Evaluation
200 05-18

GPT-5.5 Main Ranking Plunges 23.5 Points, Doubao Pro 97.75 Tops Smoke

Today's Smoke lightweight evaluation results show Doubao Pro leading with 97.75 points (Execution 100, Constraint 95), becoming the only model among 11 mainstream models to break 97 points on the main ranking. GPT-5.5, which was previously expected to perform well, scored only 60.58, dropping 23.5 points compared to yesterday.

Doubao Pro GPT-5.5 Smoke Test
243 05-18
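
The Doubao Pro figures above (Execution 100, Constraint 95, main score 97.75) are consistent with a simple weighted average of the two dimensions. The sketch below is a reconstruction, not the YZ Index's published formula: the 55/45 weights and the function name `smoke_main_score` are assumptions inferred from scores on this page. The same weights also reproduce the 86.05 reported further down for Claude Sonnet 4.6 (code execution 100, material grounding 69).

```python
# Hypothetical reconstruction: these weights are inferred from scores
# quoted on this page, not taken from YZ Index documentation.
W_EXECUTION = 55   # percent weight assumed for code execution
W_CONSTRAINT = 45  # percent weight assumed for material constraint


def smoke_main_score(execution: float, constraint: float) -> float:
    """Weighted average of the two Smoke dimensions, rounded to 2 places."""
    total = W_EXECUTION * execution + W_CONSTRAINT * constraint
    return round(total / 100, 2)


# Doubao Pro: Execution 100, Constraint 95
print(smoke_main_score(100, 95))  # 97.75
# Claude Sonnet 4.6: code execution 100, material grounding 69
print(smoke_main_score(100, 69))  # 86.05
```

If the weighting really is close to 55/45, a 25-point change in code execution alone would move the main score by roughly 13.75 points, which would explain why single-dimension swings dominate the daily headlines.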

WDCD Cycle Dramatic Shift: GPT-5.5 Tops with 71.67 Points, Gemini Surges 14.2, Wenxin Crashes

In this WDCD cycle, GPT-5.5 re-establishes the ceiling of instruction adherence with an absolute score of 71.67, while Gemini 2.5 Pro's 14.2-point leap completely overturns the perception that Google models are weak in adherence. Meanwhile, Wenxin Yiyan 4.5 suffers a 7.5-point drop, signaling potential over-alignment issues.

WDCD Compliance Test Model Updates
356 05-17

Resource Constraints Become the Hardest Scenario in WDCD, Doubao Scores 3.5 Points in Business Rules, Surpassing GPT

The WDCD five-scenario evaluation reveals that resource constraints is the hardest scenario, with the lowest overall scores, while Doubao Pro achieves the highest score in business rules, demonstrating significant model specialization.

WDCD Compliance Test Model Comparison
331 05-17

R3 Collapse Rate 93.3%! Grok4 WDCD Three-Round Test: First Round Fully Compliant, Last Round Crashes

The WDCD three-round test reveals that model integrity drops to 30.6% under direct pressure in R3, with Grok4 hitting a 93.3% collapse rate, exposing the fragility of safety alignment.

WDCD Compliance Test Model Decay
324 05-17

WDCD Commitment Ranking: GPT-5.5 Dominates with 71.67 Points, Grok 4 Trails at 52.5 Points

The WDCD Commitment Test reveals models' true performance under constraints through three rounds of dialogue. GPT-5.5 leads with 71.67 points, while Grok 4 scores only 52.5 points, ranking last—a gap of 19.17 points between the top and bottom.

WDCD Compliance Test AI Model Rankings
277 05-17

Claude Sonnet 4.6 Drops 12.3 Points on Main Leaderboard, Material Constraint Plummets 27.3 Points in a Single Day

Claude Sonnet 4.6 showed abnormal results in today's Smoke test, with the material constraint dimension dropping sharply. The drop may be due to sampling variance but warrants further monitoring.

Claude Sonnet 4.6 Material Constraints Smoke Test
346 05-17

Claude Opus 4.7 Smoke Evaluation Main Score Plunges 9 Points, Material Constraint Drops 20 Points in a Single Day

In today's Smoke evaluation, Claude Opus 4.7's main score dropped by 9 points from 97.75 to 88.75, primarily due to a sharp decline in the material constraint dimension from 95 to 75 points—a direct loss of 20 points in a single day.

Claude Opus 4.7 Material Constraints Smoke Quick Test
337 05-17

7-Day Smoke Quick Test: Wenxin Yiyan Soars 53.4 Points, GPT-o3 Leads Declines with a 7.8-Point Drop

This week's 7-day Smoke Quick Test data reveals polarization: Wenxin Yiyan surged 53.4 points while GPT-o3 fell 7.8 points.

ERNIE Bot GPT-o3 Smoke Test
336 05-17

Three Models Tie at 88.75 for First Place; Claude's Duo Plunges 12 Points; Smoke Rankings Undergo Major Shakeup

Today's Smoke Lite evaluation results show a three-way tie for first place at 88.75 points, while the Claude series suffered sharp declines. The shakeup signals that open models are rapidly closing the gap with closed-source leaders.

Claude Opus 4.7 Material Constraints Smoke Light Test
327 05-17

GPT-5.5's Main Ranking Plunges 28 Points: Is It Real Degradation?

GPT-5.5's code execution score dropped from 100 to 50, causing a 28-point drop in the main ranking. But is this degradation or just sampling noise?

GPT-5.5 Code Execution Smoke Test
377 05-16
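
One way to frame the "degradation or sampling noise" question is to ask how often a stable model would post a low score purely by chance on a very small test. The sketch below assumes the code execution dimension is scored on 4 independent pass/fail tasks (a guess based on scores on this page appearing in steps of 25) and uses a hypothetical true pass rate of 0.9; neither number comes from the YZ Index itself.

```python
from math import comb


def prob_score_at_most(threshold: int, n_tasks: int, p_pass: float) -> float:
    """Binomial probability that score = 100 * passes / n_tasks is <= threshold.

    Assumes tasks are independent and each passes with probability p_pass.
    """
    max_passes = threshold * n_tasks // 100
    return sum(
        comb(n_tasks, k) * p_pass**k * (1 - p_pass) ** (n_tasks - k)
        for k in range(max_passes + 1)
    )


# Hypothetical setup: 4 tasks, true pass rate 0.9 (both are assumptions).
print(f"P(score <= 75) = {prob_score_at_most(75, 4, 0.9):.4f}")  # 0.3439
print(f"P(score <= 50) = {prob_score_at_most(50, 4, 0.9):.4f}")  # 0.0523
```

Under these assumptions, a drop to 75 happens on about a third of runs and says little on its own, while a drop to 50 occurs on roughly one run in twenty, which is unusual but not conclusive without a repeat run.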

Gemini 2.5 Pro Drops 10 Points: Ability Intact, Credibility Fails

Gemini 2.5 Pro's credibility rating fell from pass to fail, causing a 10-point drop in the main ranking, even though its code execution score remained perfect.

Gemini 2.5 Pro Material Constraints Smoke Test
346 05-16

Three Models Plunge by 28 Points, Claude Still Near Perfect Score

Today's YZ Index Smoke lightweight test reveals that three leading models suffered significant drops, while Claude models dominate with near-perfect scores, showing structural advantages in code execution and material constraint.

Claude Sonnet 4.6 GPT-5.5 Code Execution
432 05-16

DeepSeek Gains 5 Points but Fails: 10-Question Smoke Test Alarm

Today's Smoke evaluation shows the main benchmark score up 5 points, but the integrity rating dropped from pass to fail: a classic warning sign of a model that looks more capable yet loses trustworthiness at the admission gate.

DeepSeek V4 Pro Integrity Rating Smoke Test
399 05-15

Claude Sonnet 4.6 Material Grounding Plunges 27.5 Points, But Main Leaderboard Rises Against the Trend by 1.4 Points?

In today's Smoke evaluation, Anthropic's Claude Sonnet 4.6 saw a dramatic split: material grounding scores dropped 27.5 points to 69, while code execution surged 25 points to a perfect 100, with the main leaderboard edging up 1.4 points to 86.05.

Claude Sonnet 4.6 Material Constraints Smoke Test
397 05-15

Two Zero-Execution Shocks, Claude Holds at 88.75

Today’s Smoke benchmark shows Claude Opus 4.7 leading with 88.75, while two models scored zero in code execution; the real differentiator is material constraint, not execution ability.

Claude Opus 4.7 Material Constraints Smoke Test
367 05-15

GPT-OSS 20B: A Sparse MoE Pretraining Benchmark for MLPerf Training v6.0

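MLCommons has introduced a GPT-OSS 20B pretraining benchmark for MLPerf Training v6.0, evaluating sparse MoE training capability with a lower hardware barrier to entry. By using a fixed validation set, optimizer stabilization, and unified initialization, the benchmark significantly reduces training variance, with the goal of making results reflect system efficiency more faithfully.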

MLC MLPerf Training GPT-OSS 20B
439 05-14

Claude Opus 4.7 Smoke Evaluation Main Chart Plunges 9.6 Points: Degradation Signal or Lottery Farce?

In today's Smoke Evaluation, Claude Opus 4.7's main chart score plummeted from 89.43 to 79.86, a net loss of 9.6 points, with code execution collapsing from a perfect 100 to 75. The sharp drop raises the question of whether this signals model degradation or is merely a random sampling fluctuation.

Claude Opus 4.7 YZ Index Smoke Test
422 05-14

© 1998-2026 Winzheng All rights reserved.

Founded in 1998, relaunched in 2025. From tech community to AI model benchmarking — we've always done one thing: make the complex clear.


This benchmark operates independently and accepts no sponsorship from AI model vendors. Every score in the YZ Index is produced by automated evaluation.

Citation format: YZ Index (2026). AI Model Comprehensive Rankings. https://www.winzheng.com/yz-index/

Data License: CC BY-NC 4.0