AI Reviews | Winzheng

This Week's 11-Model Overhaul: Newcomer Qwen3 Max Enters with 68.5, Veterans at 75 Exit En Masse

This week’s YZ Index v6 main leaderboard saw six legacy models removed and five new ones added simultaneously, reshuffling the top ten within a single week.

Gemini 3.1 Pro Main Score Plunges 11.1 Points as Code Execution Falls from 100 to 75

In today's Smoke quick test, Gemini 3.1 Pro's main score dropped 11.1 points, primarily due to code execution falling from 100 to 75, while material constraint rose slightly to 75.

Qwen3 Max Main Index Plummets 10.9 Points, Code Execution Drops 25 Points in a Single Day

Qwen3 Max's main index dropped 10.9 points in today's Smoke test, with the code execution dimension falling from a perfect 100 to 75. This one-day fluctuation exceeds the normal random variance and requires serious attention.

GPT-5.5 Main Ranking Plunges 23.5 Points, Doubao Pro 97.75 Tops Smoke

Today's Smoke lightweight evaluation results show Doubao Pro leading with 97.75 points (Execution 100, Constraint 95), becoming the only model among 11 mainstream models to break 97 points on the main ranking. GPT-5.5, which was previously expected to perform well, scored only 60.58, dropping 23.5 points compared to yesterday.

WDCD Cycle Dramatic Shift: GPT-5.5 Tops with 71.67 Points, Gemini Surges 14.2, Wenxin Crashes

In this WDCD cycle, GPT-5.5 re-establishes the ceiling of instruction adherence with an absolute score of 71.67, while Gemini 2.5 Pro's 14.2-point leap completely overturns the perception that Google models are weak in adherence. Meanwhile, Wenxin Yiyan 4.5 suffers a 7.5-point drop, signaling potential over-alignment issues.

Resource Constraints Become the Hardest Scenario in WDCD, Doubao Scores 3.5 Points in Business Rules, Surpassing GPT

The WDCD five-scenario evaluation reveals that resource constraints is the hardest scenario, with the lowest overall scores, while Doubao Pro achieves the highest score in business rules, demonstrating significant model specialization.

R3 Collapse Rate 93.3%! Grok 4 WDCD Three-Round Test: First Round Fully Compliant, Last Round Crashes

The WDCD three-round test reveals that model integrity drops to 30.6% under direct pressure in R3, with Grok 4 hitting a 93.3% collapse rate, exposing the fragility of safety alignment.

WDCD Commitment Ranking: GPT-5.5 Dominates with 71.67 Points, Grok 4 Trails at 52.5 Points

The WDCD Commitment Test reveals models' true performance under constraints through three rounds of dialogue. GPT-5.5 leads with 71.67 points, while Grok 4 scores only 52.5 points, ranking last—a gap of 19.17 points between the top and bottom.

Claude Sonnet 4.6 Drops 12.3 Points on Main Leaderboard, Material Constraint Plummets 27.3 Points in a Single Day

Claude Sonnet 4.6 showed abnormal results in today's Smoke test, with the material constraint dimension dropping sharply. The drop may be due to sampling variance but warrants further monitoring.

Claude Opus 4.7 Smoke Evaluation Main Score Plunges 9 Points, Material Constraint Drops 20 Points in a Single Day

In today's Smoke evaluation, Claude Opus 4.7's main score dropped by 9 points from 97.75 to 88.75, primarily due to a sharp decline in the material constraint dimension from 95 to 75 points—a direct loss of 20 points in a single day.

7-Day Smoke Quick Test: Wenxin Yiyan Soars 53 Points, GPT-o3 Leads Declines with a 7.8-Point Drop

This week's 7-day Smoke Quick Test data reveals polarization: Wenxin Yiyan surged 53.4 points while GPT-o3 fell 7.8 points.

Three Models Tie at 88.75 for First Place; Claude's Duo Plunges 12 Points; Smoke Rankings Undergo Major Shakeup

Today's Smoke Lite evaluation results show a three-way tie for first place at 88.75 points, while the Claude series suffered sharp declines. The shakeup signals that open models are rapidly closing the gap with closed-source leaders.

GPT-5.5's Main Ranking Plunges 28 Points: Is It Real Degradation?

GPT-5.5's code execution score dropped from 100 to 50, causing a 28-point drop in the main ranking. But is this degradation or just sampling noise?
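One rough way to frame the noise question is a binomial back-of-envelope check. The sketch below is illustrative only and rests on two assumptions not stated in the source: that the code execution dimension is scored over four pass/fail items (suggested by the 25-point steps seen in these results), and that the model has a hypothetical stable per-item pass rate of 0.9. The function name `prob_score_at_most` is likewise made up for this example.

```python
from math import comb


def prob_score_at_most(observed_passes: int, n_items: int, pass_rate: float) -> float:
    """Return P(X <= observed_passes) for X ~ Binomial(n_items, pass_rate)."""
    return sum(
        comb(n_items, k) * pass_rate**k * (1 - pass_rate) ** (n_items - k)
        for k in range(observed_passes + 1)
    )


# Assumption: four pass/fail items, so each item is worth 25 points.
N_ITEMS = 4
# Assumption: hypothetical stable per-item pass rate of an unchanged model.
TRUE_RATE = 0.9

# Chance that an unchanged model scores 75 or lower (3 or fewer passes).
p_75_or_less = prob_score_at_most(3, N_ITEMS, TRUE_RATE)
# Chance that an unchanged model scores 50 or lower (2 or fewer passes).
p_50_or_less = prob_score_at_most(2, N_ITEMS, TRUE_RATE)

print(f"P(score <= 75) = {p_75_or_less:.4f}")
print(f"P(score <= 50) = {p_50_or_less:.4f}")
```

Under these assumptions, a dip to 75 would be fairly common on any given day, while a fall to 50 would be rare but not impossible, so a single-day reading alone cannot settle the question either way.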

Gemini 2.5 Pro Drops 10 Points: Ability Intact, Credibility Fails

Gemini 2.5 Pro's credibility rating fell from pass to fail, causing a 10-point drop in the main ranking, even though its code execution score remained perfect.

Three Models Plunge by 28 Points, Claude Still Near Perfect Score

Today's YZ Index Smoke lightweight test reveals that three leading models suffered significant drops, while the Claude models hold near-perfect scores, with structural advantages in code execution and material constraint.

DeepSeek Gains 5 Points but Fails: 10-Question Smoke Test Alarm

Today's Smoke evaluation shows DeepSeek's main score up 5 points, but its integrity rating dropped from pass to fail: a classic warning sign of a model that looks more capable yet loses trustworthiness at the admission gate.

Claude Sonnet 4.6 Material Grounding Plunges 27.5 Points, But Main Leaderboard Rises Against the Trend by 1.4 Points?

In today's Smoke evaluation, Anthropic's Claude Sonnet 4.6 saw a dramatic split: material grounding scores dropped 27.5 points to 69, while code execution surged 25 points to a perfect 100, with the main leaderboard edging up 1.4 points to 86.05.

Two Zero-Execution Shocks, Claude Holds at 88.75

Today’s Smoke benchmark shows Claude Opus 4.7 leading with 88.75, while two models scored zero in code execution; the real differentiator is material constraint, not execution ability.

GPT-OSS 20B: A Sparse MoE Pretraining Benchmark for MLPerf Training v6.0

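MLCommons has introduced a GPT-OSS 20B pretraining benchmark for MLPerf Training v6.0, evaluating sparse MoE training capability with a lower hardware barrier to entry. By using a fixed validation set, optimizer stabilization, and unified initialization, the benchmark significantly reduces training variance, with the goal of making results reflect system efficiency more faithfully.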

Claude Opus 4.7 Smoke Evaluation Main Chart Plunges 9.6 Points: Degradation Signal or Lottery Farce?

In today's Smoke Evaluation, Claude Opus 4.7's main chart score plummeted from 89.43 to 79.86, a net loss of 9.6 points, with code execution collapsing from a perfect 100 to 75. The sharp drop raises the question of whether this signals model degradation or is merely a random sampling fluctuation.