AI Reviews | Winzheng

Bringing DeepSeek-V4 Flash RL Training to AMD Instinct MI355X GPUs with Miles

AMD & Miles Team, July 10, 2026. DeepSeek-V4 RL is now supported in Miles on AMD Instinct™ MI355X GPUs with ROCm™! RL requi…

Serving GLM5.2 NVFP4 Agentic Workload with SGLang: Reaching 500 TPS in 2 Weeks

SGLang Team, July 14, 2026. TL;DR: More than 500 TPS on 8xB300 (bs=1); sync-free speculative decoding for GLM 5.2 MTP; Built-in I…

SGLang and Miles Add Day-0 Support for Inkling, a Frontier Multimodal Model

SGLang Team & Thinking Machines Lab, July 15, 2026. We're excited to partner with the Thinking Machines team to bring Day-0 s…

OPD Support in Miles

Kaixi Hou & Miles Team, July 18, 2026. We recently implemented On-Policy Distillation (OPD) as an important feature in Miles. OPD is now integrated into Miles rollout and training…

Grok 4 Leads with 84.21 Points: 2026-07-24 Smoke Quick Test Data Brief

On July 24, 2026, the Winzheng YZ Index Smoke Quick Test covered 10 models, with Grok 4 scoring 84.21 points to top the daily ranking. This daily 10-question test is designed for observing short-term signals and is not a substitute for the Full weekly ranking.

GLM-4.6: 93.30 on Material Constraint but Integrity Fail, Code Execution 25.00 Drags Down Leaderboard

In the Run #243 Smoke test, GLM-4.6 scored 55.74 on the main leaderboard, with code execution at 25.00, material constraint at 93.30, and an integrity rating of fail (probe score 30.00).

Claude Opus 4.7 Tops with 96.99: 2026-07-23 Smoke Quick Test Data Brief

On 2026-07-23, the YZ Index Smoke Quick Test covered 11 models, with Claude Opus 4.7 ranking first at 96.99. Smoke is a daily 10-question quick test for short-term signals and does not replace the Full weekly ranking.

GLM-4.6 Soars 13.7 Points in WDCD; GPT-o3 Drops 6.9: Commitment Top Five Reshuffled

In the latest WDCD v3.1 commitment test, GLM-4.6 surged 13.7 points over Run #233 to 92.00, while GPT-o3 fell 6.9 points to 87.10, directly reshuffling the top five rankings.

Resource Limitation Scenario Lowest at 1.55 Points: Maximum Spread of 2.45 Points Across 11 Models in WDCD Compliance Test

In the resource limitation scenario, gpt-5.5 scored only 1.55/4, and in the business rules scenario, Doubao-pro scored only 1.45/4, pointing to the weakest constraint types in the WDCD v3.1 compliance test.

R3 Integrity Rate Only 40.9%: Four Models Score Zero in WDCD Business Rule Scenario

In three rounds of testing on 8 v2 anchor questions, the average R3 integrity rate across 11 models was only 40.9%, with 4 models experiencing complete collapse (score 0).

Grok 4 Scores 93.80 to Top the Compliance Test, Doubao Pro Trails at 67.30 with a 26.5-Point Gap

In the WDCD v3.1 compliance test, Grok 4 achieved the highest score of 93.80 among 11 evaluated models, while Doubao Pro scored the lowest at 67.30, a difference of 26.5 points. The top three models formed a clear tier with a significant gap from the rest.

GLM-4.6 Integrity Rating Drops from Pass to Fail, Code Execution Surges by 47 Points

GLM-4.6's integrity rating fell from pass to fail in today's Smoke evaluation, while its code execution score surged by 47 points. However, the overall ranking increase was driven solely by this dimension, suggesting sampling fluctuation rather than genuine improvement.

GPT-o3 Smoke Evaluation Main Leaderboard Plunges 8.3 Points, Code Execution Drops from 100 to 88.3

In today’s Smoke evaluation, GPT-o3’s main leaderboard score fell from 96.27 to 87.94, a drop of 8.3 points. Code execution declined from 100.00 to 88.30, while engineering judgment saw the steepest decline, dropping from 94.80 to 75.00.

Grok 4 Leads with 98.35 Points: 2026-07-22 Smoke Quick Test Data Brief

On July 22, 2026, the YZ Index Smoke Quick Test covered 11 models, with Grok 4 ranking first at 98.35 points. The Smoke test uses 10 daily questions to monitor short-term signals and is not a substitute for the Full weekly ranking.

Claude Opus 4.7 Smoke Evaluation Main Ranking Drops 26.1 Points, Code Execution and Material Constraints Both Fail

In today's Smoke evaluation, Claude Opus 4.7's main ranking score dropped sharply by 26.1 points to 73.92. The code execution and material constraints dimensions saw significant declines, while engineering judgment remained relatively stable.

Gemini 3.1 Pro Material Constraint Drops 17.8 Points, Main Ranking Falls 6 Points

In today's Smoke evaluation, Gemini 3.1 Pro's material constraint score dropped from 90.40 to 72.60, a decrease of 17.8 points, pulling its main ranking score down from 81.93 to 75.90.

Claude Sonnet 4.6 and GPT-o3 Tie at 96.27: 2026-07-21 Smoke Quick Test Data Brief

On July 21, 2026, the YZ Index Smoke Quick Test covered 11 models, with Claude Sonnet 4.6 and GPT-o3 tying for first place at 96.27 points, showing balanced strengths in code execution and material constraints.

Qwen3 Max Main Score Plunges 14.9 Points, Code Execution Drops from 96.9 to 65.6

In today's Smoke evaluation, Qwen3 Max's main score dropped from 82.23 to 67.31, a decrease of 14.9 points, with the code execution dimension falling from 96.90 to 65.60.

Gemini 2.5 Pro Code Execution Dropped 24.6 Points in a Single Day; Overall Ranking Slid 6.5 Points

In today's Smoke evaluation, Gemini 2.5 Pro's code execution score dropped from 74.60 to 50.00, a decrease of 24.6 points, while its overall score fell from 76.49 to 69.98.

Claude Opus 4.7 Leads with 100 Points: 2026-07-20 Smoke Quick Test Data Brief

On July 20, 2026, the YZ Index Smoke Quick Test covered 11 models, with Claude Opus 4.7 scoring 100 points to top the daily ranking. The Smoke test uses 10 questions per day to observe short-term signals and is not a substitute for the Full weekly ranking.