AI Reviews | Winzheng

Claude Opus 4.7 Leads at 95.19 Points: 2026-08-03 Smoke Quick Test Data Briefing

On 2026-08-03, the YZ Index Smoke quick test covered 11 models, with Claude Opus 4.7 ranking first at 95.19 points. Smoke is a daily 10-question quick test for observing short-term signals; its results are not equivalent to the conclusions of the Full weekly ranking.

GLM-4.6 Smoke Evaluation: Main Leaderboard Score 74, Code Execution 82.3, Material Constraint 95, API Failure Leaves Dimensions Missing

GLM-4.6 scored 74.00 on the main leaderboard in today's Smoke evaluation, with 82.30 on code execution and 95.00 on material constraint. Two dimensions are missing because of API failures or timeouts; they have been queued for automatic retesting and are excluded from this period's ranking.

Qwen3 Max Rallies +36.8 to Lead; Gemini 3.1 Pro Slips 5.6 as Biggest Loser

In the July 28–August 2, 2026 Smoke evaluation, Qwen3 Max posted the largest seven-day gain (+36.8) to close at 96.1, while Gemini 3.1 Pro fell 5.6 points to 94.45, making it the biggest loser. The report also analyzes scoring trajectories, volatility drivers, integrity-rating changes, and implications for users.

Doubao Pro Leads with 96.7 Points: 2026-08-02 Smoke Quick Test Data Briefing

On August 2, 2026, the YZ Index Smoke quick test covered 10 models, with Doubao Pro ranking first at 96.7 points. The briefing covers daily scores across code execution and material constraint dimensions, along with key fluctuations and integrity signals to monitor.

MLPerf Endpoints v0.7: A Foundation Release

MLPerf Endpoints v0.7 marks the foundational release of a buyer-centric AI inference benchmark, publishing initial results from Coreweave, Google, Intel, KRAI, and Nvidia across three benchmarks. The release supports automated submission pipelines, continuous review tooling, and dynamic result visualization, with v1.0 planned for later this year.

SGLang and Miles Add Day-0 Support for Kimi K3

SGLang Team | July 27, 2026 | We are excited to announce Day-0 support for Kimi K3 in SGLang and Miles. K3 is the first open-source model in the 3-trillion-para

Towards Blackwell-Native 8-bit and 4-bit RL: End-to-End MXFP8 and NVFP4 RL in Miles

Ziang Li, humans& and Miles Team | July 29, 2026 | TL;DR: We implemented two Blackwell-native RL recipes in Miles: end

RadixArk Joins Forces with Google to Bring Full SGLang Features to TPUs

RadixArk & Google | July 30, 2026 | RadixArk and Google Cloud are partnering to bring SGLang to TPUs, giving developers ultimate fl

Toward a Cleaner Quantization Stack in SGLang

SGLang X Ascend Team | July 28, 2026 | Quantization has moved from an advanced feature to an essential part of high-throughput LLM serving. As the number of chec

GLM-4.6 Material Constraint Score Plummets 27.3 Points, Main Score Rises 30.2 Points

In today's Smoke evaluation, GLM-4.6's material constraint score dropped from 75.00 to 47.70 points, while its main score rose from 46.29 to 76.47 points.

GPT-o3 Drops 13.9 Points on Today's Main Leaderboard, Losing Ground in Both Code Execution and Material Constraints

GPT-o3 scored 79.28 points on today's Smoke evaluation main leaderboard, down 13.9 points from yesterday's 93.16, with notable declines in both code execution and material constraint dimensions.

Claude Opus 4.7 and Qwen3 Max Tie at 93.39: 2026-08-01 Smoke Quick Test Data Brief

On 2026-08-01, the YZ Index Smoke quick test covered 11 models, with Claude Opus 4.7 and Qwen3 Max tying for first place at 93.39 points. Key signals include GLM-4.6's integrity dropping to warn and multiple models posting sharp overall declines.

Qwen3 Max Material Constraint Drops 20 Points to 47.70, Code Execution Surges 37.8 Points, Main Leaderboard Rises 11.8 Points

In today's Smoke evaluation, Qwen3 Max's material constraint score dropped 20 points to 47.70, while its code execution score soared 37.8 points to 92.50, lifting its main leaderboard score by 11.8 points to 72.34.

Grok 4 Code Execution Plunges 19.5 Points, Material Constraint Rises 23.2 Points, Main Leaderboard Drops Only 0.3

In today's Smoke evaluation, Grok 4's code execution score dropped from 92.00 to 72.50 while its material constraint score rose from 60.90 to 84.10, leaving the main leaderboard score down only slightly, from 78.01 to 77.72.

DeepSeek V4 Pro Leads with 96.94: 2026-07-31 Smoke Quick Test Data Brief

On 2026-07-31, the YZ Index Smoke quick test covered 10 models, with DeepSeek V4 Pro scoring 96.94 to top the daily rankings. The test focuses on code execution and material constraints, serving as a short-term signal indicator.

DeepSeek V4 Pro Code Execution Drops 25 Points, Main Benchmark Slides 6.7 Points

DeepSeek V4 Pro’s code execution score fell from 100.00 to 75.00 in today’s Smoke evaluation, dragging the main benchmark from 83.53 to 76.85. Material constraint rose 15.7 points, suggesting the drop was driven by small-sample variance rather than a systematic model degradation.

Grok 4's Main Score Plummets 11.3 Points in Smoke Evaluation, Material Constraint Drops 18 Points in a Single Day

Grok 4's main leaderboard score in today's Smoke Evaluation dropped from 89.30 to 78.01, a decline of 11.3 points. The Material Constraint dimension fell 18 points in a single day, directly driving the main score down.

Claude Opus 4.7 and GPT-5.5 Tie at 86.5: 2026-07-30 Smoke Quick Test Data Brief

On 2026-07-30, the YZ Index Smoke quick test covered 11 models, with Claude Opus 4.7 and GPT-5.5 tying for first at 86.5 points. Smoke is a daily 10-question test designed for short-term signal monitoring, not equivalent to the Full weekly ranking.

Claude Duo Up 6.8 Points, Gemini Down 5.6, WDCD Compliance Leaderboard Shifts Dramatically

In the latest WDCD v3.1 compliance test, relative to Run #247, Claude Opus 4.7 rose 6.8 points, Claude Sonnet 4.6 rose 6.7 points, GPT-5.5 rose 5.3 points, and Gemini 3.1 Pro dropped 5.6 points, revealing a clear divergence in how well models retain constraints under multi-round progressive pressure.

WDCD Five-Scenario Cross-Evaluation: Business Rules Lowest Across All Models, Engineering Standards Show Stark 3-Point Gap

Claude-opus-4.7 scored 4/4 in data boundaries, resource constraints, and security compliance, but only 3/4 in engineering standards, making it the most pronounced case of uneven performance in the WDCD v3.1 five-scenario evaluation. DeepSeek-v4-pro and GLM-4.6 led in engineering standards at 4/4, while gemini-2.5-pro dropped to 1/4, highlighting a significant gap.