AI Reviews | Winzheng

WDCD Three-Round Anchor Test: R3 Integrity Rate Only 45.5%, GPT-5.5 and Qwen3 Max Collapse Rate 20%

In a three-round test using only 8 v2 anchor questions, the 11 models averaged an R1 confirmation rate of 0.95 and an R2 resistance rate of 0.86, but the R3 integrity rate dropped to 45.5%, with 9 instances of complete collapse (a score of 0). The data show a cliff-like decline in models' ability to adhere to constraints under sustained pressure.

Grok 4 Tops WDCD Compliance Leaderboard with 94.80 Points, Doubao Pro Trails at 64.20 Points, a 30-Point Gap

In the WDCD v3.1 compliance test, Grok 4 ranked first with 94.80 points, while Doubao Pro placed 11th with 64.20 points, a difference of 30.6 points.

Grok 4 Leads at 89.3 Points: 2026-07-29 Smoke Quick Test Data Brief

On July 29, 2026, the Winzheng YZ Index Smoke quick test covered 10 models, with Grok 4 topping the daily standings at 89.3 points. This single-day test is a short-term signal monitor and is not equivalent to the full-week benchmark conclusions.

Claude Sonnet 4.6 Code Execution Drops 22 Points, Material Compliance Rises 25.7 Points

In today’s Smoke evaluation, Claude Sonnet 4.6 saw a 22-point drop in code execution score while material compliance surged by 25.7 points, indicating sample variance rather than capability degradation.

DeepSeek V4 Pro Code Execution Plunges 25 Points, Material Constraint Rises 26.8 Points

In today's Smoke evaluation, DeepSeek V4 Pro's code execution score dropped from 100.00 to 75.00, while material constraint rose from 68.20 to 95.00, resulting in a main leaderboard score change from 85.69 to 84.00.

Gemini 3.1 Pro Tops with 100 Points: 2026-07-28 Smoke Quick Test Data Brief

On 2026-07-28, the YZ Index Smoke quick test covered 11 models, with Gemini 3.1 Pro scoring 100 to top the day. Smoke is a daily 10-question quick test suitable for observing short-term signals and is not equivalent to the Full weekly ranking conclusions.

DeepSeek V4 Pro Material Constraint Plunges 31.8 Points While Code Execution Jumps from 69.5 to 100

In today's Smoke evaluation, DeepSeek V4 Pro's Material Constraint score dropped from 100.00 to 68.20 points, a decrease of 31.8 points, while its Code Execution score rose from 69.50 to 100.00 points, an increase of 30.5 points.

GPT-o3 Code Execution Surges 52.5 Points, Material Constraint Drops 15.7 Points, Main Leaderboard Rises 21.8 Points

GPT-o3's code execution score jumped from 44.50 to 97.00 in today's Smoke benchmark, while its material constraint score fell from 100.00 to 84.30. The main leaderboard score increased from 69.48 to 91.29.
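The point changes quoted in these Smoke briefs are plain before/after differences, so they can be rechecked in a few lines. The following is a minimal sketch using the GPT-o3 figures reported above; the function name `score_delta` and the dictionary keys are hypothetical labels chosen for illustration, not part of any YZ Index tooling.

```python
from decimal import Decimal


def score_delta(before: str, after: str) -> Decimal:
    """Return after - before; Decimal keeps two-decimal scores exact."""
    return Decimal(after) - Decimal(before)


# Before/after values as reported in the GPT-o3 brief.
# The key names are illustrative assumptions, not official metric identifiers.
gpt_o3 = {
    "code_execution": ("44.50", "97.00"),
    "material_constraint": ("100.00", "84.30"),
    "main_leaderboard": ("69.48", "91.29"),
}

deltas = {
    name: score_delta(before, after)
    for name, (before, after) in gpt_o3.items()
}

for name, delta in deltas.items():
    # The "+" format flag prints an explicit sign for rises and drops.
    print(f"{name}: {delta:+}")
```

The main leaderboard difference works out to 21.81, which the headline rounds to 21.8. The same helper applies to the DeepSeek V4 Pro and Claude Sonnet 4.6 briefs by swapping in their reported values.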

GPT-o3 Tops with 91.29 Points: 2026-07-27 Smoke Quick Test Data Brief

GPT-o3 scored 91.29 to top the YZ Index Smoke quick test on July 27, 2026, which covered 11 models. The daily test focuses on Code Execution and Material Constraints.

Grok 4 Leads with 94.20 in Compliance, Claude and Gemini Both Drop Over 5 Points

In the latest WDCD v3.1 trial, Grok 4 achieved 94.20 points, while Claude Opus 4.7 and Gemini 3.1 Pro dropped by 5.9 and 5.6 points respectively compared to Run #242. The remaining nine models showed no improvement, and the third- through fifth-ranked models all scored in the 83-point range.

WDCD Five-Scenario Review: Business Rules Become the Hardest, Grok-4 Scores Perfect 4, Claude-sonnet Only 1.8

In the WDCD v3.1 compliance test, the business rules scenario recorded the lowest average score, with Claude-sonnet-4.6 scoring only 1.8/4 and Grok-4 achieving a perfect 4/4, a gap of 2.2 points.

R3 Integrity Rate Only 50.6%: Grok 4 Zero Collapse, GPT-o3 and Qwen3 Max at 20% Collapse

In the WDCD v3.1 pilot, tests on eight v2 three-round anchor problems showed that 11 models achieved an average R3 integrity rate of just 50.6%. Grok 4 posted a resistance score of 1.63/2 with zero collapses, while GPT-o3 and Qwen3 Max each recorded a 20% collapse rate.

DeepSeek V4 Pro Tops with 83.23: 2026-07-26 Smoke Quick Test Data Brief

On 2026-07-26, the YZ Index Smoke quick test covered 10 models, with DeepSeek V4 Pro ranking first with a score of 83.23. Smoke is a daily 10-question quick test suitable for observing short-term signals and is not equivalent to the Full weekly ranking conclusion.

Claude Sonnet 4.6 and Grok 4 Tie at 96.98: 2026-07-25 Smoke Test Data Brief

On July 25, 2026, the YZ Index Smoke test covered 11 models, with Claude Sonnet 4.6 and Grok 4 both scoring 96.98, tying for first place. This quick test monitors short-term signals and does not replace the full weekly ranking.

The Benchmark Behind the Next Wave of Ultra-Low-Power AI

Machine learning has moved beyond data centers into battery-powered devices with milliwatt power budgets. MLPerf Tiny provides a fair, architecture-neutral benchmark to compare performance and efficiency across radically different ultra-low-power systems.

Agentic Inference for MLPerf Inference

MLPerf Inference introduces a new Agentic Inference benchmark targeting multi-turn agentic workloads such as coding assistants and workflow agents. It uses real-world traces to evaluate inference serving stacks under long context, KV-cache reuse, and variable output lengths.

Call for Submission: Edge Agentic Inference Benchmark for MLPerf Inference v6.1

MLCommons introduces the new Edge Agentic Inference benchmark in MLPerf Inference v6.1, using Qwen3.6-27B with Q4_K_M quantization to measure accuracy and latency for on-device agentic LLM workloads. Submissions are due July 31, 2026.

MedPerf Meets Google Cloud Confidential Computing: Secure AI Benchmarking for Brain Tumor Research

At Google Cloud Next 2026, MLCommons Medical AI Working Group and Google Cloud announced that MedPerf, MLCommons' federated benchmarking orchestrator, now supports Google Cloud's Confidential Computing capabilities, demonstrated through a clinically relevant brain tumor segmentation scenario to protect both patient data and AI model IP.

Accelerating SGLang HiCache with Netpreme X-Mem™ MPU

Netpreme Team, July 8, 2026. Netpreme X-Mem™ Memory Processing Unit (MPU) makes SGLang HiCache faster and more scalable by augmenting the slower Host D

DSpark in SGLang: Speculative Decoding with Confidence-Driven, Variable-Length Verification

SGLang Team, July 6, 2026. Speculative decoding trades extra compute for fewer decode steps, and the trade sours