AI Reviews | 赢政天下

AI Model Time Zone Reasoning Comparison: Details Determine Success

Eight leading AI models showed clear capability divisions when tested on a seemingly simple time zone conversion question, with 5 models performing perfectly while 3 made calculation errors.

AI Models Show Clear Divide in Logical Reasoning: Half Fall into Reasoning Traps

In a seemingly simple logical reasoning test, 8 mainstream AI models diverged sharply: only 50% answered correctly, exposing significant disparities in the logical reasoning capabilities of current AI.

YZ Index Weekly Report: Collective Decline in Knowledge Work Capabilities, Claude Remains Stable Against the Trend

This week's YZ Index evaluation reveals a rare collective decline in knowledge work capabilities across AI models, with 6 out of 8 mainstream models showing performance degradation. Claude Sonnet 4.6 emerges as the only model with positive growth.

GPT-o3 Knowledge Work Score Plummets 12 Points: Logical Reasoning Ability Suspected to Have Degraded

GPT-o3's knowledge work score fell sharply this week, dropping from 82.4 to 70.3 points (a 12.1-point decline), with significant deterioration in logical reasoning and translation tasks.

GPT-o3 Performance Plummets: Technical Concerns Behind 12.1-Point Drop in Knowledge Work Capabilities

GPT-o3 experienced severe performance degradation in knowledge work this week, with scores plunging from 82.4 to 70.3 points, primarily affecting logical reasoning and language comprehension capabilities.

In-Depth Analysis: From DeepSeek to Gemini, How to Build an Impregnable Defense Against "Model Distillation"?

This article analyzes the DeepSeek model distillation incident and proposes a comprehensive multi-layered defense system against distillation attacks, including API-level controls, output watermarking, and architectural protections.

KTransformers Accelerates SGLang's Heterogeneous Inference

KTransformers, developed by Tsinghua University's MadSys and Approaching.AI, optimizes CPU/GPU collaborative inference for sparse MoE models through AMX-optimized kernels, efficient device coordination, and expert deferral mechanisms. It is now integrated into SGLang to accelerate heterogeneous inference.

SGLang-Diffusion: Two Months of Progress

SGLang-Diffusion has achieved 2.5x performance improvements since its launch in November 2025, with support for new models, LoRA, parallel processing, and ComfyUI integration.

SGLang Pipeline Parallelism: Million-Token Context Extension and Performance Breakthroughs

SGLang launches a highly optimized Pipeline Parallelism implementation designed for ultra-long-context inference. Through integrated optimizations and a clean design, it achieves a 3.31x speedup in prefill throughput for DeepSeek V3 on multi-node H20 clusters, demonstrating strong scalability for trillion-parameter models.

FP4 Mixed-Precision Inference Optimization on AMD GPUs

We developed Petit, a collection of FP16/BF16 × FP4 mixed-precision GPU kernels for AMD GPUs, enabling 1.74× faster Llama 3.3 70B inference on existing MI250/MI300 hardware without upgrades.

SGLang Achieves Deterministic Inference and Reproducible RL Training

SGLang implements fully deterministic inference with only 34.35% performance overhead and enables 100% reproducible RL training in collaboration with slime, providing reliable solutions for rigorous scientific experiments.

Optimizing DeepSeek Deployment on GB200 NVL72 (Part 2): 3.8x Prefill and 4.8x Decode Throughput

The SGLang team shares its progress in optimizing DeepSeek V3/R1 inference on GB200 NVL72, achieving 26,156 input tokens/s for prefill and 13,386 output tokens/s for decode per NVIDIA Blackwell GPU through techniques such as FP8 attention, NVFP4 MoE, and large-scale expert parallelism.

Partnering with SGLang: Best Practices for Efficiently Deploying DeepSeek-R1 on H20-96G

This article presents comprehensive optimization strategies for deploying DeepSeek-R1 on H20 GPUs, achieving state-of-the-art performance of 16.5k input tokens/s and 5.7k output tokens/s per node through hardware-aware parallelization, kernel optimizations, and advanced scheduling techniques.

PD-Multiplexing: A New Paradigm for High-Goodput LLM Serving Driven by GreenContext

This article introduces PD-Multiplexing, a new serving paradigm in SGLang that leverages NVIDIA's GreenContext technology to achieve higher goodput for LLM services through efficient intra-GPU resource sharing between prefill and decode phases.

SGLang Supports DeepSeek V3.2 Sparse Attention Mechanism from Day 0

SGLang announces Day 0 support for DeepSeek-V3.2, implementing the DeepSeek Sparse Attention (DSA) mechanism, which significantly improves training and inference efficiency, especially in long-context scenarios.

NVIDIA DGX Spark In-Depth Review: A New Benchmark for Local AI Inference

We conducted an in-depth review of NVIDIA DGX Spark, a compact all-in-one system that brings supercomputing-level performance to a desktop workstation form factor. While its unified memory design enables running ultra-large models, performance is constrained by memory bandwidth, making it better suited to prototyping and experimentation than to production deployment.

SGLang and NVIDIA Partner to Accelerate InferenceMAX Benchmark and GB200 Performance

SGLang collaborates with NVIDIA to leverage Blackwell architecture innovations, achieving up to 4x performance improvements on DeepSeek models, and has been selected as the default inference engine for NVIDIA and AMD hardware in the InferenceMAX benchmark.