Original AI News | Winzheng

SGLang-Diffusion: Two Months of Progress

SGLang-Diffusion has achieved a 2.5x performance improvement since its launch in November 2025 and has added support for new models, LoRA, parallel processing, and ComfyUI integration.

SGLang Pipeline Parallelism: Million-Token Context Extension and Performance Breakthroughs

SGLang launches a highly optimized Pipeline Parallelism implementation designed for ultra-long-context inference. Through integrated optimizations and a clean design, it achieves a 3.31x speedup in prefill throughput for DeepSeek V3 on multi-node H20 clusters, demonstrating strong scalability for trillion-parameter models.

FP4 Mixed-Precision Inference Optimization on AMD GPUs

We developed Petit, a collection of FP16/BF16 × FP4 mixed-precision GPU kernels for AMD GPUs, enabling 1.74× faster Llama 3.3 70B inference on existing MI250/MI300 hardware without hardware upgrades.

SGLang Achieves Deterministic Inference and Reproducible RL Training

SGLang implements fully deterministic inference with only 34.35% performance overhead and enables 100% reproducible RL training in collaboration with slime, providing reliable solutions for rigorous scientific experiments.

Optimizing DeepSeek Deployment on GB200 NVL72 (Part 2): 3.8x Prefill and 4.8x Decode Throughput

The SGLang team shares their optimization progress on DeepSeek V3/R1 inference performance using GB200 NVL72, achieving 26,156 input tokens/s for prefill and 13,386 output tokens/s for decode per NVIDIA Blackwell GPU through techniques like FP8 attention, NVFP4 MoE, and large-scale expert parallelism.

Partnering with SGLang: Best Practices for Efficiently Deploying DeepSeek-R1 on H20-96G

This article presents comprehensive optimization strategies for deploying DeepSeek-R1 on H20 GPUs, achieving state-of-the-art performance of 16.5k input tokens/s and 5.7k output tokens/s per node through hardware-aware parallelization, kernel optimizations, and advanced scheduling techniques.

PD-Multiplexing: A New Paradigm for High-Goodput LLM Serving Driven by GreenContext

This article introduces PD-Multiplexing, a new serving paradigm in SGLang that leverages NVIDIA's GreenContext technology to achieve higher goodput for LLM services through efficient intra-GPU resource sharing between prefill and decode phases.

SGLang Supports DeepSeek V3.2 Sparse Attention Mechanism from Day 0

SGLang announces Day 0 support for DeepSeek-V3.2, implementing the DeepSeek Sparse Attention (DSA) mechanism, which significantly improves training and inference efficiency, especially in long-context scenarios.

NVIDIA DGX Spark In-Depth Review: A New Benchmark for Local AI Inference

We conducted an in-depth review of NVIDIA DGX Spark, a compact all-in-one system that brings supercomputing-level performance to a desktop workstation form factor. While its unified memory design enables running ultra-large models, performance is constrained by memory bandwidth, making it better suited to prototyping and experimentation than to production deployment.

SGLang and NVIDIA Partner to Accelerate InferenceMAX Benchmark and GB200 Performance

SGLang collaborates with NVIDIA to leverage Blackwell architecture innovations, achieving up to 4x performance improvements on DeepSeek models, and has been selected as the default inference engine for NVIDIA and AMD hardware in the InferenceMAX benchmark.

SGLang-Jax: An Open-Source Tool for Native TPU Inference

We introduce SGLang-Jax, a state-of-the-art open-source inference engine built entirely on Jax and XLA, achieving fast native TPU inference with advanced features like continuous batching, prefix caching, and speculative decoding.

Optimizing GPT-OSS on NVIDIA DGX Spark: Unleashing Spark's Maximum Potential

We successfully optimized the GPT-OSS 20B and 120B models on NVIDIA DGX Spark using SGLang, achieving state-of-the-art performance of ~70 tokens/s and ~50 tokens/s, respectively, and enabling fully local AI applications including coding agents.

No Free Lunch: MiniMax M2 Deconstructs Efficient Attention Mechanisms

SGLang announces first-day support for MiniMax M2, a flagship MoE model that returns to full attention after empirical findings showed that efficient attention methods face significant challenges in production deployment.

SGLang Diffusion: Accelerating Video and Image Generation

SGLang Diffusion brings SGLang's performance to image and video generation with diffusion models, supporting mainstream open-source models with 1.2x to 5.9x speedups across diverse workloads.

🚀 AutoRound Partners with SGLang: A New Era of Efficient Quantized Model Inference

We are excited to announce the official collaboration between SGLang and AutoRound, supporting low-bit quantization for efficient LLM inference. This integration enables developers to quantize large models using AutoRound's signed gradient optimization techniques and deploy them directly in SGLang's efficient runtime, achieving low-bit model inference while minimizing accuracy loss and significantly reducing latency.

Miles Released: Enterprise-Grade RL Framework Igniting Large-Scale MoE Training

Today we release Miles, an enterprise-grade reinforcement learning framework designed for large-scale MoE training and production workloads, built on the proven foundation of slime.

LMSYS Fellowship Program Officially Launches

LMSYS announces its Fellowship Program offering up to $50,000 in funding for U.S. PhD students who have made significant contributions to open-source AI infrastructure.

Unified FP8: Beyond Mixed Precision, Achieving Stable Accelerated MoE RL Training

We implemented an end-to-end FP8 sampling and training pipeline for RL. Experiments show that for MoE models, using BF16 training with FP8 rollout leads to severe train-inference inconsistency as model size increases. Unified FP8 for both training and rollout effectively eliminates quantization-induced inconsistency, improving RL training speed and stability.

From Research to Production: EAGLE-3 Accelerates Open Source LLM Inference 2-3x on Vertex AI

This article details how EAGLE-3 (Extrapolative Attention Guided LEarning) was productionized on Vertex AI, achieving 2-3x speedup for LLM inference through lightweight draft heads instead of separate draft models, along with engineering challenges and lessons learned.

SGLang Inference Acceleration: Native Integration with NVIDIA Model Optimizer for Seamless Quantized Deployment

SGLang now features native integration with NVIDIA Model Optimizer, enabling direct quantization and deployment within the SGLang ecosystem and improving single-GPU throughput by up to 2x.