SGLang-Diffusion: Two Months of Progress
SGLang-Diffusion has achieved 2.5x performance improvements since its launch in November 2025, with support for new models, LoRA, parallel processing, and ComfyUI integration.
SGLang launches a highly optimized Pipeline Parallelism implementation designed for the challenges of ultra-long-context inference. Through integrated optimizations and a clean design, it achieves a 3.31x speedup in prefill throughput for DeepSeek V3 on multi-node H20 clusters, demonstrating strong scalability for trillion-parameter models.
We developed Petit, a collection of FP16/BF16 × FP4 mixed-precision GPU kernels for AMD GPUs, enabling 1.74× faster Llama 3.3 70B inference on existing MI250/MI300 hardware without upgrades.
SGLang implements fully deterministic inference with only 34.35% performance overhead and enables 100% reproducible RL training in collaboration with slime, providing reliable solutions for rigorous scientific experiments.
The SGLang team shares their optimization progress on DeepSeek V3/R1 inference performance using GB200 NVL72, achieving 26,156 input tokens/s for prefill and 13,386 output tokens/s for decode per NVIDIA Blackwell GPU through techniques like FP8 attention, NVFP4 MoE, and large-scale expert parallelism.
This article presents comprehensive optimization strategies for deploying DeepSeek-R1 on H20 GPUs, achieving state-of-the-art performance of 16.5k input tokens/s and 5.7k output tokens/s per node through hardware-aware parallelization, kernel optimizations, and advanced scheduling techniques.
This article introduces PD-Multiplexing, a new serving paradigm in SGLang that leverages NVIDIA's GreenContext technology to achieve higher goodput for LLM services through efficient intra-GPU resource sharing between prefill and decode phases.
SGLang announces Day 0 support for DeepSeek-V3.2, implementing the DeepSeek Sparse Attention (DSA) mechanism, which significantly improves training and inference efficiency, especially in long-context scenarios.
We conducted an in-depth review of NVIDIA DGX Spark, a compact all-in-one system that brings supercomputing-level performance to a desktop workstation form factor. While its unified memory design enables running ultra-large models, performance is constrained by memory bandwidth, making it ideal for prototyping and experimentation rather than production deployment.
SGLang collaborates with NVIDIA to leverage Blackwell architecture innovations, achieving breakthrough performance on DeepSeek models with up to 4x improvements, and is selected as the default inference engine for NVIDIA and AMD hardware in the InferenceMAX benchmark.
We introduce SGLang-Jax, a state-of-the-art open-source inference engine built entirely on Jax and XLA, achieving fast native TPU inference with advanced features like continuous batching, prefix caching, and speculative decoding.
We successfully optimized GPT-OSS 20B and 120B models on NVIDIA DGX Spark using SGLang, achieving state-of-the-art performance of ~70 tokens/s and ~50 tokens/s respectively, enabling fully local AI applications including coding agents.
SGLang announces first-day support for MiniMax M2, a flagship MoE model that returns to full attention after empirical findings show efficient attention methods face significant production deployment challenges.
SGLang Diffusion brings SGLang's top performance to diffusion model image and video generation, supporting mainstream open-source models with 1.2x to 5.9x speedups across diverse workloads.
We are excited to announce the official collaboration between SGLang and AutoRound, supporting low-bit quantization for efficient LLM inference. This integration enables developers to quantize large models using AutoRound's signed gradient optimization techniques and deploy them directly in SGLang's efficient runtime, achieving low-bit model inference while minimizing accuracy loss and significantly reducing latency.
Today we release Miles, an enterprise-grade reinforcement learning framework designed for large-scale MoE training and production workloads, built on the proven foundation of slime.
LMSYS announces its Fellowship Program offering up to $50,000 in funding for U.S. PhD students who have made significant contributions to open-source AI infrastructure.
We implemented an end-to-end FP8 sampling and training pipeline for RL. Experiments show that for MoE models, using BF16 training with FP8 rollout leads to severe train-inference inconsistency as model size increases. Unified FP8 for both training and rollout effectively eliminates quantization-induced inconsistency, improving RL training speed and stability.
This article details how EAGLE-3 (Extrapolation Algorithm for Greater Language-model Efficiency) was productionized on Vertex AI, achieving a 2-3x speedup for LLM inference by using lightweight draft heads instead of separate draft models, and covers the engineering challenges and lessons learned.
SGLang now features native integration with NVIDIA Model Optimizer, enabling direct quantization and deployment within the SGLang ecosystem, achieving up to 2x single-GPU throughput improvements.