SGLang-Diffusion: Two Months of Progress

Feb 4, 2026 | Source: LMSYS

LMSYS, AI Technology, Deep Learning, Performance Optimization, Open Source

Since its release in early November 2025, SGLang-Diffusion has garnered widespread attention and adoption in the community. We are deeply grateful for the extensive feedback and contributions from open-source developers.

Over the past two months, we have refined and optimized SGLang-Diffusion, and the current version (docker image tag: lmsysorg/sglang:dev-pr-17247) is up to 2.5x faster than the initial release.

Overview

New Model Support

Added support for new models, including Flux.2, Qwen-Image-Edit-2511, and Z-Image-Turbo.
Compatible with the diffusers backend, with more improvements planned (see Issue #16642).

LoRA Support

LoRA adapters work with almost all supported models. The following LoRAs have been tested and verified:

Base Model	Supported LoRAs
Wan2.2	`lightx2v/Wan2.2-Distill-Loras` `Cseti/wan2.2-14B-Arcane_Jinx-lora-v1`
Wan2.1	`lightx2v/Wan2.1-Distill-Loras`
Z-Image-Turbo	`tarn59/pixel_art_style_lora_z_image_turbo` `wcde/Z-Image-Turbo-DeJPEG-Lora`
Qwen-Image	`lightx2v/Qwen-Image-Lightning` `flymy-ai/qwen-image-realism-lora` `prithivMLmods/Qwen-Image-HeadshotX` `starsfriday/Qwen-Image-EVA-LoRA`
Qwen-Image-Edit	`ostris/qwen_image_edit_inpainting` `lightx2v/Qwen-Image-Edit-2511-Lightning`
Flux	`dvyio/flux-lora-simple-illustration` `XLabs-AI/flux-furry-lora` `XLabs-AI/flux-RealismLora`

We provide comprehensive HTTP API support for LoRA configuration, merging, and management.
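For readers new to the term, "merging" a LoRA means folding its low-rank update into the base weight so that inference needs no extra matrix multiplications. The sketch below shows only that arithmetic in plain Python. The function names and tiny matrices are illustrative assumptions; this is not SGLang-Diffusion's implementation or its HTTP API.

```python
# Toy illustration of LoRA merging: W_merged = W + scale * (B @ A).
# All names are hypothetical; this is not SGLang-Diffusion code.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(weight, lora_a, lora_b, scale=1.0):
    """Fold a low-rank update into the base weight.

    Shapes: weight (out, in), lora_b (out, rank), lora_a (rank, in).
    """
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def unmerge_lora(weight, lora_a, lora_b, scale=1.0):
    """Undo merge_lora by adding the negated update."""
    return merge_lora(weight, lora_a, lora_b, scale=-scale)

base = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # rank 1, in = 2
B = [[0.5], [1.0]]    # out = 2, rank 1
merged = merge_lora(base, A, B, scale=2.0)
restored = unmerge_lora(merged, A, B, scale=2.0)
```

Because the update is additive, subtracting it again restores the base weight, which is what makes switching between adapters cheap in principle.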

Parallelism

Supports sequence parallelism (SP) and tensor parallelism (TP), as well as hybrid parallelism that combines Ulysses Parallel, Ring Parallel, and Tensor Parallel.
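To make the hybrid case concrete, the sketch below shows one way a set of GPUs could be factored into Tensor, Ulysses, and Ring groups. The rank ordering (TP varying fastest, then Ulysses, then Ring) and the function name are assumptions made for illustration; the layout SGLang-Diffusion actually uses may differ.

```python
# Hypothetical rank layout for hybrid parallelism; not SGLang-Diffusion's code.

def hybrid_parallel_groups(world_size, tp_degree, ulysses_degree, ring_degree):
    """Return, for every rank, the peer ranks in its TP, Ulysses, and Ring groups.

    Assumed ordering: TP varies fastest, then Ulysses, then Ring.
    """
    if tp_degree * ulysses_degree * ring_degree != world_size:
        raise ValueError("parallel degrees must multiply to world_size")

    def rank_of(ring_idx, ulysses_idx, tp_idx):
        return (ring_idx * ulysses_degree + ulysses_idx) * tp_degree + tp_idx

    groups = {}
    for rank in range(world_size):
        tp_idx = rank % tp_degree
        ulysses_idx = (rank // tp_degree) % ulysses_degree
        ring_idx = rank // (tp_degree * ulysses_degree)
        groups[rank] = {
            "tp": [rank_of(ring_idx, ulysses_idx, t) for t in range(tp_degree)],
            "ulysses": [rank_of(ring_idx, u, tp_idx) for u in range(ulysses_degree)],
            "ring": [rank_of(r, ulysses_idx, tp_idx) for r in range(ring_degree)],
        }
    return groups

layout = hybrid_parallel_groups(world_size=8, tp_degree=2, ulysses_degree=2, ring_degree=2)
```

With 8 GPUs split 2 x 2 x 2, each rank belongs to exactly one group of each kind, and the three degrees must multiply to the world size.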

Hardware Support

Compatible with AMD GPUs, NVIDIA RTX 4090 and RTX 5090, and MUSA hardware.

SGLang-Diffusion and ComfyUI Integration

We have implemented a flexible ComfyUI custom node that integrates SGLang-Diffusion's high-performance inference engine. Users can improve performance by replacing ComfyUI's loader with the SGL-Diffusion UNET Loader.

SGLang-Diffusion plugin in ComfyUI

Performance Benchmarks

In our performance tests, SGLang-Diffusion achieves state-of-the-art speed on NVIDIA GPUs, up to 5x faster than other solutions.

We also evaluated performance on AMD GPUs.

Key Improvements

1. Layerwise Offloading

We introduced LayerwiseOffloadManager and OffloadableDiTMixin, which prefetch the next layer's weights while the current layer is computing and thereby reduce VRAM usage.
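The prefetching idea can be sketched without a GPU. In the toy below, a single worker thread stands in for an asynchronous copy stream, and each "layer" is just a pair of numbers. All function names are hypothetical; this does not reproduce the LayerwiseOffloadManager or OffloadableDiTMixin interfaces.

```python
# Toy model of layerwise offloading with one-layer-ahead prefetch.
# Hypothetical names throughout; not the SGLang-Diffusion implementation.
from concurrent.futures import ThreadPoolExecutor

def load_to_device(host_weights, idx):
    """Stand-in for an asynchronous host-to-device copy of one layer."""
    return tuple(host_weights[idx])

def compute_layer(weights, x):
    """Stand-in for one DiT block: a toy affine map."""
    scale, bias = weights
    return x * scale + bias

def run_with_layerwise_offload(host_weights, x):
    """Run all layers while keeping at most two layers 'on device' at once."""
    n = len(host_weights)
    if n == 0:
        return x, 0
    peak_resident = 0
    with ThreadPoolExecutor(max_workers=1) as copy_stream:
        pending = copy_stream.submit(load_to_device, host_weights, 0)
        for i in range(n):
            weights = pending.result()  # wait until layer i is staged
            resident = 1
            if i + 1 < n:
                # Start copying layer i + 1 while layer i computes.
                pending = copy_stream.submit(load_to_device, host_weights, i + 1)
                resident += 1
            peak_resident = max(peak_resident, resident)
            x = compute_layer(weights, x)
            del weights  # release layer i before moving on
    return x, peak_resident

layers = [(2, 1), (3, 0), (1, 5)]
output, peak = run_with_layerwise_offload(layers, 1)
```

The point of the pattern is that only the current layer and the one being prefetched are resident at any time, so peak memory stays near two layers rather than growing with model depth.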

Comparison between standard loading and layerwise offloading

2. Kernel Improvements

Synced the latest FlashAttention kernels to eliminate performance lag.
Optimized QKV processing to reduce the number of intermediate tensors.
Optimized RoPE using the FlashInfer implementation to reduce overhead.
Fused weights to reduce the number of GEMMs.
Implemented a CUDA kernel for timesteps.
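Weight fusion is easiest to see on the Q, K, and V projections: concatenating the three weight matrices along the output axis turns three GEMMs into one, and the result is sliced afterwards. The plain-Python sketch below checks that equivalence on tiny matrices. Which weights SGLang-Diffusion actually fuses is not stated here, so treat the QKV choice and all names as assumptions.

```python
# Toy check that one fused GEMM matches three separate projections.
# Illustrative only; names and the QKV choice are assumptions.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def fuse_weights(*weights):
    """Concatenate (in, out_i) weight matrices along the output axis."""
    return [sum(rows, []) for rows in zip(*weights)]

def split_columns(mat, sizes):
    """Slice a matrix into column blocks of the given widths."""
    parts, start = [], 0
    for size in sizes:
        parts.append([row[start:start + size] for row in mat])
        start += size
    return parts

x = [[1.0, 2.0]]                       # one token, hidden size 2
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[2.0, 0.0], [0.0, 2.0]]
w_v = [[0.0, 1.0], [1.0, 0.0]]

separate = [matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)]   # three GEMMs
w_qkv = fuse_weights(w_q, w_k, w_v)                           # done once at load time
fused = split_columns(matmul(x, w_qkv), [2, 2, 2])            # one GEMM, then slicing
```

The fusion itself happens once when weights are loaded, so the per-step saving is two kernel launches plus better GEMM utilization from the wider output dimension.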

3. Cache-DiT Integration

We integrated Cache-DiT🤗 into SGLang-Diffusion. It is compatible with the various parallel modes and speeds up generation with only a few environment variable settings.
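As background on why caching helps diffusion transformers, the toy below illustrates the general residual-caching idea: run a cheap first block every step, and if its output has barely changed since the last full computation, reuse the cached residual of the remaining blocks instead of recomputing them. This is a simplified sketch under that assumption, with hypothetical names and threshold; it is not Cache-DiT's algorithm, API, or environment variables.

```python
# Toy residual cache across denoising steps. Hypothetical names and threshold;
# this is not Cache-DiT's implementation.

def rel_l1(a, b):
    """Relative L1 distance between two equal-length vectors."""
    denom = sum(abs(v) for v in b) or 1.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denom

class ToyResidualCache:
    def __init__(self, first_block, rest_blocks, threshold=0.1):
        self.first_block = first_block
        self.rest_blocks = rest_blocks
        self.threshold = threshold
        self.ref_probe = None        # first-block output at the last full compute
        self.cached_residual = None  # rest_blocks(probe) - probe at that step
        self.hits = 0

    def forward(self, hidden):
        probe = self.first_block(hidden)
        if self.ref_probe is not None and rel_l1(probe, self.ref_probe) < self.threshold:
            # Cache hit: skip the expensive blocks and reuse their residual.
            self.hits += 1
            return [p + r for p, r in zip(probe, self.cached_residual)]
        out = self.rest_blocks(probe)
        self.cached_residual = [o - p for o, p in zip(out, probe)]
        self.ref_probe = probe
        return out

def first_block(h):
    return [2.0 * v for v in h]

def rest_blocks(h):
    return [v + 1.0 for v in h]

cache = ToyResidualCache(first_block, rest_blocks, threshold=0.1)
outputs = [cache.forward(step) for step in ([1.0, 1.0], [1.01, 1.0], [5.0, 5.0])]
```

In this run the second step is close enough to the first to reuse the cache, while the third differs too much and triggers a full recompute. The threshold trades speed against fidelity.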

4. Other Improvements

Memory monitoring: Provides peak usage statistics in offline and online workflows.
Comprehensive performance profiling toolkit.
Optimization guides included in the Diffusion Cookbook.

Future Plans

Sparse attention backend
Quantization support
Consumer GPU optimizations
Joint design with sglang-omni