SGLang-Diffusion: Two Months of Progress

Since its release in early November 2025, SGLang-Diffusion has garnered widespread attention and adoption in the community. We are deeply grateful for the extensive feedback and contributions from open-source developers.

Over the past two months, we have continued to refine and optimize SGLang-Diffusion. The current version (Docker image tag: lmsysorg/sglang:dev-pr-17247) is up to 2.5x faster than the initial release.

Overview

New Model Support

  • Support for new models, including Flux.2, Qwen-Image-Edit-2511, Z-Image-Turbo, and more.
  • Compatibility with the diffusers backend, with more improvements planned (see Issue #16642).

LoRA Support

LoRA adapters are supported for almost all supported models. The following LoRAs have been tested and verified:

Base Model        Supported LoRAs
Wan2.2            lightx2v/Wan2.2-Distill-Loras
                  Cseti/wan2.2-14B-Arcane_Jinx-lora-v1
Wan2.1            lightx2v/Wan2.1-Distill-Loras
Z-Image-Turbo     tarn59/pixel_art_style_lora_z_image_turbo
                  wcde/Z-Image-Turbo-DeJPEG-Lora
Qwen-Image        lightx2v/Qwen-Image-Lightning
                  flymy-ai/qwen-image-realism-lora
                  prithivMLmods/Qwen-Image-HeadshotX
                  starsfriday/Qwen-Image-EVA-LoRA
Qwen-Image-Edit   ostris/qwen_image_edit_inpainting
                  lightx2v/Qwen-Image-Edit-2511-Lightning
Flux              dvyio/flux-lora-simple-illustration
                  XLabs-AI/flux-furry-lora
                  XLabs-AI/flux-RealismLora

We provide comprehensive HTTP API support for LoRA configuration, merging, and management.

Parallelism

Sequence parallelism (SP) and tensor parallelism (TP) are both supported, as is hybrid parallelism that combines Ulysses Parallel, Ring Parallel, and Tensor Parallel.
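
To make the hybrid layout concrete, the sketch below enumerates the ways a GPU count could be split across the three parallel degrees. It assumes, for illustration only, that the product of the Ulysses, Ring, and Tensor Parallel degrees must equal the number of GPUs; the function name and tuple layout are hypothetical and are not SGLang-Diffusion's API.

```python
from itertools import product


def hybrid_layouts(num_gpus: int) -> list[tuple[int, int, int]]:
    """Return every (ulysses, ring, tp) triple whose product is num_gpus.

    Assumption for illustration: the sequence-parallel degree is
    ulysses * ring, and the world size is ulysses * ring * tp.
    """
    divisors = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    return [
        (u, r, t)
        for u, r, t in product(divisors, repeat=3)
        if u * r * t == num_gpus
    ]


if __name__ == "__main__":
    for ulysses, ring, tp in hybrid_layouts(8):
        print(f"ulysses={ulysses} ring={ring} tp={tp} (sp={ulysses * ring})")
```

Which layout is fastest depends on the model, sequence length, and interconnect, so a list like this is only a starting point for benchmarking.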

Hardware Support

Compatible with AMD GPUs, NVIDIA RTX 4090 and RTX 5090, and MUSA hardware.

SGLang-Diffusion and ComfyUI Integration

We have implemented a flexible ComfyUI custom node that integrates SGLang-Diffusion's high-performance inference engine. Users can improve performance by replacing ComfyUI's loader with the SGL-Diffusion UNET Loader.

SGLang-Diffusion plugin in ComfyUI

Performance Benchmarks

We have conducted multiple performance tests on SGLang-Diffusion, achieving state-of-the-art speeds on NVIDIA GPUs, up to 5x faster than other solutions.

We also conducted performance evaluations on AMD GPUs:

Key Improvements

1. Layerwise Offloading

We introduced LayerwiseOffloadManager and OffloadableDiTMixin, which prefetch the next layer's weights while the current layer is computing. This overlaps weight loading with computation and reduces VRAM usage.

Comparison between standard loading and layerwise offloading
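
The prefetch pattern behind layerwise offloading can be sketched in a few lines. This is a toy model, not the LayerwiseOffloadManager implementation: a background thread stands in for the host-to-device copy, each "layer" is a scale-and-shift on a single number, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(host_weights: list[list[float]], index: int) -> list[float]:
    """Stand-in for copying one layer's weights from host to device."""
    return list(host_weights[index])


def run_layerwise(host_weights: list[list[float]], x: float) -> float:
    """Run layers in order while prefetching the next layer's weights.

    At most two layers are resident at once: the one being computed
    and the one being prefetched.
    """
    if not host_weights:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, host_weights, 0)
        for i in range(len(host_weights)):
            current = pending.result()  # wait until this layer is loaded
            if i + 1 < len(host_weights):
                # Start loading layer i + 1 while layer i computes.
                pending = pool.submit(load_layer, host_weights, i + 1)
            scale, shift = current
            x = scale * x + shift  # toy "layer" computation
            del current  # drop the finished layer's weights
    return x


if __name__ == "__main__":
    print(run_layerwise([[2.0, 1.0], [3.0, 0.0]], 1.0))
```

On a GPU the same idea is typically expressed with a separate copy stream and pinned host memory, so that the transfer for the next layer overlaps with the kernels of the current one.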

2. Kernel Improvements

  • Synchronized the latest FlashAttention kernels to eliminate performance lag.
  • Optimized QKV processing to reduce intermediate tensor generation.
  • RoPE optimization leveraging FlashInfer implementation to reduce overhead.
  • Weight fusion to reduce GEMM count.
  • CUDA kernel implementation for timesteps.
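
As an illustration of how weight fusion reduces the GEMM count, the sketch below concatenates three projection matrices so that one matrix multiply replaces three. Treating Q, K, and V projections as the fused weights is an assumption made for the example; the source does not say which weights are fused. The helpers are plain Python and hypothetical, chosen to keep the example dependency-free.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply a (n x k) by b (k x m), both stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse_columns(*weights: Matrix) -> Matrix:
    """Concatenate matrices with equal row counts along the column axis."""
    return [sum((list(w[i]) for w in weights), []) for i in range(len(weights[0]))]


def split_columns(m: Matrix, sizes: list[int]) -> list[Matrix]:
    """Split a matrix into column blocks of the given widths."""
    out, start = [], 0
    for size in sizes:
        out.append([row[start:start + size] for row in m])
        start += size
    return out


if __name__ == "__main__":
    x = [[1.0, 2.0]]
    w_q = [[1.0, 0.0], [0.0, 1.0]]
    w_k = [[2.0, 0.0], [0.0, 2.0]]
    w_v = [[0.0, 1.0], [1.0, 0.0]]

    # Fused path: one GEMM against the concatenated weight, then a split.
    w_qkv = fuse_columns(w_q, w_k, w_v)
    q, k, v = split_columns(matmul(x, w_qkv), [2, 2, 2])

    # Unfused path: three separate GEMMs give the same results.
    assert q == matmul(x, w_q)
    assert k == matmul(x, w_k)
    assert v == matmul(x, w_v)
    print(q, k, v)
```

The arithmetic is identical in both paths; the benefit of fusion comes from launching one larger GEMM instead of several small ones.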

3. Cache-DiT Integration

We integrated Cache-DiT🤗 into SGLang-Diffusion. The integration is compatible with the various parallel modes, and the speedup can be enabled with a few environment variable settings.
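
The general idea behind caching across denoising steps can be sketched as follows. This is a generic illustration, not Cache-DiT's algorithm or API: the class name, the threshold rule, and the relative L1 difference are all assumptions made for the example.

```python
from typing import Callable, Optional


class StepCache:
    """Reuse the last computed output when the input has barely changed."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.last_input: Optional[list[float]] = None
        self.last_output: Optional[list[float]] = None
        self.hits = 0

    def __call__(
        self, fn: Callable[[list[float]], list[float]], x: list[float]
    ) -> list[float]:
        if self.last_input is not None and self.last_output is not None:
            # Relative L1 change versus the input of the last real computation.
            diff = sum(abs(a - b) for a, b in zip(x, self.last_input))
            norm = sum(abs(b) for b in self.last_input) or 1.0
            if diff / norm < self.threshold:
                self.hits += 1
                return self.last_output
        self.last_input = list(x)
        self.last_output = fn(x)
        return self.last_output


if __name__ == "__main__":
    def expensive_block(x: list[float]) -> list[float]:
        return [2.0 * v for v in x]

    cache = StepCache(threshold=0.1)
    print(cache(expensive_block, [1.0, 1.0]))   # computed
    print(cache(expensive_block, [1.01, 1.0]))  # small change, cached result reused
    print(cache(expensive_block, [2.0, 2.0]))   # large change, recomputed
    print("cache hits:", cache.hits)
```

The threshold trades speed for fidelity: a larger value skips more computation but lets the reused output drift further from what a full recomputation would give.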

4. Other Improvements

  • Memory monitoring: Provides peak usage statistics in offline and online workflows.
  • Comprehensive performance profiling toolkit.
  • Optimization guides included in the Diffusion Cookbook.

Future Plans

  • Sparse attention backend
  • Quantization support
  • Consumer GPU optimizations
  • Joint design with sglang-omni