Since its release in early November 2025, SGLang-Diffusion has garnered widespread attention and adoption in the community. We are deeply grateful for the extensive feedback and contributions from open-source developers.
Over the past two months, we have continued to refine and optimize SGLang-Diffusion. The current version (Docker image tag: lmsysorg/sglang:dev-pr-17247) is up to 2.5x faster than the initial release.
Overview
New Model Support
- Support for various new models including Flux.2, Qwen-Image-Edit-2511, Z-Image-Turbo, and more.
- Compatible with the diffusers backend, with more improvements planned (see Issue #16642).
LoRA Support
LoRA adapters are supported for almost all supported models. The following LoRAs have been tested and verified:
| Base Model | Supported LoRAs |
|---|---|
| Wan2.2 | lightx2v/Wan2.2-Distill-Loras, Cseti/wan2.2-14B-Arcane_Jinx-lora-v1 |
| Wan2.1 | lightx2v/Wan2.1-Distill-Loras |
| Z-Image-Turbo | tarn59/pixel_art_style_lora_z_image_turbo, wcde/Z-Image-Turbo-DeJPEG-Lora |
| Qwen-Image | lightx2v/Qwen-Image-Lightning, flymy-ai/qwen-image-realism-lora, prithivMLmods/Qwen-Image-HeadshotX, starsfriday/Qwen-Image-EVA-LoRA |
| Qwen-Image-Edit | ostris/qwen_image_edit_inpainting, lightx2v/Qwen-Image-Edit-2511-Lightning |
| Flux | dvyio/flux-lora-simple-illustration, XLabs-AI/flux-furry-lora, XLabs-AI/flux-RealismLora |
We provide comprehensive HTTP API support for LoRA configuration, merging, and management.
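For readers unfamiliar with what "merging" a LoRA means, the sketch below shows the standard arithmetic, W' = W + (alpha / r) * B A, in plain Python. It is not SGLang-Diffusion's implementation and does not use its HTTP API; the function names `merge_lora` and `unmerge_lora` are hypothetical and the shapes are toy-sized.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for tiny illustrative shapes."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def _apply_delta(weight: Matrix, lora_a: Matrix, lora_b: Matrix, scale: float) -> Matrix:
    """Return weight + scale * (lora_b @ lora_a) without mutating the inputs."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float) -> Matrix:
    """Fold a LoRA into the base weight.

    Shapes: weight (out, in), lora_b (out, r), lora_a (r, in); r is the rank.
    """
    rank = len(lora_a)
    return _apply_delta(weight, lora_a, lora_b, alpha / rank)


def unmerge_lora(merged: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float) -> Matrix:
    """Undo merge_lora by subtracting the same scaled low-rank update."""
    rank = len(lora_a)
    return _apply_delta(merged, lora_a, lora_b, -alpha / rank)


# Toy example: 2x2 base weight with a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]        # (r=1, in=2)
lora_b = [[0.5], [1.0]]      # (out=2, r=1)
merged = merge_lora(base, lora_a, lora_b, alpha=2.0)
restored = unmerge_lora(merged, lora_a, lora_b, alpha=2.0)
```

Once merged, inference costs the same as the base model because no extra low-rank matmuls run per layer. Keeping the adapter separate makes it cheaper to swap adapters at runtime.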
Parallelism
Support for sequence parallelism (SP) and tensor parallelism (TP), as well as hybrid parallelism (a combination of Ulysses Parallel, Ring Parallel, and Tensor Parallel).
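To make the idea of hybrid parallelism concrete, the sketch below decomposes a flat GPU rank into (tensor, Ulysses, ring) coordinates and lists which ranks communicate along each axis. The axis ordering (tensor parallel varying fastest, then Ulysses, then ring) and the names `rank_to_coord` and `groups_along` are assumptions made for illustration; the layout SGLang-Diffusion actually uses may differ.

```python
from typing import Dict, List, NamedTuple, Tuple


class ParallelCoord(NamedTuple):
    tp: int
    ulysses: int
    ring: int


def rank_to_coord(rank: int, tp_size: int, ulysses_size: int, ring_size: int) -> ParallelCoord:
    """Map a flat rank to its position on each parallel axis (assumed ordering)."""
    world_size = tp_size * ulysses_size * ring_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return ParallelCoord(
        tp=rank % tp_size,
        ulysses=(rank // tp_size) % ulysses_size,
        ring=rank // (tp_size * ulysses_size),
    )


def groups_along(axis: str, tp_size: int, ulysses_size: int, ring_size: int) -> List[List[int]]:
    """Return the rank groups that differ only in the given axis."""
    if axis not in ParallelCoord._fields:
        raise ValueError(f"unknown axis {axis!r}")
    buckets: Dict[Tuple[int, ...], List[int]] = {}
    for rank in range(tp_size * ulysses_size * ring_size):
        coord = rank_to_coord(rank, tp_size, ulysses_size, ring_size)
        # Ranks sharing every coordinate except `axis` belong to one group.
        key = tuple(v for name, v in zip(coord._fields, coord) if name != axis)
        buckets.setdefault(key, []).append(rank)
    return sorted(buckets.values())


# 8 GPUs split as 2-way tensor x 2-way Ulysses x 2-way ring.
tp_groups = groups_along("tp", 2, 2, 2)
ulysses_groups = groups_along("ulysses", 2, 2, 2)
ring_groups = groups_along("ring", 2, 2, 2)
```

The product of the three degrees must equal the number of GPUs. Each rank belongs to exactly one group per axis.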
Hardware Support
Compatible with AMD GPUs, NVIDIA RTX 4090 and RTX 5090 GPUs, and MUSA hardware.
SGLang-Diffusion and ComfyUI Integration
We have implemented a flexible ComfyUI custom node that integrates SGLang-Diffusion's high-performance inference engine. Users can improve performance by replacing ComfyUI's loader with the SGL-Diffusion UNET Loader.

SGLang-Diffusion plugin in ComfyUI
Performance Benchmarks
We have conducted multiple performance tests on SGLang-Diffusion, achieving state-of-the-art speeds on NVIDIA GPUs, up to 5x faster than other solutions.
We also conducted performance evaluations on AMD GPUs:
Key Improvements
1. Layerwise Offloading
We introduced LayerwiseOffloadManager and OffloadableDiTMixin to prefetch weights for the next layer during computation and optimize VRAM usage.

Comparison between standard loading and layerwise offloading
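The prefetch-while-computing pattern can be illustrated with a small CPU-only simulation. This is a sketch of the general idea, not the LayerwiseOffloadManager or OffloadableDiTMixin code. The class `LayerwisePrefetcher`, the function `run_forward`, and the toy "layer" computation are hypothetical, and a background thread stands in for an asynchronous host-to-device copy.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List, Tuple


class LayerwisePrefetcher:
    """Keeps at most the current layer plus one prefetched layer 'on device'."""

    def __init__(self, host_weights: List[List[float]]) -> None:
        self._host = host_weights
        self._device: Dict[int, List[float]] = {}
        self._pending: Dict[int, Future] = {}
        self._pool = ThreadPoolExecutor(max_workers=1)
        self.peak_resident = 0

    def _load(self, idx: int) -> List[float]:
        # Stand-in for an asynchronous host-to-device weight copy.
        return list(self._host[idx])

    def _track(self) -> None:
        resident = len(self._device) + len(self._pending)
        self.peak_resident = max(self.peak_resident, resident)

    def prefetch(self, idx: int) -> None:
        """Start loading layer idx in the background if it is not already loaded."""
        in_range = 0 <= idx < len(self._host)
        if in_range and idx not in self._device and idx not in self._pending:
            self._pending[idx] = self._pool.submit(self._load, idx)
        self._track()

    def acquire(self, idx: int) -> List[float]:
        """Block until layer idx is resident, then return its weights."""
        self.prefetch(idx)
        if idx in self._pending:
            self._device[idx] = self._pending.pop(idx).result()
        self._track()
        return self._device[idx]

    def release(self, idx: int) -> None:
        """Drop layer idx from the device to free memory."""
        self._device.pop(idx, None)

    def close(self) -> None:
        self._pool.shutdown()


def run_forward(x: float, host_weights: List[List[float]]) -> Tuple[float, int]:
    manager = LayerwisePrefetcher(host_weights)
    try:
        manager.prefetch(0)
        for i in range(len(host_weights)):
            weights = manager.acquire(i)
            manager.prefetch(i + 1)          # overlaps with the compute below
            x = sum(w * x for w in weights)  # toy stand-in for a layer
            manager.release(i)
    finally:
        manager.close()
    return x, manager.peak_resident


output, peak_layers = run_forward(1.0, [[1.0, 1.0], [2.0], [0.5, 0.5]])
```

In this simulation `peak_layers` never exceeds two, however many layers the model has. That is the memory saving layerwise offloading aims for; the cost is that each layer's load must finish before its compute starts.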
2. Kernel Improvements
- Synced with the latest upstream FlashAttention kernels to close the performance gap.
- Optimized QKV processing to reduce the number of intermediate tensors.
- RoPE optimization that uses the FlashInfer implementation to reduce overhead.
- Weight fusion to reduce GEMM count.
- CUDA kernel implementation for timesteps.
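The weight fusion item above can be illustrated with the common QKV case: concatenating the three projection matrices along the output dimension turns three GEMMs into one, followed by a cheap split. This is a minimal plain-Python sketch under that assumption; which weights SGLang-Diffusion actually fuses is not specified here, and `fuse_qkv` and `fused_qkv_projection` are hypothetical names.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for tiny illustrative shapes."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse_qkv(wq: Matrix, wk: Matrix, wv: Matrix) -> Matrix:
    """Concatenate (in, out) weight matrices along the output dimension."""
    return [q_row + k_row + v_row for q_row, k_row, v_row in zip(wq, wk, wv)]


def fused_qkv_projection(
    x: Matrix, w_qkv: Matrix, q_dim: int, k_dim: int
) -> Tuple[Matrix, Matrix, Matrix]:
    """Run one GEMM against the fused weight, then slice out Q, K and V."""
    out = matmul(x, w_qkv)
    q = [row[:q_dim] for row in out]
    k = [row[q_dim:q_dim + k_dim] for row in out]
    v = [row[q_dim + k_dim:] for row in out]
    return q, k, v


# Toy shapes: one token with hidden size 2; Q and V are width 2, K is width 1.
x = [[1.0, 2.0]]
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[2.0], [3.0]]
wv = [[1.0, 1.0], [1.0, 1.0]]

w_qkv = fuse_qkv(wq, wk, wv)  # done once at load time
q, k, v = fused_qkv_projection(x, w_qkv, q_dim=2, k_dim=1)
```

The fused result matches three separate projections. On a GPU the gain comes from launching one larger GEMM instead of three smaller ones.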
3. Cache-DiT Integration
We seamlessly integrated Cache-DiT🤗 into SGLang-Diffusion, compatible with various parallel modes, improving generation speed through simple environment variable settings.
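As background on why caching speeds up diffusion transformers, the sketch below shows a simplified residual cache: when the first block's output changes little between consecutive denoising steps, the remaining blocks are skipped and their previous contribution is reused. This is a deliberately reduced illustration, not Cache-DiT's actual algorithm or configuration; the `ResidualCache` class, its `threshold` parameter, and the toy blocks are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def relative_diff(current: Vector, previous: Vector) -> float:
    """Mean absolute change relative to the previous magnitude."""
    change = sum(abs(c - p) for c, p in zip(current, previous))
    scale = sum(abs(p) for p in previous) + 1e-12
    return change / scale


class ResidualCache:
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.prev_first_residual: Optional[Vector] = None
        self.tail_residual: Optional[Vector] = None
        self.hits = 0

    def step(self, x: Vector, first_block: Block, tail_blocks: Sequence[Block]) -> Vector:
        """Run one denoising step, skipping tail_blocks when the cache is valid."""
        h1 = first_block(x)  # always computed; serves as a cheap change probe
        first_residual = [a - b for a, b in zip(h1, x)]

        if (
            self.prev_first_residual is not None
            and self.tail_residual is not None
            and relative_diff(first_residual, self.prev_first_residual) < self.threshold
        ):
            # Cache hit: reuse the tail blocks' contribution from an earlier step.
            self.hits += 1
            return [a + r for a, r in zip(h1, self.tail_residual)]

        # Cache miss: run every block and refresh the cached residuals.
        h = h1
        for block in tail_blocks:
            h = block(h)
        self.prev_first_residual = first_residual
        self.tail_residual = [a - b for a, b in zip(h, h1)]
        return h


def double(v: Vector) -> Vector:
    return [2.0 * t for t in v]


def add_one(v: Vector) -> Vector:
    return [t + 1.0 for t in v]


cache = ResidualCache(threshold=0.05)
out1 = cache.step([1.0, 2.0], double, [add_one])    # first step: full compute
out2 = cache.step([1.01, 2.0], double, [add_one])   # small change: cache hit
out3 = cache.step([5.0, 2.0], double, [add_one])    # large change: full compute
```

A higher threshold skips more computation but lets the output drift further from the uncached result, so the value is a speed versus quality trade-off.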
4. Other Improvements
- Memory monitoring: Provides peak usage statistics in offline and online workflows.
- Comprehensive performance profiling toolkit.
- Optimization guides included in the Diffusion Cookbook.
Future Plans
- Sparse attention backend
- Quantization support
- Consumer GPU optimizations
- Joint design with sglang-omni