SGLang Diffusion: Accelerating Video and Image Generation

We are excited to introduce SGLang Diffusion, which brings SGLang's leading performance to the realm of diffusion model image and video generation.

SGLang Diffusion supports mainstream open-source video and image generation models, including the Wan series, Hunyuan, Qwen-Image, Qwen-Image-Edit, and Flux. It combines fast inference with ease of use through three interfaces (an OpenAI-compatible API, a CLI, and a Python interface), and delivers 1.2x to 5.9x speedups across diverse workloads.

In collaboration with the FastVideo team, we have built a complete ecosystem for diffusion models, from post-training to production deployment. The code is open-sourced on GitHub.

SGLang Diffusion Performance Benchmarks on H100 GPU

SGLang Diffusion Performance Benchmarks on H200 GPU

Why Bring Diffusion to SGLang?

As diffusion models become the core technology for image and video generation, the community has strongly called for extending SGLang's high performance and seamless experience to these modalities. We developed SGLang Diffusion to address this need, providing a unified high-performance engine that supports both language and diffusion tasks.

This unified approach matters because the architectures behind generative models are likely to converge. Pioneering models such as ByteDance's Bagel, Meta's Transfusion, and NVIDIA's Fast-dLLM v2 already combine autoregressive (AR) and diffusion methods. SGLang Diffusion is designed as a future-proof, high-performance solution for this direction.

Architecture

SGLang Diffusion builds on SGLang's mature serving architecture, inheriting its powerful scheduler and optimized sgl-kernel, ensuring both performance and flexibility.

At its core is ComposedPipelineBase, a flexible abstraction that orchestrates multiple modular PipelineStage components, such as DenoisingStage for denoising loops or DecodingStage for VAE decoding, allowing developers to easily build custom pipelines.
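To make the stage-composition idea concrete, here is a minimal toy sketch. It does not use SGLang's real classes: `ToyStage`, `ToyDenoisingStage`, `ToyDecodingStage`, and `ToyComposedPipeline` are hypothetical names invented for illustration, and the "denoising" and "decoding" steps are placeholder arithmetic rather than model calls. The actual ComposedPipelineBase and PipelineStage interfaces may look quite different.

```python
from typing import Any, Dict, Iterable


class ToyStage:
    """Hypothetical stand-in for a pipeline stage: transforms a shared state dict."""

    def run(self, state: Dict[str, Any]) -> Dict[str, Any]:
        raise NotImplementedError


class ToyDenoisingStage(ToyStage):
    """Runs a fixed number of placeholder 'denoising' steps on the latent."""

    def __init__(self, num_steps: int) -> None:
        self.num_steps = num_steps

    def run(self, state: Dict[str, Any]) -> Dict[str, Any]:
        latent = state["latent"]
        for _ in range(self.num_steps):
            # Placeholder for one denoising step (a real stage would call the model).
            latent = [0.5 * x for x in latent]
        state["latent"] = latent
        return state


class ToyDecodingStage(ToyStage):
    """Placeholder for VAE decoding: copies the latent into the output slot."""

    def run(self, state: Dict[str, Any]) -> Dict[str, Any]:
        state["output"] = list(state["latent"])
        return state


class ToyComposedPipeline:
    """Runs stages in order, passing the state from one stage to the next."""

    def __init__(self, stages: Iterable[ToyStage]) -> None:
        self.stages = list(stages)

    def __call__(self, state: Dict[str, Any]) -> Dict[str, Any]:
        for stage in self.stages:
            state = stage.run(state)
        return state


pipeline = ToyComposedPipeline([ToyDenoisingStage(num_steps=3), ToyDecodingStage()])
result = pipeline({"latent": [8.0, -4.0]})
print(result["output"])  # [1.0, -0.5]
```

The point of the design is that a custom pipeline is just a different list of stages; swapping or inserting a stage does not require touching the others.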

To achieve top speeds, we integrate advanced parallelism techniques: core Transformers support Unified Sequence Parallelism (USP, including Ulysses-SP and Ring-Attention), while other components support CFG-parallelism and tensor parallelism (TP).
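As a rough illustration of why CFG-parallelism helps, recall that classifier-free guidance needs two independent forward passes per step (conditional and unconditional), which are then blended. Because neither pass depends on the other, they can run concurrently on separate devices. The sketch below uses two threads as a stand-in for two GPUs; `toy_denoiser` and `cfg_step` are hypothetical helpers written for this example, not SGLang functions, and the arithmetic is a placeholder for a real model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def toy_denoiser(latent: List[float], conditioning: float) -> List[float]:
    """Placeholder for a model forward pass; returns a fake noise prediction."""
    return [x * conditioning for x in latent]


def cfg_step(latent: List[float], cond: float, uncond: float,
             guidance_scale: float) -> List[float]:
    """Run the conditional and unconditional branches concurrently, then blend."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Each submit() stands in for dispatching one branch to its own GPU.
        cond_future = pool.submit(toy_denoiser, latent, cond)
        uncond_future = pool.submit(toy_denoiser, latent, uncond)
        noise_cond = cond_future.result()
        noise_uncond = uncond_future.result()
    # Standard classifier-free guidance blend: uncond + scale * (cond - uncond).
    return [u + guidance_scale * (c - u) for c, u in zip(noise_cond, noise_uncond)]


guided = cfg_step([1.0, 2.0], cond=1.0, uncond=0.5, guidance_scale=3.0)
print(guided)  # [2.0, 4.0]
```

In a real deployment the two branches would exchange results through a collective communication call rather than shared memory, but the dependency structure is the same.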

The system is based on an enhanced FastVideo branch, developed through close collaboration with their team: SGLang Diffusion focuses on inference acceleration, while FastVideo provides training support such as model distillation.

Model Support

We support popular open-source video and image generation models:

  • Video models: Wan series, FastWan, Hunyuan
  • Image models: Qwen-Image, Qwen-Image-Edit, Flux

See the complete support list here.

Usage

We provide a CLI, a Python engine API, and an OpenAI-compatible API for easy integration.

Installation

# Via pip or uv
uv pip install 'sglang[diffusion]' --prerelease=allow

# From source
git clone https://github.com/sgl-project/sglang.git
cd sglang
uv pip install -e "python[diffusion]" --prerelease=allow

CLI

Start the server and send requests:

sglang serve --model-path black-forest-labs/FLUX.1-dev --port 3000

curl http://127.0.0.1:3000/v1/images/generations \
  -o >(jq -r '.data[0].b64_json' | base64 --decode > example.png) \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "black-forest-labs/FLUX.1-dev",
    "prompt": "A cute baby sea otter",
    "n": 1,
    "size": "1024x1024",
    "response_format": "b64_json"
  }'
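
The same request can also be sent from Python using only the standard library. This is a sketch that mirrors the curl call above: it assumes the server is listening on 127.0.0.1:3000 and returns the `data[0].b64_json` field that the curl example extracts with jq. The helper names `build_request`, `decode_image`, and `save_image` are made up for this example.

```python
import base64
import json
import os
import urllib.request

BASE_URL = "http://127.0.0.1:3000"  # assumes the server started with `sglang serve` above


def build_request(prompt: str,
                  model: str = "black-forest-labs/FLUX.1-dev",
                  size: str = "1024x1024") -> urllib.request.Request:
    """Build the POST request with the same JSON body as the curl example."""
    payload = {
        "model": model,
        "prompt": prompt,
        "n": 1,
        "size": size,
        "response_format": "b64_json",
    }
    headers = {"Content-Type": "application/json"}
    api_key = os.environ.get("OPENAI_API_KEY")
    if api_key:
        # Mirrors the Authorization header in the curl call; sent only if the variable is set.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{BASE_URL}/v1/images/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def decode_image(response_body: bytes) -> bytes:
    """Extract and base64-decode the first image from the JSON response."""
    body = json.loads(response_body)
    return base64.b64decode(body["data"][0]["b64_json"])


def save_image(prompt: str, path: str) -> None:
    """Send the request to the running server and write the image to disk."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        image_bytes = decode_image(resp.read())
    with open(path, "wb") as f:
        f.write(image_bytes)
```

With the server running, `save_image("A cute baby sea otter", "example.png")` should produce the same file as the curl command.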

Or generate images directly:

sglang generate --model-path black-forest-labs/FLUX.1-dev \
  --prompt "A Logo With Bold Large Text: SGL Diffusion" \
  --save-output

See the Installation Guide and CLI Guide for details.

Demo

Text-to-Video: Wan-AI/Wan2.1

sglang generate --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "A curious raccoon" \
    --save-output

Image-to-Video: Wan-AI/Wan2.1-I2V

sglang generate --model-path=Wan-AI/Wan2.1-I2V-14B-480P-Diffusers \
    --prompt="Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard..." \
    --image-path="https://github.com/Wan-Video/Wan2.2/blob/990af50de458c19590c245151197326e208d7191/examples/i2v_input.JPG?raw=true" \
    --num-gpus 2 --enable-cfg-parallel --save-output

Text-to-Image: FLUX

sglang generate --model-path black-forest-labs/FLUX.1-dev \
    --prompt "A Logo With Bold Large Text: SGL Diffusion" \
    --save-output

Text-to-Image: Qwen-Image

sglang generate --model-path=Qwen/Qwen-Image \
    --prompt='A curious raccoon' \
    --width=720 --height=720 --save-output

Image-to-Image: Qwen-Image-Edit

sglang generate --model-path=Qwen/Qwen-Image-Edit \
    --prompt="Convert 2D style to 3D style" --image-path="https://github.com/lm-sys/lm-sys.github.io/releases/download/test/TI2I_Qwen_Image_Edit_Input.jpg" \
    --width=1536 --height=1024 --save-output

[Figure: input image and output image]

Performance Benchmarks

As shown in the charts at the top, SGLang Diffusion achieves top performance in both image and video generation compared to popular baselines such as Hugging Face Diffusers. Parallel configurations such as CFG-Parallel and USP deliver significant additional speedups over single-GPU execution.

Roadmap and Diffusion Ecosystem

We are collaborating with the FastVideo team to build a comprehensive diffusion ecosystem, providing end-to-end solutions from model training to high-performance inference.