SGLang Empowers Diffusion Large Models: LLaDA 2.0 Now Supported

Feb 4, 2026 | Source: LMSYS

LMSYS SGLang dLLM LLaDA 2.0 Diffusion LLM Inference Optimization

TL;DR

We are excited to introduce the design and implementation of the Diffusion Large Language Model (dLLM) framework within SGLang. By leveraging the existing ChunkedPrefill mechanism, this system achieves:

Seamless Integration: Built into the SGLang ecosystem without core architecture changes.
Performance Inheritance: Benefits from existing inference optimization techniques.
Maximum Flexibility: Users can fully customize diffusion decoding algorithms.

Background

Motivation

Earlier this year, LLaDA debuted as the first Diffusion Large Language Model (dLLM), quickly attracting attention from both academia and industry. This collaboration between Renmin University of China and Ant Group showed that the dLLM execution paradigm offers strong data understanding and, in low-latency small-batch scenarios, inference speeds exceeding those of Auto-Regressive (AR) models.

As dLLM parameter scales expand, we observe scaling-law effects similar to AR LLMs. In pursuit of stronger dLLMs, we trained the 100B-parameter LLaDA2.0-flash model.

However, during the training of LLaDA2.0-flash, we faced a series of AI infrastructure engineering challenges, particularly regarding the efficiency and stability of model evaluation and RL post-training.

Challenges

Existing dLLM inference engines cannot meet the evaluation and RL post-training requirements for large-scale models. For example, while Fast-dLLM is excellent, it's better suited for algorithm research and lacks production-grade service capabilities such as batching, scheduling, RL ecosystem integration, and parallelization.

In contrast, SGLang is one of the most widely used LLM inference engines and offers several advantages:

Production-Ready: Deployed at thousands of companies, providing mature and reliable engineering capabilities.
Technical Leadership: Integrates numerous advanced inference optimizations, with continuous community contributions.
Complete Ecosystem: Tightly integrated with RL post-training ecosystems, especially distributed GPU peer-to-peer (P2P) weight updates.

However, SGLang currently supports only the Auto-Regressive computation paradigm and has not been adapted to diffusion computation. The challenge, therefore, is to introduce dLLM support without breaking the existing architecture, so that dLLMs benefit from all of SGLang's optimizations.

Design

Key Insights

Based on current dLLM development, we distilled several key insights:

Because fully bidirectional-attention diffusion is computationally expensive and makes poor use of the KV Cache, mainstream dLLMs are shifting toward Block Diffusion architectures.
The Block Diffusion computation pattern is highly similar to SGLang's existing Chunked Prefill process.
Unlike AR models, diffusion language models rely on a variety of decoding strategies, so a dedicated interface is needed for flexible customization.

Architecture

We leverage SGLang's existing Chunked-Prefill pipeline to implement Block Diffusion LLM computation support. This method seamlessly integrates dLLMs without modifying the core framework, directly benefiting from SGLang's accumulated inference optimizations.

Main execution flow diagram

As shown in the diagram, our modifications to SGLang are deliberately minimal and stay at the periphery of the core. The original generate request execution flow remains unchanged; the dLLM path mainly reuses and modifies Chunked Prefill, specifically the prefill adder and chunked reqs components.

Chunked Prefill in SGLang aims to maximize GPU utilization, with a single chunk typically sized at 2K-16K tokens. dLLM decoding, however, proceeds at block granularity (e.g., 32-token blocks in LLaDA2.0). Treating each block as a single large request would leave most of the chunk budget unused and waste GPU capacity, so efficient batching is required. We modified chunked reqs and the prefill adder to process multiple diffusion blocks within a single compute cycle.
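
To make the batching argument concrete, here is a minimal sketch of packing many 32-token diffusion blocks into one chunk-sized compute cycle. It is an illustration only, not SGLang's scheduler code: PendingRequest, pack_blocks, and the choice of one block per request per cycle are assumptions made for this example, and the 2048-token budget is simply the lower end of the 2K-16K range mentioned above.

```python
from dataclasses import dataclass

BLOCK_SIZE = 32      # tokens per diffusion block (the LLaDA2.0 example from the text)
CHUNK_BUDGET = 2048  # tokens per compute cycle (lower end of the 2K-16K range)


@dataclass
class PendingRequest:
    """Hypothetical stand-in for a request that still has blocks to decode."""
    req_id: str
    blocks_left: int


def pack_blocks(requests, chunk_budget=CHUNK_BUDGET, block_size=BLOCK_SIZE):
    """Take at most one block per request until the token budget is used up.

    Assumption: a request contributes one block per cycle, because its next
    block can only be decoded after the current one is finished.
    """
    batch, used = [], 0
    for req in requests:
        if req.blocks_left == 0:
            continue  # nothing left to decode for this request
        if used + block_size > chunk_budget:
            break  # budget exhausted; remaining requests wait for the next cycle
        batch.append(req.req_id)
        used += block_size
    return batch, used


requests = [PendingRequest(f"req-{i}", blocks_left=4) for i in range(100)]
batch, used = pack_blocks(requests)
print(len(batch), used)  # 64 blocks fill the 2048-token budget
```

With a single 32-token block per cycle, only 32 of the 2048 budgeted tokens would be used; packing blocks from 64 requests fills the budget completely.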

Additionally, at the decoding execution layer, we insert a diffusion algorithm abstraction layer between TP Worker and Model Runner:

The Worker enters a dedicated branch when it detects a diffusion model.
It calls the diffusion algorithm's run function.
The algorithm internally drives the Model Runner through a loop of forward passes until the whole block is decoded.
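
The three steps above can be sketched as a small loop. This is a toy illustration under stated assumptions, not SGLang's actual interface: ToyModelRunner, LowConfidenceSketch, the MASK id, and the 0.9 threshold are all hypothetical, and the acceptance rule is only loosely modeled on the general idea of keeping low-confidence positions masked for another pass.

```python
import random

MASK = -1  # hypothetical mask-token id used only in this sketch


class ToyModelRunner:
    """Stand-in for the Model Runner: returns a (token, confidence) guess per position."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def forward(self, block):
        # A real runner would execute the model; here we draw random guesses.
        return [(self.rng.randrange(1000), self.rng.random()) for _ in block]


class LowConfidenceSketch:
    """Hypothetical diffusion algorithm: commit confident tokens, keep the rest masked."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def run(self, model_runner, block_size=32):
        block = [MASK] * block_size
        steps = 0
        while MASK in block:
            preds = model_runner.forward(block)
            steps += 1
            masked = [i for i, tok in enumerate(block) if tok == MASK]
            # Commit every masked position whose confidence clears the threshold.
            accepted = [i for i in masked if preds[i][1] >= self.threshold]
            if not accepted:
                # Commit the single most confident position so the loop always terminates.
                accepted = [max(masked, key=lambda i: preds[i][1])]
            for i in accepted:
                block[i] = preds[i][0]
        return block, steps


block, steps = LowConfidenceSketch().run(ToyModelRunner())
print(len(block), steps)
```

Keeping the loop inside the algorithm object is what lets users swap decoding strategies without touching the Worker or the Model Runner: only the run method changes.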

Attention Mask

Causal mask comparison diagram

The biggest difference between Block Diffusion and Chunked Prefill in a single forward pass lies in how the attention mask is handled:

Block Diffusion uses block-wise causal masks.
AR model Chunked Prefill uses traditional token-wise causal masks.

Block Diffusion can be viewed as a functional extension of Chunked Prefill. A single forward pass involves two computational parts, whose outputs are concatenated:

Context Query: Unmasked attention between the current Q_curr and the existing KV Cache, identical to Chunked Prefill, so that the current block attends to all historical context.
Intra-Block Query: Attention between the current Q_curr and its own KV.
- Block Diffusion uses bidirectional attention.
- Chunked Prefill uses a causal mask.

Visualizing the attention mask for Q_curr:

Chunked Prefill (causal) produces a trapezoidal/triangular mask.
Block Diffusion (bidirectional) produces a rectangular mask.
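
The two mask shapes can be reproduced with a few lines of plain Python. This is a sketch for illustration only: the function names are hypothetical, and the masks are built as nested lists rather than the tensors an attention kernel would use. Rows are the queries in Q_curr; columns are the cached context positions followed by the current chunk or block.

```python
def token_causal_mask(num_ctx, num_new):
    """Chunked Prefill: new token i sees all context plus new tokens 0..i."""
    return [
        [1] * num_ctx + [1 if j <= i else 0 for j in range(num_new)]
        for i in range(num_new)
    ]


def block_diffusion_mask(num_ctx, block_size):
    """Block Diffusion: every token in the block sees all context and the whole block."""
    return [[1] * (num_ctx + block_size) for _ in range(block_size)]


def show(mask):
    """Render a mask with 'x' for visible positions and '.' for masked ones."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


print(show(token_causal_mask(num_ctx=3, num_new=4)))
print()
print(show(block_diffusion_mask(num_ctx=3, block_size=4)))
```

With 3 context tokens and 4 new tokens, the first printout widens by one position per row (the trapezoid), while the second is a full 4 by 7 rectangle.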

Streaming Output Demo

The following animation compares streaming output between LLaDA2.0-flash-CAP (100B / BF16) and gpt-oss-120B (117B / MXFP4). LLaDA2.0-flash-CAP runs on SGLang dLLM with TP8 on 8×H20, while gpt-oss-120B runs through the standard AR path on the same hardware.

Task: Implementing quicksort in 10 programming languages, a scenario where diffusion LLMs excel. As shown, LLaDA2.0-flash-CAP achieves a throughput of 935 tokens/s, roughly 3.6× gpt-oss-120B's 263 tokens/s.

LLaDA2.0-flash-CAP vs gpt-oss-120B output comparison animation

SGLang dLLM supports streaming output like AR models, but emits tokens one block at a time (e.g., 32 tokens per update).

dLLM streaming output animation

Usage

Launch Command Example

# --model-path: example Hugging Face or local path
# --dllm-algorithm-config: optional; the algorithm's default config is used if omitted
python3 -m sglang.launch_server \
  --model-path inclusionAI/LLaDA2.0-mini \
  --dllm-algorithm LowConfidence \
  --dllm-algorithm-config ./config.yaml \
  --host 0.0.0.0 \
  --port 30000

Note: Use --dllm-algorithm-config to pass advanced settings to the algorithm selected with --dllm-algorithm. Keeping configuration separate from code gives user-defined algorithms a uniform way to receive parameters.

Client Code Example

Like other supported models, dLLMs can be used via REST API or offline engine API.

Curl generation request example:

curl -X POST "http://127.0.0.1:30000/generate" \
     -H "Content-Type: application/json" \
     -d '{
        "text": [
            "<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role>Write the number from 1 to 128<|role_end|><role>ASSISTANT</role>",
            "<role>SYSTEM<"