TL;DR
We are excited to introduce the design and implementation of the Diffusion Large Language Model (dLLM) framework within SGLang. By leveraging the existing ChunkedPrefill mechanism, this system achieves:
- Seamless Integration: Built into the SGLang ecosystem without core architecture changes.
- Performance Inheritance: Benefits from existing inference optimization techniques.
- Maximum Flexibility: Users can fully customize diffusion decoding algorithms.
Background
Motivation
Earlier this year, LLaDA debuted as the first Diffusion Large Language Model (dLLM), quickly attracting attention from both academia and industry. This collaborative work between Renmin University of China and Ant Group demonstrated dLLM's unique execution paradigm with superior data understanding capabilities, and inference speeds exceeding Auto-Regressive (AR) models in low-latency small-batch scenarios.
As dLLM parameter scales expand, we observe scaling-law effects similar to AR LLMs. In pursuit of stronger dLLMs, we trained the 100B-parameter LLaDA2.0-flash model.
However, during the training of LLaDA2.0-flash, we faced a series of AI infrastructure engineering challenges, particularly regarding the efficiency and stability of model evaluation and RL post-training.
Challenges
Existing dLLM inference engines cannot meet the evaluation and RL post-training requirements for large-scale models. For example, while Fast-dLLM is excellent, it's better suited for algorithm research and lacks production-grade service capabilities such as batching, scheduling, RL ecosystem integration, and parallelization.
In contrast, SGLang is currently the most popular LLM inference engine, offering multiple advantages:
- Production-Ready: Deployed in thousands of companies, providing mature and reliable engineering capabilities.
- Technical Leadership: Integrates numerous advanced inference optimizations, with continuous community contributions.
- Complete Ecosystem: Highly integrated with RL post-training ecosystems, especially distributed weight GPU P2P updates.
However, SGLang currently supports only the Auto-Regressive computation paradigm and has not been adapted to diffusion computation. The challenge, therefore, is to introduce dLLM support without breaking the existing architecture, so that dLLMs benefit from all of SGLang's optimizations.
Design
Key Insights
Based on current dLLM development, we distilled several key insights:
- Due to high computational costs and low KV Cache utilization of Bidirectional Attention Diffusion, mainstream dLLMs are shifting toward Block Diffusion architectures.
- The Block Diffusion computation pattern is highly similar to SGLang's existing Chunked-Prefill process.
- Unlike AR models, diffusion language models rely on a variety of decoding strategies, so a dedicated interface is needed for flexible customization.
Architecture
We leverage SGLang's existing Chunked-Prefill pipeline to implement Block Diffusion LLM computation support. This method seamlessly integrates dLLMs without modifying the core framework, directly benefiting from SGLang's accumulated inference optimizations.
As shown in the diagram, our changes to SGLang are deliberately minimal and stay at the periphery of the core. The original generate request execution flow remains unchanged; the work is concentrated in Chunked Prefill, specifically the prefill adder and chunked reqs components.
Chunked Prefill in SGLang aims to maximize GPU utilization, so a single chunk is typically 2K-16K tokens. dLLM decoding, however, proceeds at block level (e.g., 32-token blocks in LLaDA2.0). Treating each block as a single large request would leave most of the GPU idle, so efficient batching is required. We modified chunked reqs and prefill adder to process multiple Diffusion Blocks within a single compute cycle.
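The batching idea can be illustrated with a minimal sketch. This is not SGLang's actual prefill adder: the names PendingReq, pack_blocks, and token_budget are hypothetical, and the policy (one block per request, first come first served) is an assumption made for illustration. It only shows how many 32-token blocks from different requests could share one compute cycle under a chunk-sized token budget.

```python
from dataclasses import dataclass
from typing import List

BLOCK_SIZE = 32  # diffusion block length, e.g. 32 tokens in LLaDA2.0


@dataclass
class PendingReq:
    """Hypothetical stand-in for a request waiting to decode its next block."""
    rid: str
    blocks_left: int  # how many blocks this request still has to decode


def pack_blocks(queue: List[PendingReq], token_budget: int) -> List[str]:
    """Pick one block per request until the token budget is used up.

    Returns the ids of the requests whose next block joins this forward pass.
    """
    batch: List[str] = []
    used = 0
    for req in queue:
        if req.blocks_left == 0:
            continue  # finished requests contribute nothing
        if used + BLOCK_SIZE > token_budget:
            break  # budget exhausted; remaining requests wait for the next cycle
        batch.append(req.rid)
        used += BLOCK_SIZE
    return batch


queue = [PendingReq(f"req-{i}", blocks_left=4) for i in range(100)]
batch = pack_blocks(queue, token_budget=2048)
print(len(batch))  # 2048 // 32 = 64 blocks share one compute cycle
```

With a 2K budget, a single request would fill only 32 of 2048 token slots per cycle, whereas packing fills all of them with blocks from 64 requests.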
Additionally, at the decoding execution layer, we insert a diffusion algorithm abstraction layer between TP Worker and Model Runner:
- Worker enters a dedicated branch when identifying Diffusion models.
- Calls the Diffusion algorithm's run function.
- Internally drives Model Runner through forward iteration loops until the entire Block is decoded.
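The control flow described above can be sketched roughly as follows. Everything here is an assumption for illustration: the ThresholdUnmasking class, the MASK id, the ForwardFn signature, and the confidence-threshold rule are hypothetical and do not reproduce SGLang's interfaces or its LowConfidence algorithm. The sketch only shows an algorithm object whose run function repeatedly calls a forward function until no position in the block remains masked.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical mask token id

# Hypothetical model-runner signature: given the current block, return one
# (predicted_token, confidence) pair per position.
ForwardFn = Callable[[List[int]], List[Tuple[int, float]]]


class ThresholdUnmasking:
    """Illustrative diffusion decoding algorithm (not SGLang's implementation)."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold

    def run(self, forward: ForwardFn, block: List[int]) -> Tuple[List[int], int]:
        """Iterate forward passes until no position in the block is masked."""
        block = list(block)
        steps = 0
        while MASK in block:
            preds = forward(block)
            steps += 1
            masked = [i for i, tok in enumerate(block) if tok == MASK]
            # Accept every masked position whose confidence clears the threshold.
            accept = [i for i in masked if preds[i][1] >= self.threshold]
            if not accept:
                # Guarantee progress: accept the single most confident position.
                accept = [max(masked, key=lambda i: preds[i][1])]
            for i in accept:
                block[i] = preds[i][0]
        return block, steps


def toy_forward(block: List[int]) -> List[Tuple[int, float]]:
    """Toy model: predicts token 100+i; confidence grows as more tokens are known."""
    known = sum(tok != MASK for tok in block)
    return [(100 + i, 0.5 + 0.1 * known + 0.01 * i) for i in range(len(block))]


decoded, steps = ThresholdUnmasking().run(toy_forward, [MASK] * 8)
print(decoded, steps)
```

Because several positions can be accepted in one iteration, a block can finish in fewer forward passes than it has tokens, which is where the speed-up over token-by-token decoding would come from.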
Attention Mask
Within a single forward pass, the biggest difference between Block Diffusion and Chunk Prefill lies in how the attention mask is handled:
- Block Diffusion uses block-wise causal masks.
- AR model Chunk Prefill uses traditional token-wise causal masks.
Block Diffusion can be viewed as a functional extension of Chunk Prefill. A single forward pass involves two computational parts, with outputs concatenated:
- Context Query: Bidirectional attention between current Q_curr and existing KV Cache, identical to Chunk Prefill, ensuring the current block attends to historical information.
- Intra-Block Query: Computation between current Q_curr and its own KV.
  - Block Diffusion uses bidirectional attention.
  - Chunk Prefill uses a causal mask.
Visualizing Q_curr attention mask:
- Chunk Prefill (causal) presents a trapezoidal/triangular mask.
- Block Diffusion (bidirectional) presents a rectangular mask.
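The two mask shapes can be reproduced with a small pure-Python sketch. The function names and the 0/1 list-of-lists representation are illustrative choices, not SGLang code; n_ctx is the number of tokens already in the KV Cache and n_cur is the length of Q_curr.

```python
from typing import List


def token_causal_mask(n_ctx: int, n_cur: int) -> List[List[int]]:
    """Chunk Prefill: query i sees all cached tokens plus current tokens <= i."""
    return [[1] * n_ctx + [1 if j <= i else 0 for j in range(n_cur)]
            for i in range(n_cur)]


def block_diffusion_mask(n_ctx: int, n_cur: int) -> List[List[int]]:
    """Block Diffusion: query i sees all cached tokens plus the whole current block."""
    return [[1] * (n_ctx + n_cur) for _ in range(n_cur)]


def show(mask: List[List[int]]) -> None:
    """Print one row per query; '#' = may attend, '.' = masked out."""
    for row in mask:
        print("".join("#" if v else "." for v in row))


show(token_causal_mask(n_ctx=3, n_cur=4))   # trapezoid: rows grow by one '#'
print()
show(block_diffusion_mask(n_ctx=3, n_cur=4))  # rectangle: every row is full
```

The first three columns (the Context Query part) are identical in both masks; only the last n_cur columns (the Intra-Block Query part) differ.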
Streaming Output Demo
The following animation compares streaming output between LLaDA2.0-flash-CAP (100B / BF16) and gpt-oss-120B (117B / MXFP4). LLaDA2.0-flash-CAP uses SGLang dLLM TP8 on 8×H20, while gpt-oss-120B uses standard AR process on the same hardware.
Task: implementing quicksort in 10 programming languages, a scenario where diffusion LLMs excel. As shown, LLaDA2.0-flash-CAP achieves a throughput of 935 tokens/s, far exceeding gpt-oss-120B's 263 tokens/s.
SGLang dLLM supports streaming output like AR models, but outputs in block units (e.g., 32 tokens).
Usage
Launch Command Example
# --model-path: example Hugging Face or local path
# --dllm-algorithm-config: optional; if omitted, the algorithm's default config is used
python3 -m sglang.launch_server \
  --model-path inclusionAI/LLaDA2.0-mini \
  --dllm-algorithm LowConfidence \
  --dllm-algorithm-config ./config.yaml \
  --host 0.0.0.0 \
  --port 30000
Note: Use --dllm-algorithm-config for advanced configuration of the selected --dllm-algorithm. This feature decouples configuration from code, making it easier to pass parameters uniformly to user-customized algorithms.
Client Code Example
Like other supported models, dLLMs can be used via REST API or offline engine API.
Curl generation request example:
curl -X POST "http://127.0.0.1:30000/generate" \
-H "Content-Type: application/json" \
-d '{
"text": [
"<role>SYSTEM</role>detailed thinking off<|role_end|><role>HUMAN</role>Write the number from 1 to 128<|role_end|><role>ASSISTANT</role>",
"<role>SYSTEM<"