EPD Disaggregation in SGLang: Elastic Encoder Scaling for Vision-Language Models

Feb 4, 2026 986 Views - Read Source LMSYS

LMSYS SGLang EPD VLMs 模型解耦性能优化

TL;DR

SGLang launches Encoder-Prefill-Decode (EPD) disaggregation architecture, separating vision encoding from language processing in Vision Language Models (VLMs), achieving the following advantages:

Independent vision encoding capacity scaling: Encoder servers can scale horizontally without affecting language model deployment, optimizing resource utilization for vision-intensive workloads.
Compatible with existing PD disaggregation: EPD combines with Prefill-Decode disaggregation to form a complete three-tier architecture.
Flexible transport backends: Supports multiple transport mechanisms including ZMQ and Mooncake, adapting to different deployment scenarios.
Vision embedding caching: Frequently used images can be cached on encoder servers, avoiding repeated ViT computations and reducing network transfer overhead.

EPD shows significant effectiveness in image-intensive scenarios (e.g., multi-image inputs) where vision encoding is the main bottleneck. Through EPD, request TTFT is dramatically reduced under load, with latency approximately 6-8x lower compared to colocated solutions (at 1 QPS). In image-sparse scenarios, additional network latency may lead to higher TTFT.

Introduction

Vision Language Models (VLMs) such as Qwen2.5-VL and Llama-Vision integrate visual understanding with language generation but face unique scaling challenges:

Heterogeneous computational requirements: Vision encoding (CNN/ViT) and language decoding (Transformer) have different computational patterns.
Imbalanced resource usage: Vision processing is compute-intensive but only needed during the prefill phase.
Lack of flexibility: Traditional monolithic deployment cannot independently scale vision and language components.
Intra-request parallelism: Different images within the same request can be encoded independently.
Poor tensor-parallel scaling: Vision encoder parameters are far smaller than language components, making tensor parallelism inefficient and unnecessary.

SGLang's existing Prefill-Decode (PD) disaggregation has already separated prefill and decode stages. EPD further separates vision encoding from language prefill, forming a three-tier architecture.

The ViT Scaling Problem: Why Tensor Parallelism Isn't Always Effective

Counterintuitive Finding

A key insight of EPD is that Vision Transformers (ViT) do not benefit from increased tensor parallelism (TP), and higher TP may even be slower:

Qwen2.5-VL-72B benchmark on H20 (4 images per request):

TP	Average ViT Time
2	492.13ms
4	465.80ms
8	523.80ms

Reasons:

Communication overhead dominates execution time.
Vision model weight parameters are typically small.

EPD circumvents this issue by horizontally scaling encoders rather than increasing TP.

Architecture Overview

EPD architecture request flow:

Client Request: Multimodal requests arrive at the prefill server (via load balancer or direct connection).
Image Distribution: The prefill server identifies image inputs and distributes them to one or more encoder servers. Images can be split for load balancing.
Vision Encoding: Encoder servers process images through ViT, generating vision embeddings and image grid metadata. Results are cached if enabled.
Embedding Transport: Vision embeddings are transmitted back to the prefill server via configured transport backend (ZMQ, Mooncake, etc.).
LLM Computation: The prefill server combines vision embeddings with text tokens, forming mm_inputs containing precomputed tensors. The LLM executes Prefill and Decode. If PD is enabled, existing transport logic is reused; otherwise, decoding occurs locally.

Key Components

EPD Workflow

EPD Architecture

Encoder Server (--encoder-only)
- Vision only (no language weights); preprocessing + ViT forward to generate vision embeddings
- Supports prefix multimodal caching
- Horizontally scales for load balancing and multi-image parallel split inference

Prefill Server (--language-only)
- Language model only
- Receives encoder embeddings
- If PD enabled: sends KV to Decode; otherwise decodes locally

Decode Server
- Standard decode-only instance
- Receives KV cache from prefill

Implementation Details

Image Distribution Strategy

Unlike tensor parallelism which splits a single model, EPD uses data parallelism: running multiple independent encoder instances and distributing images.

Example (7 images, 3 encoders):

Request with 7 images: [img0, img1, img2, img3, img4, img5, img6]
3 encoders available

Distribution (after shuffle):
├─ Encoder 0: [img0, img1, img2] (3 images)
├─ Encoder 1: [img3, img4] (2 images)
└─ Encoder 2: [img5, img6] (2 images)

Transport Backends

EPD supports three vision embedding transport backends:

zmq_to_scheduler (default): Direct ZMQ socket communication, sends embeddings to scheduler via RDMA transport engine without blocking.
zmq_to_tokenizer: Embeddings sent to tokenizer manager, processed during tokenization stage.
mooncake: Multi-node RDMA transport, registers embeddings in shared memory for high bandwidth and low latency.

Vision Embedding Caching

Encoders support prefix multimodal caching to avoid repeated ViT computations:

Eliminates redundant vision encoding
Reduces latency for repeated images
Configurable cache size (default 4GB, via SGLANG_VLM_CACHE_SIZE_MB)

Usage Examples

Start encoder instances:

MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30002

CUDA_VISIBLE_DEVICES=2 taskset -c $1 python -m sglang.launch_server \
    --model-path $MODEL \
    --encoder-only \
    --enable-prefix-mm-cache \
    --port $PORT

Start prefill instance:

MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30000
TP=1
MEM_FRACTION=0.5
CHUNK_SIZE=8192

SGLANG_VLM_CACHE_SIZE_MB=0 CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
    --model-path $MODEL \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend nixl \
    --tp $TP \
    --mem-fraction-static $MEM_FRACTION \
    --disable-radix-cache \
    --chunked-prefill-size $CHUNK_SIZE \
    --language-only \
    --encoder-urls http://127.0.0.1:30002 http://127.0.0.1:30003 http://127.0.0.1:30004 http://127.0.0.1:30005 http://127.0.0.1:30006 http://127.0.0.1:30007 \
    --port $PORT

Start decode instance:

MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30001
TP=1

CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server \
    --model-path $MODEL \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend nixl \
    --tp $TP \
    --port $PORT

Start minlb:

python -m sglang_router.launch_router \
  --pd-disaggregation \
  --mini-lb \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30001 \
  --port 8000

Benchmarking

EPD targets vision-intensive workloads (multi-image requests), improving TTFT through horizontal encoder scaling.

Benchmark script:

python -m sglang.bench_serving \
    --random-image-count \
    --model ${MODEL_PATH} \
    --num-prompts 64 \
    --dataset-name image \
    --random-input-len 128 \
    --random-output-len 256 \
    --image-count 8 \
    --image-resolution 1080p \
    --host $HOST_IP \
    --port $port \
    --backend vllm-chat \
    --request-rate $request_rate

Experimental Setup

Environment: 8× H20 96GB GPU

Model: Qwen3-VL-235B-A22B-Instruct-FP8

Dataset: Random multimodal dataset
- Text tokens: 128 / 256
- Images per request: 1-8 (random, average ~4)
- Image resolution: 1080p
- QPS range: 0.2-1.0

Deployment Configurations:
- Colocate: 1 PD instance, tensor-parallel-size=4, using 4× H20
- 1E1P: 1 encoder (TP=1) + 1 PD (TP=4), using 5× H20
- 2E1P: 2 encoders (TP=1 each) + 1 PD (TP=4), using 6× H20

Test Results

Average TTFT (EPD vs colocate):

TTFT Results

Average TPOT (EPD vs colocate):

TPOT Results

Request throughput (EPD vs colocate):

Throughput Results

Key Findings (vs. colocate):

Under load, encoder/prefill maintains TTFT far lower than colocate (≈6-8x lower at 1 QPS).
TPOT is much lower than colocate (≈8-10x lower), with more compact latency.
Throughput approximately doubles at high QPS (≈2x at 0.8-1.0 QPS).
By allocating encoders with dedicated GPU resources, TTFT is dramatically reduced. Although 2E1P uses 50% more GPUs (6× vs 4×), it achieves higher resource utilization.