EPD Disaggregation in SGLang: Elastic Encoder Scaling for Vision-Language Models

TL;DR

SGLang launches Encoder-Prefill-Decode (EPD) disaggregation architecture, separating vision encoding from language processing in Vision Language Models (VLMs), achieving the following advantages:

  • Independent vision encoding capacity scaling: Encoder servers can scale horizontally without affecting language model deployment, optimizing resource utilization for vision-intensive workloads.
  • Compatible with existing PD disaggregation: EPD combines with Prefill-Decode disaggregation to form a complete three-tier architecture.
  • Flexible transport backends: Supports multiple transport mechanisms including ZMQ and Mooncake, adapting to different deployment scenarios.
  • Vision embedding caching: Frequently used images can be cached on encoder servers, avoiding repeated ViT computations and reducing network transfer overhead.

EPD shows significant effectiveness in image-intensive scenarios (e.g., multi-image inputs) where vision encoding is the main bottleneck. Through EPD, request TTFT is dramatically reduced under load, with latency approximately 6-8x lower compared to colocated solutions (at 1 QPS). In image-sparse scenarios, additional network latency may lead to higher TTFT.

Introduction

Vision Language Models (VLMs) such as Qwen2.5-VL and Llama-Vision integrate visual understanding with language generation but face unique scaling challenges:

  • Heterogeneous computational requirements: Vision encoding (CNN/ViT) and language decoding (Transformer) have different computational patterns.
  • Imbalanced resource usage: Vision processing is compute-intensive but only needed during the prefill phase.
  • Lack of flexibility: Traditional monolithic deployment cannot independently scale vision and language components.
  • Intra-request parallelism: Different images within the same request can be encoded independently.
  • Poor tensor-parallel scaling: Vision encoder parameters are far smaller than language components, making tensor parallelism inefficient and unnecessary.

SGLang's existing Prefill-Decode (PD) disaggregation has already separated prefill and decode stages. EPD further separates vision encoding from language prefill, forming a three-tier architecture.

The ViT Scaling Problem: Why Tensor Parallelism Isn't Always Effective

Counterintuitive Finding

A key insight of EPD is that Vision Transformers (ViT) do not benefit from increased tensor parallelism (TP), and higher TP may even be slower:

Qwen2.5-VL-72B benchmark on H20 (4 images per request):

TPAverage ViT Time
2492.13ms
4465.80ms
8523.80ms

Reasons:

  1. Communication overhead dominates execution time.
  2. Vision model weight parameters are typically small.

EPD circumvents this issue by horizontally scaling encoders rather than increasing TP.

Architecture Overview

EPD architecture request flow:

  1. Client Request: Multimodal requests arrive at the prefill server (via load balancer or direct connection).
  2. Image Distribution: The prefill server identifies image inputs and distributes them to one or more encoder servers. Images can be split for load balancing.
  3. Vision Encoding: Encoder servers process images through ViT, generating vision embeddings and image grid metadata. Results are cached if enabled.
  4. Embedding Transport: Vision embeddings are transmitted back to the prefill server via configured transport backend (ZMQ, Mooncake, etc.).
  5. LLM Computation: The prefill server combines vision embeddings with text tokens, forming mm_inputs containing precomputed tensors. The LLM executes Prefill and Decode. If PD is enabled, existing transport logic is reused; otherwise, decoding occurs locally.

Key Components

EPD Workflow

EPD Architecture

Encoder Server (--encoder-only)
- Vision only (no language weights); preprocessing + ViT forward to generate vision embeddings
- Supports prefix multimodal caching
- Horizontally scales for load balancing and multi-image parallel split inference

Prefill Server (--language-only)
- Language model only
- Receives encoder embeddings
- If PD enabled: sends KV to Decode; otherwise decodes locally

Decode Server
- Standard decode-only instance
- Receives KV cache from prefill

Implementation Details

Image Distribution Strategy

Unlike tensor parallelism which splits a single model, EPD uses data parallelism: running multiple independent encoder instances and distributing images.

Example (7 images, 3 encoders):

Request with 7 images: [img0, img1, img2, img3, img4, img5, img6]
3 encoders available

Distribution (after shuffle):
├─ Encoder 0: [img0, img1, img2] (3 images)
├─ Encoder 1: [img3, img4] (2 images)
└─ Encoder 2: [img5, img6] (2 images)

Transport Backends

EPD supports three vision embedding transport backends:

  • zmq_to_scheduler (default): Direct ZMQ socket communication, sends embeddings to scheduler via RDMA transport engine without blocking.
  • zmq_to_tokenizer: Embeddings sent to tokenizer manager, processed during tokenization stage.
  • mooncake: Multi-node RDMA transport, registers embeddings in shared memory for high bandwidth and low latency.

Vision Embedding Caching

Encoders support prefix multimodal caching to avoid repeated ViT computations:

  • Eliminates redundant vision encoding
  • Reduces latency for repeated images
  • Configurable cache size (default 4GB, via SGLANG_VLM_CACHE_SIZE_MB)

Usage Examples

Start encoder instances:

MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30002

CUDA_VISIBLE_DEVICES=2 taskset -c $1 python -m sglang.launch_server \
    --model-path $MODEL \
    --encoder-only \
    --enable-prefix-mm-cache \
    --port $PORT

Start prefill instance:

MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30000
TP=1
MEM_FRACTION=0.5
CHUNK_SIZE=8192

SGLANG_VLM_CACHE_SIZE_MB=0 CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
    --model-path $MODEL \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend nixl \
    --tp $TP \
    --mem-fraction-static $MEM_FRACTION \
    --disable-radix-cache \
    --chunked-prefill-size $CHUNK_SIZE \
    --language-only \
    --encoder-urls http://127.0.0.1:30002 http://127.0.0.1:30003 http://127.0.0.1:30004 http://127.0.0.1:30005 http://127.0.0.1:30006 http://127.0.0.1:30007 \
    --port $PORT

Start decode instance:

MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30001
TP=1

CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server \
    --model-path $MODEL \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend nixl \
    --tp $TP \
    --port $PORT

Start minlb:

python -m sglang_router.launch_router \
  --pd-disaggregation \
  --mini-lb \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30001 \
  --port 8000

Benchmarking

EPD targets vision-intensive workloads (multi-image requests), improving TTFT through horizontal encoder scaling.

Benchmark script:

python -m sglang.bench_serving \
    --random-image-count \
    --model ${MODEL_PATH} \
    --num-prompts 64 \
    --dataset-name image \
    --random-input-len 128 \
    --random-output-len 256 \
    --image-count 8 \
    --image-resolution 1080p \
    --host $HOST_IP \
    --port $port \
    --backend vllm-chat \
    --request-rate $request_rate

Experimental Setup

Environment: 8× H20 96GB GPU

Model: Qwen3-VL-235B-A22B-Instruct-FP8

Dataset: Random multimodal dataset
- Text tokens: 128 / 256
- Images per request: 1-8 (random, average ~4)
- Image resolution: 1080p
- QPS range: 0.2-1.0

Deployment Configurations:
- Colocate: 1 PD instance, tensor-parallel-size=4, using 4× H20
- 1E1P: 1 encoder (TP=1) + 1 PD (TP=4), using 5× H20
- 2E1P: 2 encoders (TP=1 each) + 1 PD (TP=4), using 6× H20

Test Results

Average TTFT (EPD vs colocate):

TTFT Results

Average TPOT (EPD vs colocate):

TPOT Results

Request throughput (EPD vs colocate):

Throughput Results

Key Findings (vs. colocate):

  • Under load, encoder/prefill maintains TTFT far lower than colocate (≈6-8x lower at 1 QPS).
  • TPOT is much lower than colocate (≈8-10x lower), with more compact latency.
  • Throughput approximately doubles at high QPS (≈2x at 0.8-1.0 QPS).
  • By allocating encoders with dedicated GPU resources, TTFT is dramatically reduced. Although 2E1P uses 50% more GPUs (6× vs 4×), it achieves higher resource utilization.