TL;DR
SGLang introduces an Encoder-Prefill-Decode (EPD) disaggregation architecture that separates vision encoding from language processing in Vision Language Models (VLMs). It offers the following advantages:
- Independent vision encoding capacity scaling: Encoder servers can scale horizontally without affecting language model deployment, optimizing resource utilization for vision-intensive workloads.
- Compatible with existing PD disaggregation: EPD combines with Prefill-Decode disaggregation to form a complete three-tier architecture.
- Flexible transport backends: Supports multiple transport mechanisms including ZMQ and Mooncake, adapting to different deployment scenarios.
- Vision embedding caching: Frequently used images can be cached on encoder servers, avoiding repeated ViT computations and reducing network transfer overhead.
EPD is most effective in image-intensive scenarios (e.g., multi-image inputs) where vision encoding is the main bottleneck. Under load, TTFT is approximately 6-8x lower than with a colocated deployment (at 1 QPS). In image-sparse scenarios, the additional network hop may lead to higher TTFT.
Introduction
Vision Language Models (VLMs) such as Qwen2.5-VL and Llama-Vision integrate visual understanding with language generation but face unique scaling challenges:
- Heterogeneous computational requirements: Vision encoding (CNN/ViT) and language decoding (Transformer) have different computational patterns.
- Imbalanced resource usage: Vision processing is compute-intensive but only needed during the prefill phase.
- Lack of flexibility: Traditional monolithic deployment cannot independently scale vision and language components.
- Intra-request parallelism: Different images within the same request can be encoded independently.
- Poor tensor-parallel scaling: Vision encoder parameters are far smaller than language components, making tensor parallelism inefficient and unnecessary.
SGLang's existing Prefill-Decode (PD) disaggregation has already separated prefill and decode stages. EPD further separates vision encoding from language prefill, forming a three-tier architecture.
The ViT Scaling Problem: Why Tensor Parallelism Isn't Always Effective
Counterintuitive Finding
A key insight of EPD is that Vision Transformers (ViT) gain little from increased tensor parallelism (TP), and a higher TP degree can even be slower:
Qwen2.5-VL-72B benchmark on H20 (4 images per request):
| TP | Average ViT Time |
|---|---|
| 2 | 492.13ms |
| 4 | 465.80ms |
| 8 | 523.80ms |
Reasons:
- Communication overhead dominates execution time.
- Vision model weight parameters are typically small.
EPD circumvents this issue by horizontally scaling encoders rather than increasing TP.
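To make the table concrete, the short calculation below expresses each measurement as a percentage change relative to TP=2. It uses only the three numbers reported above; the helper name `relative_change` is an illustrative choice, not part of SGLang.

```python
# Average ViT times (ms) from the table above, keyed by tensor-parallel degree.
vit_ms = {2: 492.13, 4: 465.80, 8: 523.80}


def relative_change(baseline_ms: float, other_ms: float) -> float:
    """Signed percentage change of other_ms relative to baseline_ms."""
    return (other_ms - baseline_ms) / baseline_ms * 100.0


tp4_vs_tp2 = relative_change(vit_ms[2], vit_ms[4])  # negative means faster
tp8_vs_tp2 = relative_change(vit_ms[2], vit_ms[8])  # positive means slower

print(f"TP=4 vs TP=2: {tp4_vs_tp2:+.1f}%")
print(f"TP=8 vs TP=2: {tp8_vs_tp2:+.1f}%")
```

Doubling the GPU count from 2 to 4 buys roughly a 5% reduction in ViT time, and going to 8 GPUs is about 6% slower than TP=2, far from the near-linear speedup one would hope for.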
Architecture Overview
EPD architecture request flow:
- Client Request: Multimodal requests arrive at the prefill server (via load balancer or direct connection).
- Image Distribution: The prefill server identifies image inputs and distributes them to one or more encoder servers. The images of a single request can be split across encoders for load balancing.
- Vision Encoding: Encoder servers process images through ViT, generating vision embeddings and image grid metadata. Results are cached if enabled.
- Embedding Transport: Vision embeddings are transmitted back to the prefill server via configured transport backend (ZMQ, Mooncake, etc.).
- LLM Computation: The prefill server combines vision embeddings with text tokens, forming mm_inputs containing precomputed tensors. The LLM executes Prefill and Decode. If PD is enabled, existing transport logic is reused; otherwise, decoding occurs locally.
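The fan-out and gather steps in this flow can be sketched in a few lines. This is a toy model under stated assumptions, not SGLang's implementation: `encode_batch` is a hypothetical stand-in for an encoder server, the "embedding" is a dummy value, and round-robin assignment is used purely for illustration. The point it shows is that images are tagged with their position so the prefill side can restore request order regardless of which encoder finishes first.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_batch(batch):
    """Stand-in for one encoder server: returns (index, embedding) pairs.

    The embedding is a dummy one-element list; a real encoder would run the ViT.
    """
    return [(idx, [float(len(img))]) for idx, img in batch]


def gather_embeddings(images, num_encoders):
    """Fan images out to encoder stubs and return embeddings in request order."""
    # Tag each image with its position so order survives the fan-out.
    batches = [[] for _ in range(num_encoders)]
    for idx, img in enumerate(images):
        batches[idx % num_encoders].append((idx, img))

    # Each batch is processed concurrently, one worker per encoder stub.
    with ThreadPoolExecutor(max_workers=num_encoders) as pool:
        results = list(pool.map(encode_batch, batches))

    # Flatten and restore the original image order before prefill.
    tagged = [pair for batch in results for pair in batch]
    tagged.sort(key=lambda pair: pair[0])
    return [emb for _, emb in tagged]


embeddings = gather_embeddings(["a", "bb", "ccc", "dddd"], num_encoders=2)
print(embeddings)
```

In a real deployment the gathered embeddings would then be merged with the text tokens into mm_inputs before the LLM prefill runs.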
Key Components

Encoder Server (--encoder-only)
- Vision only (no language weights); preprocessing + ViT forward to generate vision embeddings
- Supports prefix multimodal caching
- Horizontally scales for load balancing and multi-image parallel split inference
Prefill Server (--language-only)
- Language model only
- Receives encoder embeddings
- If PD enabled: sends KV to Decode; otherwise decodes locally
Decode Server
- Standard decode-only instance
- Receives KV cache from prefill
Implementation Details
Image Distribution Strategy
Unlike tensor parallelism which splits a single model, EPD uses data parallelism: running multiple independent encoder instances and distributing images.
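A minimal sketch of such a data-parallel split is shown here, assuming contiguous chunks whose sizes differ by at most one. The function name `split_images` and its `shuffle` and `seed` parameters are hypothetical and only serve the illustration; SGLang's actual assignment logic may differ.

```python
import random


def split_images(images, num_encoders, shuffle=False, seed=None):
    """Split images into num_encoders contiguous chunks of near-equal size."""
    if num_encoders < 1:
        raise ValueError("num_encoders must be at least 1")
    items = list(images)
    if shuffle:
        # Optional shuffle so that no encoder is systematically favored.
        random.Random(seed).shuffle(items)
    # The first `extra` encoders receive one additional image each.
    base, extra = divmod(len(items), num_encoders)
    chunks, start = [], 0
    for i in range(num_encoders):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


chunks = split_images([f"img{i}" for i in range(7)], num_encoders=3)
for encoder_id, chunk in enumerate(chunks):
    print(f"Encoder {encoder_id}: {chunk} ({len(chunk)} images)")
```

With 7 images and 3 encoders this yields chunks of 3, 2, and 2 images, matching the worked example that follows.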
Example (7 images, 3 encoders):
Request with 7 images: [img0, img1, img2, img3, img4, img5, img6]
3 encoders available
Distribution (after shuffle):
├─ Encoder 0: [img0, img1, img2] (3 images)
├─ Encoder 1: [img3, img4] (2 images)
└─ Encoder 2: [img5, img6] (2 images)
Transport Backends
EPD supports three vision embedding transport backends:
- zmq_to_scheduler (default): Direct ZMQ socket communication, sends embeddings to scheduler via RDMA transport engine without blocking.
- zmq_to_tokenizer: Embeddings sent to tokenizer manager, processed during tokenization stage.
- mooncake: Multi-node RDMA transport, registers embeddings in shared memory for high bandwidth and low latency.
Vision Embedding Caching
Encoders support prefix multimodal caching to avoid repeated ViT computations:
- Eliminates redundant vision encoding
- Reduces latency for repeated images
- Configurable cache size (default 4GB, via SGLANG_VLM_CACHE_SIZE_MB)
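The idea behind such a cache can be illustrated with a byte-bounded LRU map keyed by a hash of the image content. Everything in this sketch is an assumption made for illustration: the class `EmbeddingCache`, the SHA-256 key, and the LRU eviction policy are not taken from SGLang, whose cache may key and evict differently. In practice the byte budget would come from a setting such as SGLANG_VLM_CACHE_SIZE_MB.

```python
import hashlib
from collections import OrderedDict


class EmbeddingCache:
    """Byte-bounded LRU cache keyed by a hash of the raw image bytes (hypothetical)."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> serialized embedding bytes

    @staticmethod
    def key_for(image_bytes):
        return hashlib.sha256(image_bytes).hexdigest()

    def get(self, image_bytes):
        key = self.key_for(image_bytes)
        if key not in self._entries:
            return None  # miss: the caller must run the ViT
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, image_bytes, embedding):
        key = self.key_for(image_bytes)
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        if len(embedding) > self.capacity_bytes:
            return  # too large to cache at all
        self._entries[key] = embedding
        self.used_bytes += len(embedding)
        # Evict least recently used entries until the budget is respected.
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)


cache = EmbeddingCache(capacity_bytes=8)
cache.put(b"cat", b"AAAA")
cache.put(b"dog", b"BBBB")
hit = cache.get(b"cat")      # touching "cat" makes "dog" the eviction candidate
cache.put(b"bird", b"CCCC")  # exceeds the 8-byte budget, so "dog" is evicted
```

A repeated image then costs one hash and one lookup instead of a ViT forward pass, which is where the latency saving for repeated images comes from.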
Usage Examples
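Start encoder instances (repeat for each encoder with its own GPU and port; the prefill command further down lists encoders on ports 30002-30007, and $1 is the CPU core list passed to taskset):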
MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30002
CUDA_VISIBLE_DEVICES=2 taskset -c $1 python -m sglang.launch_server \
--model-path $MODEL \
--encoder-only \
--enable-prefix-mm-cache \
--port $PORT
Start prefill instance:
MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30000
TP=1
MEM_FRACTION=0.5
CHUNK_SIZE=8192
SGLANG_VLM_CACHE_SIZE_MB=0 CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
--model-path $MODEL \
--disaggregation-mode prefill \
--disaggregation-transfer-backend nixl \
--tp $TP \
--mem-fraction-static $MEM_FRACTION \
--disable-radix-cache \
--chunked-prefill-size $CHUNK_SIZE \
--language-only \
--encoder-urls http://127.0.0.1:30002 http://127.0.0.1:30003 http://127.0.0.1:30004 http://127.0.0.1:30005 http://127.0.0.1:30006 http://127.0.0.1:30007 \
--port $PORT
Start decode instance:
MODEL=Qwen/Qwen2.5-VL-7B-Instruct
PORT=30001
TP=1
CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server \
--model-path $MODEL \
--disaggregation-mode decode \
--disaggregation-transfer-backend nixl \
--tp $TP \
--port $PORT
Start the mini load balancer (mini-lb):
python -m sglang_router.launch_router \
--pd-disaggregation \
--mini-lb \
--prefill http://127.0.0.1:30000 \
--decode http://127.0.0.1:30001 \
--port 8000
Benchmarking
EPD targets vision-intensive workloads (multi-image requests), improving TTFT through horizontal encoder scaling.
Benchmark script:
python -m sglang.bench_serving \
--random-image-count \
--model ${MODEL_PATH} \
--num-prompts 64 \
--dataset-name image \
--random-input-len 128 \
--random-output-len 256 \
--image-count 8 \
--image-resolution 1080p \
--host $HOST_IP \
--port $port \
--backend vllm-chat \
--request-rate $request_rate
Experimental Setup
Environment: 8× H20 96GB GPU
Model: Qwen3-VL-235B-A22B-Instruct-FP8
Dataset: Random multimodal dataset
- Text tokens: 128 input / 256 output
- Images per request: 1-8 (random, average ~4)
- Image resolution: 1080p
- QPS range: 0.2-1.0
Deployment Configurations:
- Colocate: 1 PD instance, tensor-parallel-size=4, using 4× H20
- 1E1P: 1 encoder (TP=1) + 1 PD (TP=4), using 5× H20
- 2E1P: 2 encoders (TP=1 each) + 1 PD (TP=4), using 6× H20
Test Results
Average TTFT (EPD vs colocate):

Average TPOT (EPD vs colocate):

Request throughput (EPD vs colocate):

Key Findings (vs. colocate):
- Under load, EPD keeps TTFT far lower than colocate (≈6-8x lower at 1 QPS).
- TPOT is much lower than colocate (≈8-10x lower), with a tighter latency distribution.
- Throughput approximately doubles at high QPS (≈2x at 0.8-1.0 QPS).
- By allocating encoders with dedicated GPU resources, TTFT is dramatically reduced. Although 2E1P uses 50% more GPUs (6× vs 4×), it achieves higher resource utilization.
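One way to read the utilization claim is to normalize throughput by GPU count. The arithmetic below uses only figures stated above (4 GPUs for colocate, 6 for 2E1P, and the approximate 2x throughput gain at 0.8-1.0 QPS); treating per-GPU throughput as the measure of utilization is an assumption of this sketch, not a metric the benchmark reports.

```python
# GPU counts from the deployment configurations above.
gpus = {"colocate": 4, "2E1P": 6}

# Approximate throughput gain at 0.8-1.0 QPS from the findings (about 2x).
throughput_gain = 2.0

gpu_ratio = gpus["2E1P"] / gpus["colocate"]  # 1.5, i.e. 50% more GPUs
per_gpu_gain = throughput_gain / gpu_ratio   # throughput gain per GPU

print(f"GPU ratio: {gpu_ratio:.2f}x, per-GPU throughput gain: {per_gpu_gain:.2f}x")
```

Under these numbers, 2E1P delivers roughly 1.33x the throughput per GPU of the colocated setup at high QPS, so the extra hardware more than pays for itself.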