TL;DR
SGLang introduces a highly optimized Pipeline Parallelism (PP) implementation designed for ultra-long-context inference. By combining Chunked Pipeline Parallelism, Asynchronous P2P Communication, and a simple yet effective Dynamic Chunking mechanism, this PP design achieves industry-leading performance while remaining compatible with other parallelism strategies, PD Disaggregation, and HiCache. On multi-node H20 clusters with a PP4 TP8 configuration (chunked prefill size set to 12K), DeepSeek-V3.1's prefill throughput improves by 3.31x over TP8, which is 30.5% higher than the TP32 solution (2.54x) and highlights PP's architectural advantage when scaling across nodes. The implementation also reduces TTFT by up to 67.9% and achieves a strong scaling efficiency of 82.8%, providing an efficient open-source path for serving trillion-parameter models with ultra-long contexts.

DeepSeek-V3.1 Prefill Throughput on H20 (Batch Size = 1, higher is better)
Note: DCK 12288 (σ=0.65) indicates Dynamic Chunking enabled with initial chunked prefill size of 12K and smoothing factor of 0.65.
Introduction
As Large Language Models (LLMs) scale toward trillion-parameter architectures and "infinite" context windows, serving infrastructure needs to shift toward more fine-grained cross-node parallelism strategies. KV cache technology can reduce redundant computation, but it cannot fix the high Time to First Token (TTFT) caused by the ultra-long Input Token Length (ITL) of the initial prompt. Tensor Parallelism (TP) suits intra-node scaling but often hits communication bottlenecks in multi-node deployments. Traditional Pipeline Parallelism (PP) reduces communication volume but suffers from resource underutilization and bubble overhead when handling massive prompts.
SGLang draws on open-source innovations and academic research to introduce an optimized PP implementation with asynchronous communication and dynamic chunked prefilling, which together minimize pipeline bubbles. It restructures ultra-long prompt processing into a high-throughput, computationally scalable streaming workflow. Benchmarks show that this PP implementation maintains over 80% scaling efficiency when scaling to PP4. When Qwen3-235B-A22B-FP8 is deployed with PP8 on H20, TTFT for ultra-long prompts is reduced by 81%.
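As a quick sanity check on the scaling-efficiency figures, the sketch below applies the usual definition of strong scaling efficiency (speedup divided by the increase in resources) to the TL;DR numbers. It assumes the reported 82.8% was derived this way from the 3.31x speedup of PP4 TP8 over TP8, that is, 4x the GPUs; the function name is illustrative and not part of SGLang.

```python
def strong_scaling_efficiency(speedup: float, resource_multiple: float) -> float:
    """Fraction of ideal linear speedup that was actually achieved."""
    if resource_multiple <= 0:
        raise ValueError("resource_multiple must be positive")
    return speedup / resource_multiple


# TL;DR figures: PP4 TP8 uses 4x the GPUs of TP8 and reaches 3.31x throughput.
pp4_efficiency = strong_scaling_efficiency(speedup=3.31, resource_multiple=4)
print(f"PP4 strong scaling efficiency: {pp4_efficiency * 100:.2f}%")
```

Under this assumption the result is 3.31 / 4 ≈ 0.8275, which is consistent with the reported 82.8% after rounding.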
Background: Why Choose Pipeline Parallelism?
To validate the necessity of PP in long-context prefilling, we compare it with Tensor Parallelism (TP) and Context Parallelism (CP). A theoretical and empirical analysis of communication volume, bubble ratio, and implementation complexity shows that PP occupies a uniquely favorable position for multi-node scaling.
1. Communication Volume and Scalability Analysis
The main bottleneck in scaling distributed inference is inter-device communication. As model depth and sequence length increase, data transfer volume becomes the limiting factor, especially in large-scale multi-node deployments.
Assume B is the batch size (often 1 for ultra-long contexts), S is the total sequence length, H is the hidden state dimension, L is the total number of layers, M is the number of micro-batches, and activations are stored in FP8 (1 byte per element). The communication volume of each strategy is:
- TP: Splits weight tensors within a single layer, requiring synchronization after the Attention block and the MLP block. All-Reduce communication grows linearly with the number of layers and is bandwidth-bound.
  Communication Volume(TP) ≈ 4 · B · S · H · L · bytes
  (Ring All-Reduce operates on 2x the data per operation, with 2 All-Reduces per layer.)
- CP: Each layer requires an All-Gather to aggregate KV states, which incurs high latency in bandwidth-limited environments.
  Communication Volume(CP) ≈ 2 · B · S · H_KV · L · bytes
  (Ring-Attention scheme; H_KV is smaller with GQA.)
- PP: Only transmits at pipeline stage boundaries, using P2P instead of collective operations. Communication frequency is determined by the stage count P (P ≪ L).
  Communication Volume(PP) = B · S · H · (P-1) · bytes
  (Communication volume is reduced by nearly an order of magnitude in multi-node settings.)
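To make the three formulas concrete, here is a minimal Python sketch that evaluates them side by side. The function names and every numeric value (sequence length, hidden sizes, layer count, stage count) are illustrative assumptions, not measurements or settings taken from this article.

```python
def tp_comm_bytes(B, S, H, L, elem_bytes=1):
    """TP: 2 All-Reduces per layer, each moving ~2x the activation size."""
    return 4 * B * S * H * L * elem_bytes


def cp_comm_bytes(B, S, H_kv, L, elem_bytes=1):
    """CP: per-layer KV aggregation (Ring-Attention style estimate)."""
    return 2 * B * S * H_kv * L * elem_bytes


def pp_comm_bytes(B, S, H, P, elem_bytes=1):
    """PP: activations cross each of the P-1 stage boundaries once."""
    return B * S * H * (P - 1) * elem_bytes


# Hypothetical configuration: one 128K-token prompt, FP8 activations.
B, S, H, H_kv, L, P = 1, 128 * 1024, 7168, 1024, 61, 4

volumes = {
    "TP": tp_comm_bytes(B, S, H, L),
    "CP": cp_comm_bytes(B, S, H_kv, L),
    "PP": pp_comm_bytes(B, S, H, P),
}
for name, v in volumes.items():
    print(f"{name}: {v / 2**30:.2f} GiB")
```

With these assumed values, the TP-to-PP ratio is simply 4L / (P-1), independent of B, S, and H, which is why the gap widens with model depth as long as the stage count stays small.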
2. Bubble Ratio Trade-offs
While PP optimizes communication, it introduces pipeline bubbles (device idle waiting). TP/CP have theoretical bubble rates of 0, with all devices computing in parallel.
PP bubble ratio: Bubble Ratio = (P - 1) / (P - 1 + M). In long-context prefilling, where M ≫ P, the ratio is small and the communication savings far outweigh the bubble cost. The Performance Impact section will evaluate Strong Scaling Efficiency.
A pure high-degree PP configuration is not recommended, because bubbles increase with P. PP should instead be combined with bubble-free TP/CP inside each node, where high-bandwidth NVLink is available.
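The bubble-ratio formula can be explored with a short Python sketch. It assumes that each prefill chunk acts as one micro-batch, so the micro-batch count is the prompt length divided by the chunk size, rounded up; that mapping, the helper names, and the prompt lengths are illustrative assumptions. The 12288-token chunk size mirrors the 12K setting mentioned earlier.

```python
import math


def bubble_ratio(num_stages: int, num_micro_batches: int) -> float:
    """Bubble Ratio = (P - 1) / (P - 1 + M)."""
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("num_stages and num_micro_batches must be >= 1")
    return (num_stages - 1) / (num_stages - 1 + num_micro_batches)


def num_chunks(prompt_tokens: int, chunk_tokens: int) -> int:
    """Chunks needed to cover the prompt (assumed: one chunk per micro-batch)."""
    return math.ceil(prompt_tokens / chunk_tokens)


CHUNK_TOKENS = 12288  # the 12K chunked prefill size
for prompt_tokens in (128 * 1024, 1024 * 1024):  # hypothetical prompt lengths
    m = num_chunks(prompt_tokens, CHUNK_TOKENS)
    for p in (2, 4, 8):
        ratio = bubble_ratio(p, m)
        print(f"prompt={prompt_tokens:>8} M={m:>3} P={p}: bubble={ratio:.1%}")
```

For a fixed prompt the ratio grows with P, and for a fixed P it shrinks as the prompt (and therefore M) grows, which matches the advice to keep the PP degree moderate and rely on TP/CP inside each node.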
3. Implementation Complexity and Architecture Generality
Open-source systems value simple implementation and generality.
- TP: Easy to implement and widely supported, but large-scale TP is incompatible with quantization (MoE FFN weights), limiting multi-node deployment.
- CP: Complex, requiring invasive modifications to attention mechanisms (e.g., Ring Attention).
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original article when reposting.