Partnering with SGLang: Best Practices for Efficiently Deploying DeepSeek-R1 on H20-96G

Introduction

Deploying large-scale Mixture-of-Experts (MoE) models like DeepSeek-R1 requires a delicate balance between latency, throughput, and cost. The challenge is particularly pronounced on performance-asymmetric hardware such as H20 GPUs, which have high memory bandwidth but lower compute throughput. We designed a serving stack that exploits H20's cost advantages while still meeting the strict SLAs usually expected of high-end GPUs.

This article outlines the best practices for achieving this goal, including hardware-aware deployment strategies that deviate from traditional approaches, along with a series of system and kernel-level optimizations:

  • Hardware-aware parallelization: Single-node TP-8 for the prefill phase and small-scale EP-16 for the decode phase, meeting latency targets while reducing failure domains.
  • Kernel-level optimizations: FlashMLA-FP8 and DeepGEMM swapAB to raise effective compute throughput on H20.
  • Scheduling and load balancing: Single-Batch Overlap (SBO) improves small-batch throughput, while an asynchronous Expert Affinity Load Balancer reduces cross-node communication.
  • Lightweight observability: Diagnostic stack designed specifically for distributed MoE serving to quickly identify bottlenecks.

Experiments show that with these strategies, each node achieves 16.5k input tokens/s and 5.7k output tokens/s on 4096-token input sequences. This represents state-of-the-art (SOTA) performance on H20 and is, to our knowledge, the first comprehensive study covering deployment, optimization, and large-scale industrial practice.

H20 Challenges

H20's Importance

H20 GPUs are readily available, enabling Ant Group to build ultra-large-scale clusters. Even modest throughput improvements can result in significant daily cost savings.

H20 vs. H800 Comparison

Specification         H20-96G        H800-80G
FP8 Compute           296 TFLOPS     1979 TFLOPS
FP16/BF16 Compute     148 TFLOPS     989 TFLOPS
Memory Capacity       96 GB          80 GB
Memory Bandwidth      4000 GB/s      3352 GB/s
NVLink Bandwidth      900 GB/s       400 GB/s
RDMA NIC Bandwidth    4 × 400 Gb/s   8 × 400 Gb/s

H20 features larger memory (96 GB), higher memory bandwidth (4000 GB/s), and over 2x the NVLink bandwidth (900 GB/s), but has weaker compute performance and lower RDMA NIC bandwidth. Inference, especially the decode phase, is often memory-bound, which is exactly where H20's memory bandwidth and capacity advantages matter most. We design our optimizations accordingly to maximize inference throughput.
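To make the asymmetry concrete, the sketch below computes a roofline-style ridge point (peak FP8 FLOPs divided by peak memory bandwidth) from the table figures. The function name and the choice of FP8 peak are ours for illustration; this is a back-of-the-envelope calculation, not a measurement.

```python
# Figures copied from the H20 vs. H800 comparison table.
SPECS = {
    "H20-96G": {"fp8_tflops": 296, "mem_bw_gbs": 4000},
    "H800-80G": {"fp8_tflops": 1979, "mem_bw_gbs": 3352},
}


def ridge_point(spec):
    """Peak FP8 FLOPs the GPU can execute per byte read from HBM."""
    return spec["fp8_tflops"] * 1e12 / (spec["mem_bw_gbs"] * 1e9)


for name, spec in SPECS.items():
    print(f"{name}: {ridge_point(spec):.0f} FLOPs per byte")
# H20-96G: 74 FLOPs per byte
# H800-80G: 590 FLOPs per byte
```

A kernel whose arithmetic intensity falls below the ridge point is limited by memory bandwidth rather than compute. On these peak numbers, H20 becomes compute-bound at roughly one eighth of the intensity that H800 does, which is consistent with treating compute-heavy prefill and memory-bound decode differently.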

Solution: Optimizations and Strategies on H20

Deployment Strategy

Deployment Strategy Diagram

Prefill

  • SLA: Prefill is compute-intensive, and multi-node DP+EP increases time-to-first-token (TTFT), violating SLAs. Single-node TP keeps TTFT within targets.
  • Elastic scaling: Prefill needs to scale with KV cache; single-node TP simplifies resource and cache management.

Decode

  • Hardware characteristics: Compared to H800, H20 trades compute for larger memory and higher NVLink bandwidth. This lets decode use the KV cache efficiently while placing MoE communication on high-bandwidth NVLink.
  • Failure domain: Small-scale EP configuration limits the impact of decode or GPU failures. EP high availability (HA) is not yet mature, making small EP more reliable.

Optimizations

Prefill

Prefill Optimization Overview
Observations
  • MLA is more costly than MHA for long sequences.
  • MoE latency is unexpectedly high despite low compute volume.
  • Under TP, the embed/mlp all-reduce + RMSNorm + fused_qkv_a_proj_with_mqa sequence introduces redundant communication and computation.
Solutions
  • MHA/MLA: Introduce a tunable parameter se = extend × (extend + prefix) to select MHA or MLA based on batch size and sequence length.
  • MoE: Optimize b_scale computation, restructure down proj input access with TMA, and tune the kernel configuration based on the real expert distribution.
  • TP optimization: Restructure the sequence as embed/mlp reduce-scatter + RMSNorm + fused_qkv_a_proj_with_mqa + all-gather to reduce computation and communication.
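The MHA/MLA selection rule can be sketched as follows. Only the formula se = extend × (extend + prefix) comes from the text. The threshold value, the summation over requests in a batch, and the direction of the comparison (larger se goes to MHA, following the observation that MLA is costlier for long sequences) are assumptions made for illustration.

```python
from typing import List, Tuple


def se(extend_len: int, prefix_len: int) -> int:
    """se = extend x (extend + prefix), the tunable quantity from the text."""
    return extend_len * (extend_len + prefix_len)


def choose_attention_path(requests: List[Tuple[int, int]], se_threshold: int) -> str:
    """Pick an attention path for a prefill batch.

    `requests` holds (extend_len, prefix_len) pairs. Summing se over the
    batch and sending large totals to MHA are both assumptions.
    """
    total = sum(se(extend, prefix) for extend, prefix in requests)
    return "MHA" if total >= se_threshold else "MLA"


SE_THRESHOLD = 1 << 22  # hypothetical value; in practice tuned per deployment

print(choose_attention_path([(4096, 0)], SE_THRESHOLD))    # MHA
print(choose_attention_path([(128, 1024)], SE_THRESHOLD))  # MLA
```

Because se grows quadratically with the extend length, a single long prompt crosses the threshold quickly, while short extends over a cached prefix stay on the MLA path.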

Decode

Load Balancing
Expert Affinity EPLB
Expert Affinity EPLB Diagram

Standard EPLB only balances intra-GPU load and ignores correlations between experts, so it often scatters frequently co-activated experts across nodes and increases cross-node communication. We extend EPLB to track top-k expert co-activations and build an expert affinity matrix. A post-adjustment step then keeps highly co-activated experts on the same node, improving performance by ~5% over the baseline.
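The idea can be illustrated with a small sketch: count how often pairs of experts appear in the same top-k selection, then greedily swap experts between nodes whenever a swap lowers the total cross-node co-activation weight. All function names here are hypothetical, and the greedy swap is a stand-in for the post-adjustment step, whose actual algorithm the text does not describe. A real implementation would also have to preserve the load balance that EPLB established.

```python
from itertools import combinations


def build_affinity(topk_routes, num_experts):
    """Symmetric matrix: affinity[a][b] = times a and b were co-activated."""
    affinity = [[0] * num_experts for _ in range(num_experts)]
    for experts in topk_routes:
        for a, b in combinations(sorted(set(experts)), 2):
            affinity[a][b] += 1
            affinity[b][a] += 1
    return affinity


def cross_node_weight(affinity, placement):
    """Total co-activation count of expert pairs placed on different nodes."""
    n = len(placement)
    return sum(
        affinity[a][b]
        for a in range(n)
        for b in range(a + 1, n)
        if placement[a] != placement[b]
    )


def post_adjust(affinity, placement, max_rounds=10):
    """Greedy pairwise swaps; per-node expert counts stay unchanged."""
    placement = list(placement)
    n = len(placement)
    for _ in range(max_rounds):
        best_cost, best_swap = cross_node_weight(affinity, placement), None
        for a in range(n):
            for b in range(a + 1, n):
                if placement[a] == placement[b]:
                    continue
                placement[a], placement[b] = placement[b], placement[a]
                cost = cross_node_weight(affinity, placement)
                placement[a], placement[b] = placement[b], placement[a]
                if cost < best_cost:
                    best_cost, best_swap = cost, (a, b)
        if best_swap is None:
            break  # no swap improves the placement further
        a, b = best_swap
        placement[a], placement[b] = placement[b], placement[a]
    return placement


# Toy trace: each entry is the top-k expert set chosen for one token.
routes = [[0, 2], [0, 2], [1, 3], [1, 3], [0, 1]]
affinity = build_affinity(routes, num_experts=4)
initial = [0, 0, 1, 1]  # expert index -> node index
adjusted = post_adjust(affinity, initial)
print(cross_node_weight(affinity, initial), cross_node_weight(affinity, adjusted))
```

In the toy trace, experts 0 and 2 and experts 1 and 3 are frequently co-activated but start on different nodes. After one swap, only the single (0, 1) co-activation still crosses nodes.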

Asynchronous Dynamic Load Adjustment
Asynchronous EPLB Diagram

Static EPLB tightly couples load balancing with inference: expert migration blocks inference and causes delays. We decouple the two so that they run in parallel and use a layered transfer strategy to minimize the impact of migration. This matches or exceeds the performance of static EPLB while maintaining a load balance ratio above 70%.
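The text does not define the load balance ratio. One common reading, assumed here, is the mean per-GPU token load divided by the maximum per-GPU load, so that 1.0 means perfectly even. The sketch below uses that assumed definition with made-up token counts.

```python
def balance_ratio(tokens_per_gpu):
    """Mean load divided by peak load (assumed definition); 1.0 is perfectly even."""
    peak = max(tokens_per_gpu)
    if peak == 0:
        return 1.0  # no traffic at all counts as balanced
    return sum(tokens_per_gpu) / len(tokens_per_gpu) / peak


# Illustrative token counts for four GPUs, not measured data.
print(balance_ratio([100, 100, 100, 100]))  # 1.0
print(balance_ratio([70, 70, 70, 100]))     # 0.775
```

Under this definition, the slowest GPU sets the step time, so a ratio of 0.7 means the average GPU is busy for about 70% of the time the most loaded GPU needs.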

Computation
FP8 MLA

BF16 FlashMLA performs well but leaves room for optimization. We implement end-to-end FP8 attention on Hopper (SM90), using TMA for memory transfer and WGMMA for computation. Two warp groups pipeline QK^T and PV, which reduces shared memory pressure and overlaps compute with memory access. This achieves a ~70% speedup over BF16 and a further ~5% gain over previous FP8 versions.

SwapAB GEMM
SwapAB GEMM Diagram

Hopper WGMMA PTX constraints require N to be a multiple of 8 and fix M at 64, so tiling along M is coarse-grained and wasteful when M is small. We introduce swapAB, which maps the M dimension onto N and enables a finer-grained BLOCK_M (32), improving throughput for the variable-M workloads found in MoE.
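The effect of tile granularity is simple arithmetic: rows are padded up to the next multiple of BLOCK_M, and everything beyond the real token count is wasted work. The sketch below compares BLOCK_M = 64 with BLOCK_M = 32 for a few illustrative per-expert token counts. The helper names and the sample values of M are ours.

```python
import math


def padded_rows(m: int, block_m: int) -> int:
    """Rows actually computed when M is rounded up to whole tiles."""
    return math.ceil(m / block_m) * block_m


def utilization(m: int, block_m: int) -> float:
    """Fraction of computed rows that carry real tokens."""
    return m / padded_rows(m, block_m)


for m in (8, 20, 40, 70):  # illustrative tokens-per-expert values
    print(
        f"M={m:3d}  BLOCK_M=64: {utilization(m, 64):.2f}  "
        f"BLOCK_M=32: {utilization(m, 32):.2f}"
    )
```

For M = 20, a 64-row tile is only about 31% useful while a 32-row tile is about 63% useful. For M = 40, both settings pad to 64 rows, so the gain depends on where M falls relative to the tile boundaries.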

SBO (Single-Batch-Overlap)
Why Not TBO

Two-Batch Overlap (TBO) brings limited benefit for decode on H20. Hopper WGMMA fixes block_m at 64, so splitting a small batch leads to redundant MLP GEMM computation. For large batches (64/128), the low-compute hardware cannot meet TPOT SLAs.

How SBO Works
SBO Diagram

To improve Decode throughput without violating SLAs, we adopt SBO, modifying DeepEP and DeepGEMM. The design is based on communication-computation alignment granularity. We observe that during communication overlap, token packets arrive at receivers out of order due to NIC and other factors.