Introduction
Deploying large-scale Mixture-of-Experts (MoE) models like DeepSeek-R1 requires balancing latency, throughput, and cost. The challenge is particularly pronounced on performance-asymmetric hardware such as H20 GPUs, which have high memory bandwidth but low compute throughput. We designed a serving stack that exploits H20's cost advantages while still meeting the strict SLAs expected of high-end GPUs.
This article outlines the best practices for achieving this goal, including hardware-aware deployment strategies that deviate from traditional approaches, along with a series of system and kernel-level optimizations:
- Hardware-aware parallelization: Single-node TP-8 for prefill phase and small-scale EP-16 for decode phase, meeting latency targets while reducing failure domains.
- Kernel-level optimizations: FlashMLA-FP8 and DeepGEMM swapAB to enhance H20 compute throughput.
- Scheduling and load balancing: Single-Batch Overlap (SBO) improves small-batch throughput, while an asynchronous Expert Affinity Load Balancer reduces cross-node communication.
- Lightweight observability: Diagnostic stack designed specifically for distributed MoE serving to quickly identify bottlenecks.
Experiments show that with these strategies each node achieves 16.5k input tokens/s and 5.7k output tokens/s on 4096-token input sequences. This represents SOTA performance on H20, and to our knowledge this work is the first comprehensive study covering deployment, optimization, and large-scale industrial practice.
H20 Challenges
H20's Importance
H20 GPUs are readily available, enabling Ant Group to build ultra-large-scale clusters. Even modest throughput improvements can result in significant daily cost savings.
H20 vs. H800 Comparison
| Specification | H20-96G | H800-80G |
|---|---|---|
| FP8 Compute | 296 TFLOPS | 1979 TFLOPS |
| FP16/BF16 Compute | 148 TFLOPS | 989 TFLOPS |
| Memory Capacity | 96 GB | 80 GB |
| Memory Bandwidth | 4000 GB/s | 3352 GB/s |
| NVLink Bandwidth | 900 GB/s | 400 GB/s |
| RDMA NIC Bandwidth | 4 × 400 Gb/s | 8 × 400 Gb/s |
Compared with H800, H20 offers more memory (96 GB vs. 80 GB), higher memory bandwidth (4000 GB/s vs. 3352 GB/s), and more than 2x the NVLink bandwidth (900 GB/s vs. 400 GB/s), but much weaker compute (roughly 15% of the FP8 and BF16 throughput) and half the RDMA NIC bandwidth. Inference, especially the decode phase, is often memory-bound, which is where H20's memory bandwidth and capacity pay off. We design our optimizations around this profile to maximize inference throughput.
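One way to read the table is through a roofline-style ridge point: the arithmetic intensity (FLOPs per byte of memory traffic) above which a kernel becomes compute-bound rather than bandwidth-bound. The sketch below uses only the peak figures from the table; real kernels reach a fraction of these peaks, so treat the numbers as rough guides rather than measurements.

```python
# Peak figures copied from the comparison table above.
SPECS = {
    "H20-96G": {"fp8_tflops": 296, "mem_bw_gbs": 4000},
    "H800-80G": {"fp8_tflops": 1979, "mem_bw_gbs": 3352},
}

def ridge_point(fp8_tflops: float, mem_bw_gbs: float) -> float:
    """FLOPs per byte at which peak compute and peak memory bandwidth balance."""
    return (fp8_tflops * 1e12) / (mem_bw_gbs * 1e9)

for name, spec in SPECS.items():
    print(f"{name}: {ridge_point(**spec):.0f} FLOP/byte")
# H20-96G: 74 FLOP/byte
# H800-80G: 590 FLOP/byte
```

H20 hits its compute ceiling at a far lower intensity than H800. Low-intensity work such as decode can still benefit from H20's higher bandwidth, while compute-heavy prefill saturates the GPU much sooner.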
Solution: Optimizations and Strategies on H20
Deployment Strategy
Prefill
- SLA: Prefill is compute-intensive, and multi-node DP+EP increases time-to-first-token (TTFT), violating SLAs. Single-node TP keeps TTFT within targets.
- Elastic scaling: Prefill needs to scale with KV cache; single-node TP simplifies resource and cache management.
Decode
- Hardware characteristics: Compared with H800, H20 trades compute for larger memory and higher NVLink bandwidth, which lets decode use the KV cache efficiently while keeping MoE communication on high-bandwidth NVLink.
- Failure domain: Small-scale EP configuration limits the impact of decode or GPU failures. EP high availability (HA) is not yet mature, making small EP more reliable.
Optimizations
Prefill
Observations
- MLA is more costly than MHA for long sequences.
- MoE latency is unexpectedly high despite low compute volume.
- `embed/mlp all reduce + RMSNorm + fused_qkv_a_proj_with_mqa` introduces redundant communication and computation in TP.
Solutions
- MHA/MLA: Introduce tunable parameter `se = extend × (extend + prefix)` to select MHA or MLA based on batch size and sequence length.
- MoE: Optimize `b_scale` computation, restructure `down proj` input access with TMA, and tune configuration based on real expert distribution.
- TP optimization: Optimize `embed/mlp reduce scatter + RMSNorm + fused_qkv_a_proj_with_mqa + all gather` to reduce computation and communication.
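The MHA/MLA selection can be sketched as a simple threshold on `se`. This is a minimal illustration, not the production logic: it assumes `extend` is the number of new tokens being prefilled and `prefix` the cached prefix length, that large `se` favors MHA (following the observation that MLA costs more on long sequences), and that the threshold is a hypothetical value tuned per deployment.

```python
def attention_cost_proxy(extend: int, prefix: int) -> int:
    """se = extend x (extend + prefix), the tunable quantity named in the text."""
    return extend * (extend + prefix)

def choose_attention(extend: int, prefix: int, se_threshold: int) -> str:
    """Pick an attention path from se.

    Assumption: above the threshold MHA is used, below it MLA.
    The threshold itself is hypothetical and would be found by profiling.
    """
    if attention_cost_proxy(extend, prefix) >= se_threshold:
        return "MHA"
    return "MLA"

HYPOTHETICAL_SE_THRESHOLD = 1 << 20  # illustrative only

print(choose_attention(4096, 0, HYPOTHETICAL_SE_THRESHOLD))   # long prompt, no cached prefix
print(choose_attention(16, 1024, HYPOTHETICAL_SE_THRESHOLD))  # short extension of a cached prefix
```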
Decode
Load Balancing
Expert Affinity EPLB
Standard EPLB only balances intra-GPU load and ignores correlations between experts, so frequently co-activated experts are often scattered across nodes, which increases cross-node communication. We extend EPLB to track top-k expert co-activations and build an expert affinity matrix. A post-adjustment step then keeps highly co-activated experts on the same node, improving performance by ~5% over the baseline.
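To make the affinity idea concrete, here is a small sketch that counts pairwise co-activations from a routing trace and scores a placement by how many co-activations cross node boundaries. The function names, the toy trace, and the scoring rule are illustrative assumptions; the actual post-adjustment algorithm is not described in this article.

```python
from collections import Counter
from itertools import combinations

def build_affinity(topk_per_token):
    """Count how often each unordered pair of experts is selected for the same token."""
    affinity = Counter()
    for experts in topk_per_token:
        for a, b in combinations(sorted(set(experts)), 2):
            affinity[(a, b)] += 1
    return affinity

def cross_node_coactivations(affinity, expert_to_node):
    """Sum co-activation counts of expert pairs that sit on different nodes."""
    return sum(
        count
        for (a, b), count in affinity.items()
        if expert_to_node[a] != expert_to_node[b]
    )

# Toy trace: 4 experts, top-2 routing, 2 nodes with 2 experts each.
trace = [(0, 2), (0, 2), (0, 2), (1, 3), (1, 3), (0, 1)]
affinity = build_affinity(trace)

scattered = {0: 0, 1: 0, 2: 1, 3: 1}  # experts 0 and 2 split across nodes
grouped = {0: 0, 2: 0, 1: 1, 3: 1}    # co-activated experts share a node

print(cross_node_coactivations(affinity, scattered))  # 5
print(cross_node_coactivations(affinity, grouped))    # 1
```

Both placements carry the same token load per node in this toy trace, so the difference comes purely from where co-activated experts live.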
Asynchronous Dynamic Load Adjustment
Static EPLB tightly couples load balancing with inference; migration blocks inference causing delays. We decouple them to run in parallel, using a layered transfer strategy to minimize migration impact, achieving performance matching or exceeding static EPLB while maintaining >70% load balance ratio.
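The article does not define the load balance ratio. One common definition, assumed here purely for illustration, is mean load divided by peak load across GPUs, so 1.0 means perfectly even and lower values mean a hotter straggler.

```python
def load_balance_ratio(tokens_per_gpu):
    """Mean load over max load (an assumed definition, not taken from the article)."""
    peak = max(tokens_per_gpu)
    if peak == 0:
        return 1.0  # no traffic at all counts as balanced
    mean = sum(tokens_per_gpu) / len(tokens_per_gpu)
    return mean / peak

# Hypothetical per-GPU token counts for one MoE layer.
print(load_balance_ratio([700, 1000, 600, 500]))  # mean 700 / peak 1000
```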
Computation
FP8 MLA
BF16 FlashMLA performs well but leaves room for optimization. We implement end-to-end FP8 attention on Hopper (SM90), using TMA for memory transfer and WGMMA for computation. Two warp groups pipeline QK^T and PV, which reduces shared memory pressure and overlaps compute with memory access. This achieves a ~70% speedup over BF16 and a further ~5% gain over previous FP8 versions.
SwapAB GEMM
Hopper WGMMA PTX constraints fix M at 64 and require N to be a multiple of 8, so tiling along M is coarse and wastes compute when M is small or irregular. We introduce swapAB, which maps the problem's M dimension onto WGMMA's N dimension. This enables a finer-grained BLOCK_M (32) and improves throughput for the variable-M workloads found in MoE.
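The tiling waste is easy to quantify. The sketch below compares how many rows are actually computed when a per-expert token count M is rounded up to 64-row tiles versus 32-row tiles. The sample M values are illustrative, and the model ignores every other effect of swapAB on the kernel.

```python
import math

WGMMA_M = 64   # M tile fixed by Hopper WGMMA, per the text above
SWAPAB_M = 32  # finer BLOCK_M made possible by swapAB, per the text above

def padded_rows(m: int, block_m: int) -> int:
    """Rows actually computed when m rows are rounded up to whole tiles."""
    return math.ceil(m / block_m) * block_m

def utilization(m: int, block_m: int) -> float:
    """Fraction of computed rows that carry real tokens."""
    return m / padded_rows(m, block_m) if m > 0 else 0.0

for m in (8, 20, 40, 70):  # illustrative per-expert token counts
    print(m, padded_rows(m, WGMMA_M), padded_rows(m, SWAPAB_M))
# 8 64 32
# 20 64 32
# 40 64 64
# 70 128 96
```

For M = 20, the 64-row tile does useful work on about 31% of its rows, while the 32-row tile reaches 62.5%.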
SBO (Single-Batch-Overlap)
Why Not TBO
Two-Batch Overlap (TBO) brings limited benefit for decode on H20. Hopper WGMMA fixes block_m at 64, so splitting a small batch produces redundant MLP GEMM work, while for large batches (64/128) the low-compute hardware cannot meet TPOT SLAs.
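A rough way to see the small-batch redundancy: if each micro-batch's MLP GEMM is padded independently to the 64-row tile (an assumption made for this sketch), splitting a batch in two can double the rows computed without adding any real tokens.

```python
import math

BLOCK_M = 64  # tile height fixed by Hopper WGMMA, per the text above

def padded_rows(m: int, block_m: int = BLOCK_M) -> int:
    """Rows computed after rounding m up to whole tiles."""
    return math.ceil(m / block_m) * block_m

def rows_single_batch(batch: int) -> int:
    """Rows computed when the whole batch goes through one GEMM."""
    return padded_rows(batch)

def rows_two_batch(batch: int) -> int:
    """Rows computed when the batch is split into two micro-batches."""
    first = batch // 2
    second = batch - first
    return padded_rows(first) + padded_rows(second)

for batch in (16, 32, 128):  # illustrative decode batch sizes
    print(batch, rows_single_batch(batch), rows_two_batch(batch))
# 16 64 128
# 32 64 128
# 128 128 128
```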
How SBO Works
To improve Decode throughput without violating SLAs, we adopt SBO, modifying DeepEP and DeepGEMM. The design is based on communication-computation alignment granularity. We observe that during communication overlap, token packets arrive at receivers out of order due to NIC and other factors.
© 2026 Winzheng.com 赢政天下 | Please credit the source and link to the original article when reposting.