SGLang and NVIDIA Partner to Accelerate InferenceMAX Benchmark and GB200 Performance

Feb 4, 2026 - Source: LMSYS

LMSYS SGLang NVIDIA Blackwell InferenceMAX GB200 MoE Optimization

Deep Collaboration between SGLang and NVIDIA

The SGLang and NVIDIA teams have a long-running collaboration, continuously delivering inference optimizations and system-level improvements that keep the SGLang framework performing at a high level. Recently, the collaboration has focused on NVIDIA Blackwell, NVIDIA's latest datacenter GPU architecture. By combining Blackwell features such as FP8 attention and NVFP4 MoE with a PD-disaggregated Expert Parallelism architecture, SGLang achieves breakthrough performance at high throughput. On the NVIDIA GB200 NVL72 system, SGLang delivers 26k input tokens/second per GPU for prefill and 13k output tokens/second for decode on the DeepSeek R1 model, setting a new mark for cost and energy efficiency at scale.

This joint achievement is further demonstrated in SGLang's performance on the newly released SemiAnalysis InferenceMAX v1 benchmark. InferenceMAX is a continuous benchmarking framework that runs inference tests across different input/output configurations and updates results daily.

When running DeepSeek R1 models on Blackwell GPUs (GB200/B200), SGLang achieves up to a 4x performance improvement over the previous-generation Hopper GPUs (H100/H200). This advantage holds across the entire Pareto frontier, the curve that captures the key tradeoff between latency and throughput.
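To make the Pareto-frontier comparison concrete, here is a minimal Python sketch of how such a frontier could be extracted from a set of benchmark runs. The function name `pareto_frontier`, the two axes (per-user interactivity and per-GPU throughput), and the sample numbers are illustrative assumptions, not InferenceMAX code or measured results.

```python
from typing import List, Tuple

# (interactivity in tokens/s per user, throughput in tokens/s per GPU).
# Higher is better on both axes. These axes are an assumption for illustration.
Point = Tuple[float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both axes at once."""
    # Visit points from highest to lowest interactivity; ties are ordered
    # so that the higher-throughput point comes first.
    ordered = sorted(points, key=lambda p: (-p[0], -p[1]))
    frontier: List[Point] = []
    best_throughput = float("-inf")
    for interactivity, throughput in ordered:
        # A point survives only if it improves on the best throughput seen
        # among all points with equal or higher interactivity.
        if throughput > best_throughput:
            frontier.append((interactivity, throughput))
            best_throughput = throughput
    return frontier


# Hypothetical sweep results, e.g. one point per concurrency setting.
sample_runs: List[Point] = [
    (10, 5000), (20, 4000), (20, 3500), (50, 1500), (40, 1000), (100, 300),
]

if __name__ == "__main__":
    for point in pareto_frontier(sample_runs):
        print(point)
```

Comparing two hardware generations "across the Pareto frontier" then means comparing the two frontiers at matching interactivity levels rather than comparing a single best-case number.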

SemiAnalysis InferenceMAX Benchmark

LLM inference performance is driven by two main pillars: hardware and software. Hardware innovations bring step-function improvements, while software evolves daily, providing continuous performance gains. The SemiAnalysis InferenceMAX™ benchmark aims to capture this dynamic by running benchmark suites on hundreds of chips nightly and tracking the real-time performance of popular open-source inference frameworks and models. The live dashboard is publicly accessible.

InferenceMAX™'s core objective is to cover the full spectrum of different GPUs, inference engines, and workloads. To ensure server configurations reflect real-world deployments, benchmark organizers require hardware vendors to submit configurations that align with their best practices.

SGLang was selected as the default inference engine for running DeepSeek models on NVIDIA and AMD hardware, demonstrating its highly specialized optimizations for these cutting-edge models.

The figure below shows results for the configuration with 1k input tokens and 8k output tokens, highlighting performance on Blackwell.

Figure 1: SGLang performance across different hardware platforms. (Source: https://inferencemax.ai/)

SGLang Optimizations for Large-Scale MoE Models

These performance improvements stem from deep system-level optimizations for large-scale Mixture-of-Experts (MoE) models.

Prefill-Decode Disaggregation and Large-Scale Expert Parallelism

LLM inference consists of two phases: compute-intensive Prefill (processing input prompts) and memory-intensive Decode (generating output tokens). A unified engine handling both creates inefficiencies, such as prefill batches interrupting decode streams.
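The interruption effect can be illustrated with a toy timing model. The sketch below rests on a deliberately simplified assumption: each decode step takes a fixed time, and in a unified engine a newly arrived prompt is prefilled before the next decode step runs. The function name and all timings are hypothetical, and real schedulers (for example, with chunked prefill) behave less starkly.

```python
from typing import List, Set


def decode_step_gaps(
    decode_ms: float,
    prefill_ms: float,
    arrival_steps: Set[int],
    num_steps: int,
    disaggregated: bool,
) -> List[float]:
    """Time (ms) an already-running stream waits for each of its next tokens.

    arrival_steps holds the decode step indices at which a new prompt arrives.
    """
    gaps: List[float] = []
    for step in range(num_steps):
        gap = decode_ms
        # In the unified toy model, the new prompt's prefill runs on the same
        # engine first, so the running stream stalls for the whole prefill.
        if not disaggregated and step in arrival_steps:
            gap += prefill_ms
        gaps.append(gap)
    return gaps


# Hypothetical timings: 20 ms per decode step, 300 ms per prefill batch,
# and one new prompt arriving at decode step 2.
unified = decode_step_gaps(20.0, 300.0, {2}, 5, disaggregated=False)
split = decode_step_gaps(20.0, 300.0, {2}, 5, disaggregated=True)

if __name__ == "__main__":
    print("unified worst gap (ms):", max(unified))
    print("disaggregated worst gap (ms):", max(split))
```

The model ignores the cost of moving the KV cache from the prefill engine to the decode engine, which a disaggregated deployment has to pay.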

SGLang addresses this through Prefill-Decode (PD) Disaggregation, separating the two phases into independent engines, enabling targeted scheduling and optimization. This architecture is crucial for efficiently implementing Large-Scale Expert Parallelism (EP), especially when using communication libraries like DeepEP. DeepEP employs different distribution patterns for prefill (high throughput) and decode (low latency), which unified engines cannot accommodate. After disaggregation, SGLang can select the optimal DeepEP mode for each phase, improving overall efficiency.
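As a rough illustration of why disaggregation helps here, the following sketch gives each engine its own expert-parallel dispatch mode. The class names, the `ep_dispatch_mode` field, and the mode labels `high_throughput` and `low_latency` are hypothetical names chosen to mirror the description above; they are not SGLang or DeepEP identifiers.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PREFILL = "prefill"
    DECODE = "decode"


@dataclass(frozen=True)
class EngineConfig:
    phase: Phase
    ep_dispatch_mode: str  # hypothetical label, not a real library flag


# Assumed mapping, following the text: prefill favors throughput,
# decode favors latency.
DISPATCH_MODE_BY_PHASE = {
    Phase.PREFILL: "high_throughput",
    Phase.DECODE: "low_latency",
}


def build_engine_config(phase: Phase) -> EngineConfig:
    """Pick the dispatch mode that suits the single phase this engine serves."""
    return EngineConfig(phase=phase, ep_dispatch_mode=DISPATCH_MODE_BY_PHASE[phase])


if __name__ == "__main__":
    for phase in Phase:
        print(build_engine_config(phase))
```

A unified engine would have to commit to one of the two modes for both phases, which is the limitation the disaggregated design avoids.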

Blackwell-Specific Kernel Optimizations

Collaboration with NVIDIA enabled us to develop and integrate optimized kernels that fully leverage Blackwell's new capabilities:

FP8 Attention: KV cache uses FP8 precision, halving memory access pressure during decode and enabling faster Tensor Core instructions, speeding up attention kernels and supporting larger batches and longer sequences.
NVFP4 GEMM: MoE experts and other GEMMs use the new NVFP4 precision, which reduces memory bandwidth pressure and leverages the FP4 Tensor Cores. It also halves token dispatch communication traffic and shrinks the weights in memory, leaving room for larger KV caches.
Compute-Communication Overlap: Blackwell systems' significantly improved communication bandwidth enables finer-grained overlap, efficiently hiding communication latency.
Optimized Kernels: We integrated a series of new kernels, including NVIDIA Blackwell DeepGEMM, FlashInfer's NVFP4 GEMM and FP8 attention kernels, Flash Attention CuTe, and CUTLASS MLA, all rewritten to leverage new features like TMA and cluster launch control.
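The memory savings claimed for the low-precision formats above can be checked with back-of-the-envelope arithmetic. The sketch below assumes a BF16 baseline (the text does not state what the "halving" is measured against) and assumes NVFP4 stores 4-bit values with one 8-bit scale per 16-element block, ignoring any per-tensor scale. The token count, layer count, and per-token cache width are illustrative values, not figures from this article.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    elems_per_token_per_layer: int,
    bytes_per_elem: float,
) -> float:
    """Total KV cache size in bytes for a given element width."""
    return num_tokens * num_layers * elems_per_token_per_layer * bytes_per_elem


def block_scaled_bits_per_elem(value_bits: int, scale_bits: int, block_size: int) -> float:
    """Average storage per element when each block shares one scale factor."""
    return value_bits + scale_bits / block_size


# Illustrative model shape (assumptions, not taken from the article).
NUM_TOKENS = 100_000
NUM_LAYERS = 61
ELEMS_PER_TOKEN_PER_LAYER = 576

bf16_bytes = kv_cache_bytes(NUM_TOKENS, NUM_LAYERS, ELEMS_PER_TOKEN_PER_LAYER, 2)
fp8_bytes = kv_cache_bytes(NUM_TOKENS, NUM_LAYERS, ELEMS_PER_TOKEN_PER_LAYER, 1)

# Assumed NVFP4 layout: 4-bit values, one 8-bit scale per 16 elements.
nvfp4_bits = block_scaled_bits_per_elem(value_bits=4, scale_bits=8, block_size=16)

if __name__ == "__main__":
    gib = 2 ** 30
    print(f"KV cache: BF16 {bf16_bytes / gib:.2f} GiB, FP8 {fp8_bytes / gib:.2f} GiB")
    print(f"NVFP4: {nvfp4_bits} bits/element, {nvfp4_bits / 8:.0%} of FP8")
```

Under these assumptions, an FP8 KV cache is exactly half the size of a BF16 one, and an NVFP4 payload is a little over half the size of an FP8 payload, which is consistent with the roughly halved dispatch traffic described above.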

For more information, please refer to the detailed technical blogs:

Figure 2: SGLang performance using Prefill-Decode Disaggregation and Expert Parallelism. (Source: https://lmsys.org/blog/2025-09-25-gb200-part-2/)

Future Collaboration

Moving forward, we will strengthen our collaboration with NVIDIA at the runtime and kernel levels, continuing to optimize the performance of the DeepSeek V3.2, GPT-OSS, and Qwen model families on the latest NVIDIA GPUs, from the compact DGX Spark to full-rack supercomputers like GB200 and GB300.

We will also work more closely with the SemiAnalysis team to make InferenceMAX benchmarks more systematic, reproducible, and reliable, and assist in validating our full-rack solutions.

Acknowledgments

Thanks to everyone in the community who contributed to this project.

NVIDIA Team: Trevor Morris, Kaixi Hou, Elfie Guo, Nicolas Castet, Faraz Khoubsirat, Ishan Dhanan, Shu Wang, Pavani Majety, Zihao Ye, Yingyi Huang, Alex Zhurkevich, Kushan Ahmadian, Pen Li, Juan Yu, Kedar Potar, Grace Ho, Lingjie Wu, Yiheng Zhang, Kyle Liang, and others

SGLang Team: Jingyi Chen, Baizhou Zhang, Jiexin Liang, Qiaolin Yu, Yineng Zhang, Ke Bao, Liangsheng Yin, Jianan Ji, Ying Sheng

SemiAnalysis Team: Dylan Patel, Kimbo Chen, Cam, and others