SGLang Inference Acceleration: Native Integration with NVIDIA Model Optimizer for Seamless Quantized Deployment

Feb 4, 2026 | Source: LMSYS

Tags: LMSYS, SGLang, NVIDIA Model Optimizer, model quantization, inference optimization, LLM deployment

(Updated December 2)

We're excited to announce a major new feature for SGLang: native support for NVIDIA Model Optimizer quantization! This integration streamlines the entire model optimization and deployment workflow, letting you go directly from a full-precision model to a high-performance quantized endpoint without leaving the SGLang ecosystem.

Serving large language models (LLMs) efficiently is one of the biggest challenges in production environments. Model quantization is a key technique for reducing memory footprint and improving inference speed. Previously, it required a multi-step workflow spread across separate tools. With our latest updates (PRs #7149, #9991, and #10154), quantization, export, and deployment can all be done from within SGLang.

Combined, Model Optimizer and SGLang's optimizations can deliver up to 2x higher single-GPU throughput for NVFP4 inference compared with FP8.

New Feature: Direct ModelOpt API in SGLang

SGLang now directly integrates NVIDIA Model Optimizer, allowing you to call its powerful quantization API from within SGLang code.

This new feature enables a streamlined three-step workflow:

Quantize: Apply advanced quantization techniques using the SGLang-ModelOpt interface, supporting NVFP4, MXFP4, FP8, and other accelerated low-precision inference formats.
Export: Save optimized model files that are fully compatible with the SGLang runtime.
Deploy: Load quantized models directly in the SGLang runtime and serve them on NVIDIA platforms, with lower latency and a smaller memory footprint.
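To put the memory savings from these formats in perspective, here is a back-of-the-envelope sketch. It is not part of SGLang or Model Optimizer: `weight_memory_gib` is a hypothetical helper, the bit widths are nominal, and the small per-block scale factors that block-scaled formats store alongside the weights are ignored, so real checkpoints will be somewhat larger.

```python
# Rough weight-memory estimate per format. Bit widths are nominal; the
# per-block scale factors of block-scaled formats are ignored (assumption).
NOMINAL_BITS_PER_WEIGHT = {
    "bf16": 16,
    "fp8": 8,
    "nvfp4": 4,
    "mxfp4": 4,
}


def weight_memory_gib(num_params: float, fmt: str) -> float:
    """Return the approximate weight storage in GiB for `num_params` weights."""
    bits = NOMINAL_BITS_PER_WEIGHT[fmt]
    return num_params * bits / 8 / 2**30


if __name__ == "__main__":
    # 8e9 is a round stand-in for an "8B" model, not an exact parameter count.
    for fmt in NOMINAL_BITS_PER_WEIGHT:
        print(f"{fmt:>6}: {weight_memory_gib(8e9, fmt):6.2f} GiB")
```

Under these assumptions, weight storage scales linearly with bit width, so FP8 needs about half the weight memory of BF16 and the 4-bit formats about a quarter. Activations and the KV cache are not included.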

Performance Results

Models optimized through the new API deliver significant performance improvements. These optimizations compound with other components of the NVIDIA software and hardware stack, and they apply across form factors of the latest Blackwell architecture, from DGX Spark to GB300 NVL72.

Figure: NVIDIA B200 single-GPU throughput vs. end-to-end latency for DeepSeek-R1-0528 across multiple configurations, comparing the Model Optimizer NVFP4-quantized model with native FP8 and NVFP4. (DeepSeek-R1-0528 is not yet supported in the initial release of this API.)

According to the latest InferenceMAX results, Model Optimizer combined with SGLang's optimizations can achieve up to 2x the single-GPU throughput of a native FP8 baseline. These gains will soon be available through the native integration described in this post.

Quick Start Guide

SGLang provides an example script demonstrating the complete Model Optimizer quantization and export process. Make sure to install nvidia-modelopt and accelerate in your SGLang environment, then run the following code snippet:

import sglang as sgl
from sglang.srt.configs.device_config import DeviceConfig
from sglang.srt.configs.load_config import LoadConfig
from sglang.srt.configs.model_config import ModelConfig
from sglang.srt.model_loader.loader import get_model_loader

# Configure model, quantize with ModelOpt, and export
model_config = ModelConfig(
    model_path="Qwen/Qwen3-8B",
    quantization="modelopt_fp8",  # or "modelopt_fp4"
    trust_remote_code=True,
)

load_config = LoadConfig(
    modelopt_export_path="./quantized_qwen3_8b_fp8",
    modelopt_checkpoint_save_path="./checkpoint.pth",  # Optional, pseudo-quantized checkpoint
)
device_config = DeviceConfig(device="cuda")

# Load and quantize model (export happens automatically)
model_loader = get_model_loader(load_config, model_config)
quantized_model = model_loader.load_model(
    model_config=model_config,
    device_config=device_config,
)
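Before deploying, it can be worth confirming that the export step actually wrote a usable checkpoint. The following is a minimal sketch, not part of SGLang or Model Optimizer: `check_export_dir` is a hypothetical helper, and it assumes the export is a Hugging Face-style directory containing a `config.json` and one or more `*.safetensors` weight files. Adjust the checks if your export layout differs.

```python
from pathlib import Path


def check_export_dir(export_path: str) -> list[str]:
    """Return a list of problems found in the export directory (empty if none).

    Assumption: the export is a Hugging Face-style checkpoint directory with a
    config.json and one or more *.safetensors weight files.
    """
    root = Path(export_path)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing config.json")
    if not any(root.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    return problems


if __name__ == "__main__":
    # Same path as modelopt_export_path in the snippet above.
    issues = check_export_dir("./quantized_qwen3_8b_fp8")
    print("export looks complete" if not issues else f"problems: {issues}")
```

An empty result only means the expected files are present; it does not validate the quantized weights themselves.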

After quantization and export, you can deploy the model with SGLang:

# Deploy the exported quantized model
python -m sglang.launch_server \
   --model-path ./quantized_qwen3_8b_fp8 \
   --quantization modelopt \
   --port 30000 --host 0.0.0.0

Or using the Python API:

import sglang as sgl
from transformers import AutoTokenizer

def main():
    # Deploy the exported ModelOpt quantized model
    llm = sgl.Engine(
        model_path="./quantized_qwen3_8b_fp8",
        quantization="modelopt",
    )

    # Format prompts using the Qwen3-8B chat template
    tokenizer = AutoTokenizer.from_pretrained("./quantized_qwen3_8b_fp8")

    messages = [
        [{"role": "user", "content": "Hello, how are you?"}],
        [{"role": "user", "content": "What is the capital of France?"}],
    ]

    prompts = [
        tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
        for m in messages
    ]

    # Run inference
    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 512}
    outputs = llm.generate(prompts, sampling_params)

    for i, output in enumerate(outputs):
        print(f"Prompt: {prompts[i]}")
        print(f"Output: {output['text']}")

if __name__ == "__main__":
    main()
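If you deployed with `sglang.launch_server` instead, the running server can be queried over HTTP. The sketch below uses only the Python standard library. It assumes the server is reachable at `http://localhost:30000` (the port from the launch command) and that it exposes SGLang's native `/generate` endpoint accepting a JSON body with `text` and `sampling_params`. The helper names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    base_url: str = "http://localhost:30000",
    max_new_tokens: int = 64,
) -> urllib.request.Request:
    """Build a POST request for the server's /generate endpoint (assumed API)."""
    payload = {
        "text": prompt,
        "sampling_params": {
            "temperature": 0.8,
            "top_p": 0.95,
            "max_new_tokens": max_new_tokens,
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["text"]
```

With the server up, `print(generate("What is the capital of France?"))` should print a completion. Note that this sends the raw prompt; for chat-style behavior, apply the chat template first as in the Python API example above.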

Conclusion

This native Model Optimizer integration reinforces SGLang's commitment to being a simple and powerful platform for LLM inference. We will continue to bridge the gap between high-performance model optimization and deployment.

We look forward to the performance improvements you'll achieve with this new feature! Please visit our GitHub repository to pull the latest version and try it out.

Join the dedicated Slack channel #modelopt to discuss ModelOpt, quantization, and low-precision numerics! If you haven't joined the workspace yet, please join it first.

Acknowledgments

NVIDIA Team: Zhiyu Cheng, Jingyu Xin, Huizi Mao, Eduardo Alvarez, Pen Chung Li, Omri Almog

SGLang Team and Community: Qiaolin Yu, Xinyuan Tong