(Updated December 2)
We're excited to announce a major new feature for SGLang: native quantization support for NVIDIA Model Optimizer! This integration streamlines the entire model optimization and deployment workflow, allowing you to convert directly from full-precision models to high-performance quantized endpoints within the SGLang ecosystem.
Efficiently serving large language models (LLMs) is one of the biggest challenges in production environments. Model quantization is a key technique for reducing memory footprint and improving inference speed. Previously, this process required multi-step workflows and separate tools. Now, with our latest updates (PRs #7149, #9991, and #10154), quantization, export, and deployment all happen inside SGLang.
The combination of Model Optimizer and SGLang's optimizations can achieve up to 2x single-GPU throughput improvements for NVFP4 and FP8 inference.
New Feature: Direct ModelOpt API in SGLang
SGLang now directly integrates NVIDIA Model Optimizer, allowing you to call its powerful quantization API from within SGLang code.
This new feature enables a streamlined three-step workflow:
- Quantize: Apply advanced quantization techniques using the SGLang-ModelOpt interface, supporting NVFP4, MXFP4, FP8, and other accelerated low-precision inference formats.
- Export: Save optimized model files that are fully compatible with the SGLang runtime.
- Deploy: Load quantized models directly in the SGLang runtime and serve on NVIDIA platforms, immediately enjoying lower latency and memory savings.
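To give a feel for the memory savings mentioned above, here is a rough back-of-the-envelope sketch of weight storage at different precisions. The bits-per-weight figures are assumptions for illustration: they include one 8-bit scale per 16-element block for NVFP4 and per 32-element block for MXFP4, and they ignore layers that are commonly left in higher precision, the KV cache, and activations. The 8 billion parameter count is a nominal figure, not a measured one.

```python
# Rough weight-memory estimate for a dense model at different precisions.
# All bits-per-weight values below are illustrative assumptions.
BITS_PER_WEIGHT = {
    "bf16": 16.0,
    "fp8": 8.0,
    "nvfp4": 4.0 + 8.0 / 16,  # 4-bit values plus one 8-bit scale per 16 weights
    "mxfp4": 4.0 + 8.0 / 32,  # 4-bit values plus one 8-bit scale per 32 weights
}


def weight_memory_gib(num_params: float, fmt: str) -> float:
    """Approximate weight storage in GiB (weights only)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


if __name__ == "__main__":
    nominal_params = 8e9  # nominal size of an "8B" model
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:>6}: {weight_memory_gib(nominal_params, fmt):.1f} GiB")
```

Actual savings depend on which layers are quantized and on runtime overheads, so treat these numbers as an upper bound on what weight quantization alone can save.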
Performance Results
Models optimized through the new API deliver significant performance improvements. These optimizations compound with other components of the NVIDIA software and hardware stack and apply across form factors of the latest Blackwell architecture, from DGX Spark to GB300 NVL72.
The figure above shows NVIDIA B200 single-GPU throughput vs. end-to-end latency for DeepSeek-R1-0528 across multiple configurations, comparing Model Optimizer NVFP4 quantized models against native FP8 and NVFP4. (DeepSeek-R1-0528 is not yet supported in the initial release of this API.)
According to the latest InferenceMAX results, Model Optimizer combined with SGLang optimizations can achieve up to 2x single-GPU throughput compared to a native FP8 baseline. These performance gains will soon be available through the native integration described in this blog.
Quick Start Guide
SGLang provides an example script demonstrating the complete Model Optimizer quantization and export process. Make sure to install nvidia-modelopt and accelerate in your SGLang environment, then run the following code snippet:
import sglang as sgl
from sglang.srt.configs.device_config import DeviceConfig
from sglang.srt.configs.load_config import LoadConfig
from sglang.srt.configs.model_config import ModelConfig
from sglang.srt.model_loader.loader import get_model_loader

# Configure model, quantize with ModelOpt, and export
model_config = ModelConfig(
    model_path="Qwen/Qwen3-8B",
    quantization="modelopt_fp8",  # or "modelopt_fp4"
    trust_remote_code=True,
)
load_config = LoadConfig(
    modelopt_export_path="./quantized_qwen3_8b_fp8",
    modelopt_checkpoint_save_path="./checkpoint.pth",  # Optional, pseudo-quantized checkpoint
)
device_config = DeviceConfig(device="cuda")

# Load and quantize model (export happens automatically)
model_loader = get_model_loader(load_config, model_config)
quantized_model = model_loader.load_model(
    model_config=model_config,
    device_config=device_config,
)
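Before deploying, it can be useful to confirm that the export step actually wrote something to the export path. The helper below is a minimal sketch, not part of SGLang or Model Optimizer: `summarize_export` is a hypothetical name, and the expectation that the directory contains a `config.json` and `.safetensors` weight files is an assumption about a Hugging Face style checkpoint layout.

```python
from pathlib import Path


def summarize_export(export_dir: str) -> dict:
    """Return file names and sizes (bytes) found in an exported checkpoint directory."""
    root = Path(export_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"export directory not found: {root}")
    files = {p.name: p.stat().st_size for p in sorted(root.iterdir()) if p.is_file()}
    return {
        "files": files,
        "total_bytes": sum(files.values()),
        # Assumption: a Hugging Face style layout with config.json and safetensors shards.
        "has_config": "config.json" in files,
        "has_weights": any(name.endswith(".safetensors") for name in files),
    }
```

For the export path used above, this would be called as `summarize_export("./quantized_qwen3_8b_fp8")`; comparing `total_bytes` against the size of the original checkpoint gives a quick indication of whether the weights were written in a lower-precision format.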
After quantization and export, you can deploy the model with SGLang:
# Deploy the exported quantized model
python -m sglang.launch_server \
    --model-path ./quantized_qwen3_8b_fp8 \
    --quantization modelopt \
    --port 30000 --host 0.0.0.0
Or using the Python API:
import sglang as sgl
from transformers import AutoTokenizer


def main():
    # Deploy the exported ModelOpt quantized model
    llm = sgl.Engine(
        model_path="./quantized_qwen3_8b_fp8",
        quantization="modelopt"
    )

    # Format prompts using Qwen3-8B chat template
    tokenizer = AutoTokenizer.from_pretrained("./quantized_qwen3_8b_fp8")
    messages = [
        [{"role": "user", "content": "Hello, how are you?"}],
        [{"role": "user", "content": "What is the capital of France?"}]
    ]
    prompts = [
        tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
        for m in messages
    ]

    # Run inference
    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 512}
    outputs = llm.generate(prompts, sampling_params)
    for i, output in enumerate(outputs):
        print(f"Prompt: {prompts[i]}")
        print(f"Output: {output['text']}")


if __name__ == "__main__":
    main()
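If you launched the server with `sglang.launch_server` instead, a client can talk to it over HTTP. The sketch below assumes the server is reachable at `localhost:30000` (matching the `--port` flag above) and uses the OpenAI-compatible `/v1/chat/completions` route; the `model` value and the helper names `build_chat_request` and `send` are illustrative assumptions, not part of SGLang.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumption: same host, port from --port 30000


def build_chat_request(prompt, model="./quantized_qwen3_8b_fp8", max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,  # assumption: the served model is addressed by its path
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.8,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send(build_chat_request("What is the capital of France?")))` would print the model's reply. Because the route follows the OpenAI chat format, the chat template is applied on the server side, so no tokenizer is needed in the client.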
Conclusion
This native Model Optimizer integration reinforces SGLang's commitment to being a simple and powerful platform for LLM inference. We will continue to bridge the gap between high-performance model optimization and deployment.
We look forward to the performance improvements you'll achieve with this new feature! Please visit our GitHub repository to pull the latest version and try it out.
Join the dedicated Slack channel #modelopt to discuss ModelOpt, quantization, and low-precision numerics. If you haven't joined the workspace yet, please join it first.
Acknowledgments
NVIDIA Team: Zhiyu Cheng, Jingyu Xin, Huizi Mao, Eduardo Alvarez, Pen Chung Li, Omri Almog
SGLang Team and Community: Qiaolin Yu, Xinyuan Tong