🚀 AutoRound Partners with SGLang: A New Era of Efficient Quantized Model Inference

Feb 4, 2026 1,011 Views - Read Source LMSYS

LMSYS AutoRound SGLang 模型量化 LLM推理后训练量化

Overview

We are excited to announce the official collaboration between SGLang and AutoRound, supporting low-bit quantization for efficient LLM inference. Through this integration, developers can use AutoRound's signed gradient optimization techniques to quantize large models and deploy them directly in SGLang's efficient runtime, achieving low-bit model inference while minimizing accuracy loss and significantly reducing latency.

What is AutoRound?

AutoRound is an advanced post-training quantization (PTQ) toolkit designed specifically for Large Language Models (LLMs) and Vision-Language Models (VLMs). It leverages signed gradient descent to jointly optimize weight rounding and clipping ranges, enabling low-bit quantization from INT2 to INT8 with minimal accuracy loss in most scenarios. For example, at INT2 precision, its relative accuracy is up to 2.1x higher than popular baselines; it maintains a leading advantage at INT4 precision as well. The figure below shows an overview of the AutoRound core algorithm.

AutoRound Algorithm Overview

For complete technical details, see the AutoRound paper: Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs.

Despite its strong performance, AutoRound remains efficient and lightweight—in lightweight mode, it takes only 37 minutes to quantize a 72B model on a single GPU. It also supports mixed-bit tuning, lm-head quantization, as well as export to GPTQ/AWQ/GGUF formats and custom tuning recipes.

AutoRound Highlights

AutoRound not only focuses on algorithmic innovation but is also widely recognized for its comprehensive quantization engineering capabilities.

Accuracy: Provides superior accuracy at low-bit precisions

Average Accuracy on 10+ Tasks with INT4 Weights

Quantization Schemes: Supports weight-only quantization, weight & activation quantization, and dynamic/static modes for activation quantization
Mixed-Bit: Effective algorithms for generating mixed-bit/other data type schemes in minutes
Wide Compatibility:
- Supports nearly all popular LLM architectures and 10+ VLMs
- Devices: CPU, Intel GPU, CUDA
- Data types: INT2-INT8, MXFP4, NVFP4, FP8, MXFP8
Efficiency: Block-wise tuning reduces VRAM usage while maintaining high throughput and speed

Quantization Time Cost Comparison

Community Adoption: Seamless integration with SGLang, TorchAO, Transformers, and vLLM; approximately 2 million downloads from HuggingFace model repositories (e.g., Intel, OPEA, Kaitchup, fbaldassarri)
Export Formats: AutoRound, GPTQ, AWQ, GGUF, Compressed-tensor (preliminary support)

Integration Overview

SGLang provides a new generation inference runtime that supports scalable, low-latency LLM deployment. Its multi-modal, multi-GPU, and streaming execution models are exceptionally efficient for chat and agent inference tasks.

SGLang's flexible architecture now provides native quantized model loading hooks, unlocking AutoRound's full potential in deployment.

1. Quantizing with AutoRound

AutoRound automatically optimizes weight rounding and exports SGLang-compatible quantized weights.

1.1 API Usage

# for LLM
from auto_round import AutoRound
model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-autoround-4bit"
# Scheme examples: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
scheme = "W4A16"
format = "auto_round"
autoround = AutoRound(model_id, scheme=scheme)
autoround.quantize_and_save(quant_path, format=format) # quantize and save

1.2 CMD Usage

auto-round \
    --model Qwen/Qwen2-VL-2B-Instruct \
    --bits 4 \
    --group_size 128 \
    --format "auto_round" \
    --output_dir ./tmp_autoround

2. Deploying with SGLang

SGLang (version >= v0.5.4.post2) directly supports AutoRound quantized models, compatible with common LLM, VLM, and MoE models, and supports inference and evaluation of mixed-bit quantized models.

2.1 OpenAI-Compatible Inference

from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

# Equivalent to terminal command:
# python3 -m sglang.launch_server --model-path Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound --host 0.0.0.0

server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model-path Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound \
 --host 0.0.0.0 --log-level warning
"""
)
wait_for_server(f"http://localhost:{port}")

2.2 Offline Engine API Inference

import sglang as sgl

llm = sgl.Engine(model_path="Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound")

prompts = ["Hello, my name is"]
sampling_params = {"temperature": 0.6, "top_p": 0.95}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

More flexible configuration and deployment options await your exploration!

Quantization Roadmap

AutoRound quantization benchmark results show that accuracy remains strong at low precisions. The table below highlights its advantages in MXFP4, NVFP4, and mixed-bit quantization. Accuracy is based on the average of lambada_openai, hellaswag, piqa, winogrande, and mmlu tasks.

The future roadmap includes improving MXFP4 & NVFP4 accuracy for common models, as well as automatic mixed-bit quantization.

MXFP4 & NVFP4 Quantization (RTN as baseline, 'alg_ext' indicates experimental optimization algorithms)

MXFP4	llama3.1-8B-Instruct	Qwen2-7.5-Instruct	Phi4	Qwen3-32B
RTN	0.6212	0.6550	0.7167	0.6901
AutoRound	0.6686	0.6758	0.7247	0.7211
AutoRound+alg_ext	0.6732	0.6809	0.7225	0.7201

NVFP4	llama3.1-8B-Instruct	Qwen2-7.5-Instruct	Phi4	Qwen3-32B
RTN	0.6876	0.6906	0.7296	0.7164
AutoRound	0.6918	0.6973	0.7306	0.7306
AutoRound+alg_ext	0.6965	0.6989	0.7318	0.7295

Automatic MXFP4 & MXFP8 Mixed-Bit Quantization

Average Bits	Llama3.1-8B-I	Qwen2.5-7B-I	Qwen3-8B	Qwen3-32B
BF16	0.7076 (100%)	0.7075 (100%)	0.6764 (100%)	0.7321 (100%)
4-bit	0.6626 (93.6%)	0.6550 (92.6%)	0.6316 (93.4%)	0.6901 (94.3%)
4.5-bit	0.6808 (96.2%)	0.6776 (95.8%)	0.6550 (96.8%)	0.7176 (98.0%)
5-bit	0.6857 (96.9%)	0.6823 (96.4%)	0.6594 (97.5%)	0.7201 (98.3%)
6-bit	0.6975 (98.6%)	0.6970 (98.5%)	0.6716 (99.3%)	0.7303 (99.8%)

Conclusion

The integration of AutoRound with SGLang marks an important milestone in efficient AI model deployment. This collaboration bridges precision optimization with runtime scalability, allowing developers to transition seamlessly from quantization to real-time inference. AutoRound's signed gradient quantization maintains high fidelity even at extreme compression ratios, while SGLang's high-throughput inference engine unleashes low-bit execution potential across CPUs, GPUs, and multi-node clusters.

Looking ahead, we will expand support for advanced quantization formats, optimize kernel efficiency, and bring AutoRound quantization to a broader range of multi-modal and agent tasks. Together, AutoRound and SGLang set a new standard for intelligent, efficient, and scalable LLM deployment.