Overview
We are excited to announce the official collaboration between SGLang and AutoRound, supporting low-bit quantization for efficient LLM inference. Through this integration, developers can use AutoRound's signed gradient optimization techniques to quantize large models and deploy them directly in SGLang's efficient runtime, achieving low-bit model inference while minimizing accuracy loss and significantly reducing latency.
What is AutoRound?
AutoRound is an advanced post-training quantization (PTQ) toolkit designed specifically for Large Language Models (LLMs) and Vision-Language Models (VLMs). It leverages signed gradient descent to jointly optimize weight rounding and clipping ranges, enabling low-bit quantization from INT2 to INT8 with minimal accuracy loss in most scenarios. For example, at INT2 precision, its relative accuracy is up to 2.1x higher than popular baselines; it maintains a leading advantage at INT4 precision as well. The figure below shows an overview of the AutoRound core algorithm.

AutoRound Algorithm Overview
For complete technical details, see the AutoRound paper: Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs.
Despite its strong performance, AutoRound remains efficient and lightweight—in lightweight mode, it takes only 37 minutes to quantize a 72B model on a single GPU. It also supports mixed-bit tuning, lm-head quantization, as well as export to GPTQ/AWQ/GGUF formats and custom tuning recipes.
AutoRound Highlights
AutoRound not only focuses on algorithmic innovation but is also widely recognized for its comprehensive quantization engineering capabilities.
- Accuracy: Provides superior accuracy at low-bit precisions

Average Accuracy on 10+ Tasks with INT4 Weights
- Quantization Schemes: Supports weight-only quantization, weight & activation quantization, and dynamic/static modes for activation quantization
- Mixed-Bit: Effective algorithms for generating mixed-bit/other data type schemes in minutes
- Wide Compatibility:
- Supports nearly all popular LLM architectures and 10+ VLMs
- Devices: CPU, Intel GPU, CUDA
- Data types: INT2-INT8, MXFP4, NVFP4, FP8, MXFP8
- Efficiency: Block-wise tuning reduces VRAM usage while maintaining high throughput and speed

Quantization Time Cost Comparison
- Community Adoption: Seamless integration with SGLang, TorchAO, Transformers, and vLLM; approximately 2 million downloads from HuggingFace model repositories (e.g., Intel, OPEA, Kaitchup, fbaldassarri)
- Export Formats: AutoRound, GPTQ, AWQ, GGUF, Compressed-tensor (preliminary support)
Integration Overview
SGLang provides a new generation inference runtime that supports scalable, low-latency LLM deployment. Its multi-modal, multi-GPU, and streaming execution models are exceptionally efficient for chat and agent inference tasks.
SGLang's flexible architecture now provides native quantized model loading hooks, unlocking AutoRound's full potential in deployment.
1. Quantizing with AutoRound
AutoRound automatically optimizes weight rounding and exports SGLang-compatible quantized weights.
1.1 API Usage
# for LLM
from auto_round import AutoRound
model_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-autoround-4bit"
# Scheme examples: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
scheme = "W4A16"
format = "auto_round"
autoround = AutoRound(model_id, scheme=scheme)
autoround.quantize_and_save(quant_path, format=format) # quantize and save1.2 CMD Usage
auto-round \
--model Qwen/Qwen2-VL-2B-Instruct \
--bits 4 \
--group_size 128 \
--format "auto_round" \
--output_dir ./tmp_autoround2. Deploying with SGLang
SGLang (version >= v0.5.4.post2) directly supports AutoRound quantized models, compatible with common LLM, VLM, and MoE models, and supports inference and evaluation of mixed-bit quantized models.
2.1 OpenAI-Compatible Inference
from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process
# Equivalent to terminal command:
# python3 -m sglang.launch_server --model-path Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound --host 0.0.0.0
server_process, port = launch_server_cmd(
"""
python3 -m sglang.launch_server --model-path Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound \
--host 0.0.0.0 --log-level warning
"""
)
wait_for_server(f"http://localhost:{port}")2.2 Offline Engine API Inference
import sglang as sgl
llm = sgl.Engine(model_path="Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound")
prompts = ["Hello, my name is"]
sampling_params = {"temperature": 0.6, "top_p": 0.95}
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print(f"Prompt: {prompt}\nGenerated text: {output['text']}")More flexible configuration and deployment options await your exploration!
Quantization Roadmap
AutoRound quantization benchmark results show that accuracy remains strong at low precisions. The table below highlights its advantages in MXFP4, NVFP4, and mixed-bit quantization. Accuracy is based on the average of lambada_openai, hellaswag, piqa, winogrande, and mmlu tasks.
The future roadmap includes improving MXFP4 & NVFP4 accuracy for common models, as well as automatic mixed-bit quantization.
- MXFP4 & NVFP4 Quantization (RTN as baseline, 'alg_ext' indicates experimental optimization algorithms)
| MXFP4 | llama3.1-8B-Instruct | Qwen2-7.5-Instruct | Phi4 | Qwen3-32B |
|---|---|---|---|---|
| RTN | 0.6212 | 0.6550 | 0.7167 | 0.6901 |
| AutoRound | 0.6686 | 0.6758 | 0.7247 | 0.7211 |
| AutoRound+alg_ext | 0.6732 | 0.6809 | 0.7225 | 0.7201 |
| NVFP4 | llama3.1-8B-Instruct | Qwen2-7.5-Instruct | Phi4 | Qwen3-32B |
|---|---|---|---|---|
| RTN | 0.6876 | 0.6906 | 0.7296 | 0.7164 |
| AutoRound | 0.6918 | 0.6973 | 0.7306 | 0.7306 |
| AutoRound+alg_ext | 0.6965 | 0.6989 | 0.7318 | 0.7295 |
- Automatic MXFP4 & MXFP8 Mixed-Bit Quantization
| Average Bits | Llama3.1-8B-I | Qwen2.5-7B-I | Qwen3-8B | Qwen3-32B |
|---|---|---|---|---|
| BF16 | 0.7076 (100%) | 0.7075 (100%) | 0.6764 (100%) | 0.7321 (100%) |
| 4-bit | 0.6626 (93.6%) | 0.6550 (92.6%) | 0.6316 (93.4%) | 0.6901 (94.3%) |
| 4.5-bit | 0.6808 (96.2%) | 0.6776 (95.8%) | 0.6550 (96.8%) | 0.7176 (98.0%) |
| 5-bit | 0.6857 (96.9%) | 0.6823 (96.4%) | 0.6594 (97.5%) | 0.7201 (98.3%) |
| 6-bit | 0.6975 (98.6%) | 0.6970 (98.5%) | 0.6716 (99.3%) | 0.7303 (99.8%) |
Conclusion
The integration of AutoRound with SGLang marks an important milestone in efficient AI model deployment. This collaboration bridges precision optimization with runtime scalability, allowing developers to transition seamlessly from quantization to real-time inference. AutoRound's signed gradient quantization maintains high fidelity even at extreme compression ratios, while SGLang's high-throughput inference engine unleashes low-bit execution potential across CPUs, GPUs, and multi-node clusters.
Looking ahead, we will expand support for advanced quantization formats, optimize kernel efficiency, and bring AutoRound quantization to a broader range of multi-modal and agent tasks. Together, AutoRound and SGLang set a new standard for intelligent, efficient, and scalable LLM deployment.
© 2026 Winzheng.com 赢政天下 | 转载请注明来源并附原文链接