Update (January 28): NVIDIA has released an NVFP4 checkpoint of Nemotron 3 Nano. The model works with SGLang out of the box and uses a new method, Quantization-Aware Distillation (QAD), to maintain accuracy under NVFP4 while delivering a 4x throughput improvement on B200 compared to FP8 on H100. You can download the NVFP4 checkpoint from here and run it using the NVIDIA Brev launcher.
SGLang Adds Same-Day Support for Efficient NVIDIA Nemotron 3 Nano Model
We are excited to announce that SGLang has added support for the latest high-efficiency NVIDIA Nemotron 3 Nano model on the same day as its release!
Nemotron 3 Nano, from the newly released open-source Nemotron 3 series, is a compact MoE language model that delivers industry-leading computational efficiency and accuracy, helping developers build specialized agentic AI systems.
The model is fully open source, including weights, datasets, and training recipes, allowing developers to customize, optimize, and deploy on their own infrastructure for maximum privacy and security. The figure below shows that Nemotron 3 Nano ranks in the optimal quadrant of Artificial Analysis's openness vs intelligence index chart.

TL;DR
- Architecture: Mixture of Experts (MoE) with a hybrid Transformer-Mamba architecture; supports a Thinking Budget for the best accuracy with the fewest inference tokens
- Accuracy: Leading in domains such as coding, science reasoning, math, and instruction following
- Model Size: 30B parameters with 3.6B active parameters
- Context Length: 1M tokens
- Input/Output: Text
- Supported GPUs: NVIDIA RTX Pro 6000, DGX Spark, H100, B200
- Quick Start:
  - Download weights from Hugging Face - BF16, FP8, NVFP4
  - Inference with SGLang
  - Technical Report for building custom optimized models
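To put the model size and precision options in perspective, here is a back-of-the-envelope estimate of weight memory. It is only a sketch: it assumes every one of the 30B parameters is stored at the nominal bit width, which ignores quantization scale factors, any layers kept in higher precision, and the KV cache and activations needed at runtime.

```python
# Rough weight-memory estimate for a 30B-parameter checkpoint per precision.
# Assumption: all parameters are stored at the nominal width; real checkpoints
# add scale factors and may keep some tensors in higher precision.
TOTAL_PARAMS = 30e9    # from the TL;DR
ACTIVE_PARAMS = 3.6e9  # from the TL;DR

BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "NVFP4": 4}


def weight_gb(total_params: float, bits: int) -> float:
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits / 8 / 1e9


def active_fraction(active: float, total: float) -> float:
    """Return the share of parameters used for each token."""
    return active / total


for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: ~{weight_gb(TOTAL_PARAMS, bits):.0f} GB of weights")
print(f"Active per token: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.0%}")
```

Under these assumptions the weights come to roughly 60 GB in BF16, 30 GB in FP8, and 15 GB in NVFP4, with about 12% of the parameters active for any given token.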
Installation and Quick Start
To simplify setup with SGLang, refer to the getting started manual, or use the NVIDIA Brev launcher.
Run the following command to install dependencies:
uv pip install sglang==0.5.6.post3.dev1278+gad1b4e472 --extra-index-url https://sgl-project.github.io/whl/nightly/
Then launch the server:
# BF16
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16 --trust-remote-code --reasoning-parser nano_v3 --tool-call-parser qwen3_coder
# FP8
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-FP8 --trust-remote-code --reasoning-parser nano_v3 --tool-call-parser qwen3_coder
# NVFP4
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-NVFP4 --trust-remote-code --reasoning-parser nano_v3 --tool-call-parser qwen3_coder
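Loading a 30B checkpoint can take a while, so it helps to wait until the server answers before sending prompts. The sketch below polls the OpenAI-compatible `/v1/models` route. It assumes the server listens on `localhost:30000` (the address used by the client example in this post); the function names, retry count, and delay are illustrative choices, not part of SGLang.

```python
import json
import time
import urllib.request


def fetch_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return the model IDs reported by the OpenAI-compatible /models route."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url, fetch=fetch_models, attempts=60, delay=5.0,
                     sleep=time.sleep):
    """Poll until the server lists at least one model, or raise TimeoutError."""
    for _ in range(attempts):
        try:
            models = fetch(base_url)
            if models:
                return models
        except OSError:
            # URLError, ConnectionError, and socket timeouts all derive from
            # OSError: the server is not accepting requests yet.
            pass
        sleep(delay)
    raise TimeoutError(f"{base_url} not ready after {attempts} attempts")
```

Calling `wait_until_ready("http://localhost:30000/v1")` returns the list of served model names once the server is up.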
Once the server is running, use the following code to prompt the model:
from openai import OpenAI

# Model name used when launching the server
SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16"
BASE_URL = "http://localhost:30000/v1"
API_KEY = "EMPTY"  # SGLang server requires no API key by default

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)
resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about SGLang."},
    ],
    temperature=0.6,
    max_tokens=512,
)

# The reasoning parser returns the reasoning trace separately from the answer
message = resp.choices[0].message
print("Reasoning:", message.reasoning_content)
print("Answer:", message.content)
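For interactive use you may prefer to stream tokens as they are generated. The sketch below is one way to do that with the same OpenAI client. It assumes that, with `--reasoning-parser` enabled, streamed deltas carry the reasoning trace in a `reasoning_content` field, mirroring the non-streaming response above; `collect_stream` and `stream_once` are illustrative helper names.

```python
def collect_stream(chunks):
    """Accumulate reasoning and answer text from streamed chat chunks."""
    reasoning, answer = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some servers send a final chunk with no choices
        delta = chunk.choices[0].delta
        # reasoning_content is an extension field, so read it defensively
        reasoning_piece = getattr(delta, "reasoning_content", None)
        answer_piece = getattr(delta, "content", None)
        if reasoning_piece:
            reasoning.append(reasoning_piece)
        if answer_piece:
            answer.append(answer_piece)
    return "".join(reasoning), "".join(answer)


def stream_once(prompt, model, base_url="http://localhost:30000/v1"):
    """Send one streamed request and return (reasoning, answer)."""
    from openai import OpenAI  # imported here so collect_stream has no dependency

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
        max_tokens=512,
        stream=True,
    )
    return collect_stream(stream)
```

To print tokens as they arrive instead of collecting them, iterate over the stream directly and write each delta to the terminal.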
Nemotron 3 Nano: Highest Efficiency and Leading Accuracy for Building AI Agents
Nemotron 3 Nano is built on a hybrid Mamba-Transformer architecture: standard FFN layers are replaced with MoE layers and most attention layers are converted to Mamba-2, so the model achieves higher accuracy while activating only a fraction of its parameters for each token. MoE reduces the compute needed per token, which helps meet the low-latency needs of real-time deployment.
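To illustrate why an MoE layer only uses part of its parameters per token, here is a generic top-k routing sketch in plain Python. It is not Nemotron 3 Nano's actual router: the scalar input, the four toy experts, and `k=2` are all assumptions chosen to keep the example small.

```python
import math


def softmax(logits):
    """Convert router logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, k=2):
    """Run only the top-k experts on x and mix their outputs by router weight."""
    probs = softmax(router_logits)
    # Indices of the k experts with the highest router probability
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    output = sum(probs[i] / norm * experts[i](x) for i in top)
    return output, top


# Four toy experts; only two of them are evaluated for this input.
toy_experts = [lambda v: v, lambda v: 2 * v, lambda v: 3 * v, lambda v: 4 * v]
result, chosen = moe_forward(2.0, [0.0, 2.0, 1.0, 3.0], toy_experts, k=2)
print(f"experts used: {chosen}, output: {result:.4f}")
```

The unselected experts are never called, which is where the compute saving comes from: the layer holds all the experts' parameters but each token pays only for the ones it is routed to.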
Its hybrid architecture increases token throughput by up to 4x, enabling faster inference and higher accuracy. The "Thinking Budget" feature avoids unnecessary computation, reducing overthinking and ensuring lower and more predictable inference costs.

Trained on NVIDIA's curated high-quality data, Nemotron 3 Nano leads on benchmarks including SWE Bench Verified, GPQA Diamond, AIME 2025, Arena Hard v2, and IFBench, making it suitable for building AI agents in enterprise scenarios such as finance, cybersecurity, software development, and retail.
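Since the launch commands in this post pass `--tool-call-parser qwen3_coder`, the server can return structured tool calls through the OpenAI-compatible `tools` parameter. Below is a minimal sketch of a single agent step. The `get_stock_price` tool, its canned return value, and the helper names are hypothetical; a real agent would call an actual service and send the tool results back to the model for a final answer.

```python
import json

# JSON-schema description of one hypothetical tool
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Return the latest price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]


def get_stock_price(ticker: str) -> dict:
    """Hypothetical tool with a canned price; replace with a real lookup."""
    return {"ticker": ticker, "price": 123.45}


REGISTRY = {"get_stock_price": get_stock_price}


def run_tool_calls(tool_calls, registry=REGISTRY):
    """Execute each requested tool and build the 'tool' messages to send back."""
    messages = []
    for call in tool_calls or []:
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        result = registry[call.function.name](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages


def ask_with_tools(prompt, model, base_url="http://localhost:30000/v1"):
    """Send one request with tools attached and execute any tool calls."""
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
    )
    return run_tool_calls(resp.choices[0].message.tool_calls)
```

If the model answers directly without requesting a tool, `tool_calls` is `None` and `run_tool_calls` returns an empty list.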

Quick Start
- Download weights from Hugging Face - BF16, FP8, NVFP4
- Use the SGLang cookbook or NVIDIA Brev launcher for inference
Further Reading
- Share your ideas and vote to shape Nemotron's future
- Subscribe to NVIDIA news, follow NVIDIA Nemotron, and get updates on LinkedIn, X, YouTube, and the Discord Nemotron channel
Acknowledgments
Thanks to all contributors for developing and integrating Nemotron 3 Nano into SGLang.
NVIDIA Team: Roi Koren, Max Xu, Netanel Haber, Tomer Bar Natan, Daniel Afrimi, Nirmal Kumar Juluru, Ann Guan, and others
SGLang Team and Community: Baizhou Zhang, Jiajun Li, Ke Bao, Mingyi Lu, Richard Chen
© 2026 Winzheng.com 赢政天下 | Please credit the source and link to the original article when reposting