SGLang Adds Same-Day Support for Efficient Open-Source Nemotron 3 Nano Mixed MoE Model

Update on January 28: NVIDIA has released the NVFP4-precision Nemotron 3 Nano model. SGLang supports the model out of the box. It uses a new method called Quantization-Aware Distillation (QAD) to maintain accuracy under NVFP4 while achieving a 4x throughput improvement on B200 compared with FP8 on H100. You can download the NVFP4 checkpoint and run it with the NVIDIA Brev launcher.

We are excited to announce that SGLang has added support for the latest high-efficiency NVIDIA Nemotron 3 Nano model on the same day as its release!

Nemotron 3 Nano, from the newly released open-source Nemotron 3 series, is a compact MoE language model that delivers industry-leading computational efficiency and accuracy, helping developers build specialized agentic AI systems.

The model is fully open source, including weights, datasets, and training recipes, allowing developers to customize, optimize, and deploy on their own infrastructure for maximum privacy and security. The figure below shows that Nemotron 3 Nano ranks in the optimal quadrant of Artificial Analysis's openness vs intelligence index chart.

Figure: NVIDIA Nemotron 3 Nano sets a new standard for open-source AI, ranking in the best quadrant of Artificial Analysis's openness vs. intelligence index chart.

TL;DR

  • Architecture: Mixture of Experts (MoE) on a hybrid Transformer-Mamba architecture; supports Thinking Budget to reach the best accuracy with a minimal number of generated tokens
  • Accuracy: Leading in domains such as coding, science reasoning, math, and instruction following
  • Model Size: 30B parameters with 3.6B active parameters
  • Context Length: 1M
  • Input/Output: Text
  • Supported GPUs: NVIDIA RTX Pro 6000, DGX Spark, H100, B200
  • Quick Start: see "Installation and Quick Start" below

Installation and Quick Start

To simplify setup with SGLang, refer to the getting started manual, or use the NVIDIA Brev launcher.

Run the following command to install dependencies:

uv pip install sglang==0.5.6.post3.dev1278+gad1b4e472 --extra-index-url https://sgl-project.github.io/whl/nightly/

Then launch the server:

# BF16
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16 --trust-remote-code --reasoning-parser nano_v3 --tool-call-parser qwen3_coder

# FP8
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-FP8 --trust-remote-code --reasoning-parser nano_v3 --tool-call-parser qwen3_coder

# NVFP4
python3 -m sglang.launch_server --model-path nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-NVFP4 --trust-remote-code --reasoning-parser nano_v3 --tool-call-parser qwen3_coder
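
The three commands above differ only in the checkpoint name. As a minimal sketch, the helper below (the names `CHECKPOINTS` and `build_launch_command` are hypothetical, introduced here for illustration) builds the same argument list for a chosen precision, which can be handy when scripting deployments. It uses only the flags and model paths already shown above.

```python
import shlex

# Checkpoint names copied from the launch commands above, keyed by precision.
CHECKPOINTS = {
    "BF16": "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16",
    "FP8": "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-FP8",
    "NVFP4": "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-NVFP4",
}


def build_launch_command(precision: str) -> list[str]:
    """Hypothetical helper: return the argv list for one of the commands above."""
    if precision not in CHECKPOINTS:
        raise ValueError(f"unknown precision {precision!r}; expected one of {sorted(CHECKPOINTS)}")
    return [
        "python3", "-m", "sglang.launch_server",
        "--model-path", CHECKPOINTS[precision],
        "--trust-remote-code",
        "--reasoning-parser", "nano_v3",
        "--tool-call-parser", "qwen3_coder",
    ]


# Print the command as a single shell-ready string.
print(shlex.join(build_launch_command("FP8")))
```

The returned list can be passed directly to `subprocess.Popen` to start the server from Python.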

Once the server is running, use the following code to prompt the model:

from openai import OpenAI

# Model name used when launching the server
SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-Nano-3-30B-A3B-BF16"

BASE_URL = "http://localhost:30000/v1"  # 30000 is SGLang's default port
API_KEY = "EMPTY"  # SGLang server requires no API key by default

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about SGLang."}
    ],
    temperature=0.6,
    max_tokens=512,
)
print(resp.choices[0].message.reasoning_content, resp.choices[0].message.content)
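
The response carries the reasoning trace and the final answer in separate fields. The sketch below shows one defensive way to read them; `split_reply` is a hypothetical helper, and it assumes that `reasoning_content` may be missing or `None` (for example, if the server was launched without `--reasoning-parser`). The stand-in objects only imitate the shape of a chat message so the helper can be tried without a running server.

```python
from types import SimpleNamespace


def split_reply(message):
    """Return (reasoning, answer) from a chat message object.

    Assumption: `reasoning_content` is only populated when a reasoning
    parser is enabled, so a missing or None value is treated as empty.
    """
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = getattr(message, "content", None) or ""
    return reasoning.strip(), answer.strip()


# Stand-in messages that mimic the two cases.
with_reasoning = SimpleNamespace(reasoning_content="Think first.\n", content="Final answer.")
without_reasoning = SimpleNamespace(content="Final answer.")

print(split_reply(with_reasoning))
print(split_reply(without_reasoning))
```

With a live server, the same helper would be called as `split_reply(resp.choices[0].message)`.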

Nemotron 3 Nano: Highest Efficiency and Leading Accuracy for Building AI Agents

Nemotron 3 Nano is built on a hybrid Mamba-Transformer architecture: it replaces standard FFN layers with MoE layers and converts most attention layers to Mamba-2, reaching higher accuracy while activating only a fraction of its parameters for each token. The MoE design reduces the compute needed per token, which helps meet the low-latency needs of real-time deployment.
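
To make the "only a fraction of parameters active" idea concrete, here is a toy top-k MoE layer in plain Python. It is a generic illustration of expert routing, not Nemotron 3 Nano's actual implementation: the expert count, `top_k`, dimensions, and the scaling "experts" are all made-up values chosen for readability.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token, router_weights, experts, top_k):
    """Route one token to its top_k experts and mix their outputs."""
    # Router: one score per expert (dot product of router row and token).
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep only the top_k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Normalize the chosen scores into mixing weights.
    gates = softmax([scores[i] for i in chosen])
    out = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)  # only the chosen experts are evaluated
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, chosen


NUM_EXPERTS, TOP_K, DIM = 8, 2, 4  # illustrative sizes, not the model's
rng = random.Random(0)
router_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
# Each toy expert simply scales its input by a different factor.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen = moe_forward(token, router_weights, experts, TOP_K)
print(f"experts run: {chosen} ({len(chosen)} of {NUM_EXPERTS})")
# Using the sizes quoted in this post: 3.6B active out of 30B total.
print(f"active-parameter fraction: {3.6 / 30:.0%}")
```

Because only `top_k` experts run per token, compute per token scales with the active parameters rather than the total parameter count.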

The hybrid architecture increases token throughput by up to 4x, enabling faster inference alongside higher accuracy. The "Thinking Budget" feature caps how much the model reasons before it answers, which reduces overthinking and makes inference costs lower and more predictable.
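
One way to picture a thinking budget is as a two-phase generation loop on the client side. The sketch below is an illustration of the idea only, not SGLang's or NVIDIA's actual mechanism: `generate` is a hypothetical callable standing in for a completion request, the `</think>` delimiter is an assumption that should be checked against the model's chat template, and `fake_generate` is a stand-in model that would otherwise "think" indefinitely.

```python
END_OF_THINKING = "</think>"  # assumed delimiter; verify against the chat template


def generate_with_budget(generate, prompt, thinking_budget, answer_tokens):
    """Cap reasoning at `thinking_budget` tokens, then request the answer.

    `generate(text, max_tokens)` is a hypothetical callable that returns
    the model's continuation of `text`, at most `max_tokens` tokens long.
    """
    # Phase 1: let the model reason, but stop once the budget is spent.
    # Anything after a self-emitted delimiter is dropped and regenerated.
    reasoning = generate(prompt, thinking_budget).split(END_OF_THINKING)[0]
    # Phase 2: close the reasoning block ourselves and ask for the answer.
    answer = generate(prompt + reasoning + END_OF_THINKING, answer_tokens)
    return reasoning, answer


def fake_generate(text, max_tokens):
    """Stand-in model: rambles until the reasoning block is closed."""
    if text.endswith(END_OF_THINKING):
        return " ".join("The answer is 42.".split()[:max_tokens])
    return " ".join(["hmm"] * max_tokens)


reasoning, answer = generate_with_budget(
    fake_generate, "What is 6 x 7? <think>", thinking_budget=5, answer_tokens=16
)
print(f"reasoning ({len(reasoning.split())} tokens): {reasoning}")
print(f"answer: {answer}")
```

In this scheme the worst-case output length is `thinking_budget + answer_tokens`, which is what makes the cost of a request predictable.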

Figure: Nemotron 3 Nano offers higher throughput and leading accuracy among open-source reasoning models

Trained on NVIDIA's curated high-quality data, Nemotron 3 Nano leads on benchmarks including SWE Bench Verified, GPQA Diamond, AIME 2025, Arena Hard v2, and IFBench, making it suitable for building AI agents in enterprise scenarios such as finance, cybersecurity, software development, and retail.

Figure: Nemotron 3 Nano leads open-source small reasoning models on popular academic benchmarks

Acknowledgments

Thanks to all contributors for developing and integrating Nemotron 3 Nano into SGLang.

NVIDIA Team: Roi Koren, Max Xu, Netanel Haber, Tomer Bar Natan, Daniel Afrimi, Nirmal Kumar Juluru, Ann Guan, and others

SGLang Team and Community: Baizhou Zhang, Jiajun Li, Ke Bao, Mingyi Lu, Richard Chen