SGLang Adds Day-0 Support for NVIDIA Nemotron 3 Super for building High-Efficiency Multi-Agent Systems

NVIDIA Nemotron Team, March 11, 2026

We are excited to announce that SGLang supports NVIDIA Nemotron 3 Super on Day 0.

Nemotron 3 Super is a leading open model in the Nemotron 3 family, built for running many collaborating agents together. Agentic systems that chain planning, reasoning, and tools produce far more tokens than single-turn chat; they also need strong reasoning on every step.

Nemotron 3 Super is a 120B-parameter hybrid MoE that activates only 12B parameters per forward pass, giving you leading accuracy for coding, tool calling, and instruction following at a fraction of the cost—plus a 1M-token context so agents keep conversation and plan state in view across long workflows.

Artificial Analysis chart showing Nemotron 3 Super leading on intelligence vs. openness when compared to popular open models of similar size

As the chart above shows, Nemotron 3 Super leads on the Artificial Analysis Openness Index. Compared with other open models, Nemotron is fully open, with open weights, datasets, and recipes, so developers can customize, optimize, and deploy it on their own infrastructure for maximum privacy and security.

In this post we walk through installing SGLang and serving Nemotron 3 Super for inference.

About Nemotron 3 Super

- Architecture: Mixture of Experts (MoE) with a hybrid Transformer-Mamba architecture
  - Highest throughput efficiency in its size category, and up to 5x higher throughput than the previous Nemotron Super model (Llama Nemotron Super 1.5)
  - Multi-Token Prediction (MTP): by predicting several future tokens in a single forward pass, MTP substantially accelerates long-form text generation
  - Supports Thinking Budget for optimal accuracy with minimal reasoning-token generation
- Accuracy: leading accuracy on the Artificial Analysis Intelligence Index in its size category
  - Up to 2x higher accuracy on the Artificial Analysis Intelligence Index than the previous Nemotron Super model
  - Latent MoE enables calling 4 experts for the inference cost of only one
- Model size: 120B total parameters, 12B active parameters
- Context length: up to 1M tokens
- Model I/O: text in, text out
- Supported GPUs: B200, H100, H200, DGX Spark, RTX 6000
- Get started:
  - Download model weights from Hugging Face (BF16, FP8, and NVFP4)
  - Run with SGLang for inference
  - Read the technical report to build custom, optimized models with Nemotron techniques
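To get a feel for what the model size and the three checkpoint formats mean for GPU memory, here is a rough back-of-the-envelope sketch. It assumes 2, 1, and 0.5 bytes per parameter for BF16, FP8, and NVFP4 respectively, and it counts weights only: quantization scales, KV cache, Mamba state, and activations are ignored, so real usage will be higher. The function and constant names are illustrative.

```python
# Rough weight-memory estimate for Nemotron 3 Super checkpoints.
# Counts weights only; KV cache, Mamba state, activations, and
# quantization scale overhead are NOT included (assumption).

TOTAL_PARAMS = 120e9   # from the post: 120B total parameters
ACTIVE_PARAMS = 12e9   # from the post: 12B active parameters per forward pass

# Assumed storage cost per parameter for each published format.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gb(num_params: float, precision: str) -> float:
    """Approximate weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def active_fraction() -> float:
    """Share of parameters used in a single forward pass."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


if __name__ == "__main__":
    print(f"Active fraction: {active_fraction():.0%}")
    for precision in BYTES_PER_PARAM:
        gb = weight_gb(TOTAL_PARAMS, precision)
        print(f"{precision}: ~{gb:.0f} GB of weights")
```

Under these assumptions the BF16 weights alone come to roughly 240 GB, which is why that checkpoint needs to be sharded across several GPUs, while the lower-precision formats shrink the footprint proportionally.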

Installation and Quick Start

For an easier setup with SGLang, refer to our getting started cookbook, which is also available as an NVIDIA Brev launchable.

Run the command below to install dependencies:

```bash
pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python'
```

We can then serve the model. The command below is configured for a 4xH200 setup; refer to the cookbooks for detailed instructions.

```bash
python3 -m sglang.launch_server \
  --model-path nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  --host 0.0.0.0 \
  --port 5000 \
  --trust-remote-code \
  --tp 4 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_3
```

Once the server is up and running, you can prompt the model using the code snippet below:

```python
from openai import OpenAI

# Must match the --model-path passed to sglang.launch_server
SERVED_MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"
BASE_URL = "http://localhost:5000/v1"
API_KEY = "EMPTY"  # placeholder; the server was launched without an API key

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

resp = client.chat.completions.create(
    model=SERVED_MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Give me 3 bullet points about SGLang."},
    ],
    temperature=0.6,
    max_tokens=512,
)

# The reasoning parser returns the reasoning trace separately from the answer
print(
    "Reasoning:", resp.choices[0].message.reasoning_content,
    "\nContent:", resp.choices[0].message.content,
)
```
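Because the server is launched with `--tool-call-parser qwen3_coder`, tool calls should come back in the OpenAI-compatible `tool_calls` field. Below is a minimal sketch of one tool round trip, assuming the same server address and model name as above. The `word_count` tool, `REGISTRY`, `run_tool_call`, and `ask_with_tools` are hypothetical names used only for illustration; they are not part of SGLang or Nemotron.

```python
import json
from typing import Optional


def word_count(text: str) -> int:
    """Hypothetical local tool: count whitespace-separated words."""
    return len(text.split())


# Tool schema in the OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "word_count",
        "description": "Count the words in a piece of text.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}]

# Maps tool names the model may emit to local Python callables.
REGISTRY = {"word_count": word_count}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call and return its result as a JSON string."""
    args = json.loads(arguments_json)
    return json.dumps({"result": REGISTRY[name](**args)})


def ask_with_tools(prompt: str) -> Optional[str]:
    """Send a prompt, run any requested tools, and return the final answer."""
    from openai import OpenAI  # imported here so the helpers above work without it

    client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")
    model = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"
    messages = [{"role": "user", "content": prompt}]

    first = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    msg = first.choices[0].message
    if not msg.tool_calls:
        return msg.content  # the model answered directly

    # Feed each tool result back so the model can produce a final answer.
    messages.append(msg)
    for call in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool_call(call.function.name, call.function.arguments),
        })
    final = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    return final.choices[0].message.content
```

Calling `ask_with_tools("How many words are in 'SGLang serves Nemotron'?")` against a running server would exercise the full loop. Keeping tool execution in a small registry makes it easy to add tools without touching the request logic.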

Nemotron 3 Super is ideal for multi-agent and reasoning workloads

Artificial Analysis chart showing Nemotron 3 Super leading on intelligence vs. efficiency when compared to popular open models of similar size

As the chart above shows, the model achieves leading accuracy with higher efficiency on Artificial Analysis benchmarks, making it a strong choice for multi-agent systems that need both efficiency and capability.

The 1M-token context is built for long-horizon agent work: agents can keep full conversation history and plan state in context, and RAG pipelines can supply large document sets in one shot. That reduces fragmentation and goal drift in multi-step workflows.

Together, this makes Super a strong choice for orchestrating and running many agents on a single node—from code generation and debugging to research summarization, alert triage, and document analysis.
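As a rough illustration of running several agents against one node, the sketch below fans a single task out to a few role-specific agents concurrently and collects their replies. It assumes the server from the quick start is listening on port 5000 and that it batches concurrent requests on its own. The agent roles and the `fan_out`, `build_messages`, and `send_to_sglang` helpers are hypothetical and exist only for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Message = Dict[str, str]

# Hypothetical agent roles; each one is just a different system prompt.
AGENT_ROLES = {
    "planner": "You break a task into short, ordered steps.",
    "coder": "You write concise, working code for the task.",
    "reviewer": "You list risks and edge cases for the task.",
}


def build_messages(role_prompt: str, task: str) -> List[Message]:
    """Build a chat request for one agent."""
    return [
        {"role": "system", "content": role_prompt},
        {"role": "user", "content": task},
    ]


def fan_out(
    task: str,
    send: Callable[[List[Message]], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Send the task to every agent concurrently and collect replies by role."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(send, build_messages(prompt, task))
            for name, prompt in AGENT_ROLES.items()
        }
        return {name: fut.result() for name, fut in futures.items()}


def send_to_sglang(messages: List[Message]) -> str:
    """Send one request to the local SGLang server (assumed at port 5000)."""
    from openai import OpenAI  # imported here so fan_out can be tested offline

    client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
        messages=messages,
        temperature=0.6,
        max_tokens=512,
    )
    return resp.choices[0].message.content or ""


if __name__ == "__main__":
    replies = fan_out("Add retry logic to an HTTP client.", send_to_sglang)
    for name, text in replies.items():
        print(f"--- {name} ---\n{text}\n")
```

Passing the sender in as a function keeps the orchestration logic independent of the client, so it can be tested with a stub or pointed at a different endpoint.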

Get Started

Nemotron 3 Super helps you build scalable, cost-efficient multi-agent AI with high accuracy. With open weights, datasets, and recipes, you get full transparency and the flexibility to fine-tune and deploy on your own infrastructure, from workstation to cloud.

Ready to run multi-agent AI at scale?

- Download Nemotron 3 Super model weights from Hugging Face (BF16, FP8, and NVFP4)
- Run with SGLang for inference using the cookbook or the Brev launchable
- Read the Nemotron 3 Super technical report

Acknowledgement

Thanks to everyone who contributed to bringing Nemotron 3 Super to SGLang.

NVIDIA: Nirmal Kumar Juluru, Anusha Pant, Max Xu, Daniel Afrimi, Shahar Mor, Roi Koren, Ann Guan, and many more

SGLang team and community: Baizhou Zhang, Jiajun Li, Ke Bao, Lingyan Hao, Mingyi Lu

This article is from the LMSYS blog, translated in full by Winzheng (winzheng.com). When republishing the translation, please credit the source.
