Nemotron 3 Nano 30B: lightweight, deep reasoning for high-throughput agents
Hybrid Mamba-2 + Transformer with sparse MoE and a 1M-token context window, refined by SFT + RLVR + RLHF. At 31.6B total parameters with only ~3.6B active per token, it delivers up to 4× the inference efficiency of the previous generation.
Context
1M tokens
Throughput uplift
≈3.3×
vs Qwen3-30B @H200
Reasoning control
ON/OFF
Configurable thinking budget
License
NVIDIA OML
Hybrid Mamba-Transformer
MoE · 6/128 expert routing
Context training
512k CPT + 4k mix
Serving stack
vLLM / SGLang
Control
Configurable thinking budget
Endpoints
OpenRouter · build.nvidia.com
TL;DR
Highlights
Architecture
Mamba-2 + Transformer + sparse MoE
Hybrid sequence modeling with grouped-query attention (GQA); the MoE layer activates 6 of 128 experts per token for long context and precise reasoning.
Efficiency
High throughput, low latency
On a single H200 (8K in → 16K out), roughly 3.3× the throughput of Qwen3-30B and 2.2× of GPT-OSS-20B, with no slowdown under agent fan-out.
Reasoning control
Toggle + budget
Supports Reasoning ON/OFF with a configurable thinking-token budget for predictable cost (request sketch after these highlights).
Context
1,000,000 tokens
512k continued pretraining (CPT) plus a 4k short-context mix extends the window to 1M for long-horizon workflows, retrieval, and persistent memory.
Openness
Fully open stack
Open weights, data, recipes, and code under NVIDIA Open Model License for integration and reproduction.
Use cases
Reasoning · Tools · Agents
Strong accuracy across math, code, tool use, and multi-step agent tasks; built for workloads with frequent, high-volume calls.
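To make the reasoning toggle and thinking budget concrete, here is a minimal request sketch against an OpenAI-compatible endpoint (a local vLLM/SGLang server or a hosted endpoint). The model id and the `reasoning` / `max_thinking_tokens` fields passed via `extra_body` are hypothetical placeholders; see the API deep dive below for the exact parameter names.

```python
# Minimal sketch: toggle reasoning and cap the thinking budget per request.
# Assumes an OpenAI-compatible endpoint; the model id and the extra_body
# field names ("reasoning", "max_thinking_tokens") are placeholders, not
# the confirmed parameter names.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask(prompt: str, reasoning: bool, thinking_budget: int | None = None) -> str:
    response = client.chat.completions.create(
        model="nemotron-3-nano-30b",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "reasoning": "on" if reasoning else "off",  # hypothetical toggle
            "max_thinking_tokens": thinking_budget,     # hypothetical budget cap
        },
    )
    return response.choices[0].message.content

# Deep reasoning with a bounded thinking budget for predictable cost.
print(ask("Plan a three-step refactor of a flaky ETL pipeline.", True, thinking_budget=2048))
# Reasoning off for short, low-latency chat turns.
print(ask("Summarize this ticket in one sentence.", False))
```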
Performance & architecture
Big-model reasoning with smaller active params
31.6B total params with ~3.6B active per token; MoE keeps inference light while boosting reasoning (routing sketch below).
Attention
GQA + thinking budget
MoE routing
6 / 128 experts
Context
1M window
Modes
Reasoning ON / OFF
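To see why only a small fraction of the 31.6B parameters is active per token, here is a toy sketch of top-6-of-128 expert routing in plain NumPy. Sizes and gating are simplified placeholders for illustration, not the model's actual implementation.

```python
# Toy sparse-MoE routing: only 6 of 128 experts run for each token, so
# active parameters stay a small fraction of total parameters.
# Sizes and gating are simplified placeholders, not the real model config.
import numpy as np

NUM_EXPERTS, TOP_K, HIDDEN = 128, 6, 64
rng = np.random.default_rng(0)

router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02
experts = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.02 for _ in range(NUM_EXPERTS)]

def moe_forward(token: np.ndarray) -> np.ndarray:
    logits = token @ router_w                 # router scores for all 128 experts
    top_idx = np.argsort(logits)[-TOP_K:]     # keep the 6 highest-scoring experts
    weights = np.exp(logits[top_idx])
    weights /= weights.sum()                  # softmax over the selected experts only
    # Only the chosen experts are evaluated; the other 122 stay idle this token.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top_idx))

print(moe_forward(rng.standard_normal(HIDDEN)).shape)          # (64,)
print(f"experts active per token: {TOP_K / NUM_EXPERTS:.1%}")  # ~4.7%
```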
Core scenarios
Multi-agent / high concurrency
Low active-parameter count and high throughput cut the cost of complex delegation and collaborative agents.
Long-chain reasoning & tools
Reasoning ON preserves chain-of-thought; OFF keeps chats concise. Budgets prevent token runaway.
Retrieval & memory
1M context for multi-doc, multi-hop aggregation across RAG, legal, ops, and research workloads.
Data & Training
Open pipeline from 25T pretrain to large-scale RL
Pre-training, long-context extension, SFT, RLVR, and RLHF are open for reproduction and customization.
Pre-training
25T tokens (including 2.5T of new Common Crawl data); broad coverage first, then high-quality refinement.
New open data
Extra 3T tokens
Synthetic data with denser coverage of code, math, and reasoning.
Long-context extension
512k CPT + 4k mix preserves short-text quality while unlocking a 1M window with multi-hop synthetic sets.
Signals
Multi-doc · long memory
Prevents long-range degradation while preserving short-context scores.
Post-training
SFT + RLVR + RLHF across math, code, tool use, and structured output; GenRM improves dialogue quality (GRPO sketch below).
SFT data
13M samples
RL environments
10+ · 900k tasks
Safety traces
~11k agent traces
Reward model
GenRM (GRPO)
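For readers new to GRPO, the sketch below shows the core idea behind the group-relative advantages used with a reward model such as GenRM: sample several responses per prompt, score them, and normalize each reward against its group's statistics. It is a simplified illustration, not the exact Nemotron post-training recipe.

```python
# Core idea behind GRPO-style RL with a reward model (e.g. GenRM):
# score a group of sampled responses per prompt, then use group-relative
# normalized rewards as advantages. Simplified illustration only.
import numpy as np

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)   # z-score within the sampled group

# Four sampled responses to one prompt, scored by the reward model.
print(np.round(group_relative_advantages([0.2, 0.9, 0.4, 0.7]), 3))
# Above-average responses get positive advantages, below-average negative;
# no separate value network is needed, which keeps large-scale RL cheaper.
```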
NeMo Gym
An open gym built for RL at scale
Addresses multi-step rollout orchestration, brittle tool hooks, and slow data collection with unified environments, data, and code.
Problems solved
Multi-step rollouts are hard → synchronous GRPO pipelines and cross-environment scheduling.
Tool integrations are brittle → standardized interfaces between tools and training loops (interface sketch below).
Quality envs are closed → 10+ open environments and 900k tasks (math, code, calendaring, etc.).
Developer wins
Ready-to-use RL envs to reproduce Nemotron 3 RLVR and RLHF recipes quickly.
Open traces and safety data to surface risks and drift before production.
Compatible with vLLM / SGLang for smooth train-to-serve delivery.
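To illustrate what a standardized environment contract between tools and training loops can look like, here is a hypothetical Python interface sketch. The class and method names are illustrative only and are not NeMo Gym's actual API.

```python
# Hypothetical sketch of a standardized agent-environment contract for
# multi-step rollouts. Names are illustrative, NOT NeMo Gym's real API.
from dataclasses import dataclass, field
from typing import Callable, Protocol

@dataclass
class Step:
    observation: str
    reward: float
    done: bool
    info: dict = field(default_factory=dict)

class AgentEnv(Protocol):
    def reset(self, task_id: str) -> str: ...   # initial observation for a task
    def step(self, action: str) -> Step: ...    # run one tool call or message

def collect_rollout(env: AgentEnv, policy: Callable[[str], str],
                    task_id: str, max_turns: int = 16) -> list[Step]:
    """Run one multi-turn episode and return the trajectory for RLVR/RLHF."""
    obs, trajectory = env.reset(task_id), []
    for _ in range(max_turns):
        step = env.step(policy(obs))
        trajectory.append(step)
        if step.done:
            break
        obs = step.observation
    return trajectory
```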
Deep dives
Browse core topics
Specs
31.6B total params, ~3.6B active per token, 6/128 experts, 1M context
How to run
vLLM / SGLang commands, hardware, concurrency tuning (serving sketch below)
API
REST / OpenRouter / SGLang examples and thinking budgets
License
NVIDIA OML commercial terms and attribution
Benchmarks
H200 throughput, active params, tuning tips
VS Qwen3
Throughput, context, reasoning control, migration tips
Reasoning / Budgets
ON/OFF strategies, budgets, safety practices
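As a starting point for the "How to run" deep dive, here is a minimal vLLM offline-inference sketch. The Hugging Face model id is a placeholder, and the parallelism and context-length settings should be tuned to your hardware per the guides linked above.

```python
# Minimal vLLM offline-inference sketch for local experimentation.
# The model id is a placeholder; tune tensor_parallel_size and
# max_model_len to your hardware (a 1M window needs far more KV-cache).
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/nemotron-3-nano-30b",   # placeholder Hugging Face id
    tensor_parallel_size=1,               # raise for multi-GPU serving
    max_model_len=32768,                  # start small; extend as memory allows
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Write a Python function that merges two sorted lists."], params)
print(outputs[0].outputs[0].text)
```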
Open resources
Models, data, and endpoints
Model weights
Nemotron 3 Nano 30B A3B
BF16 weights, MoE architecture, ready for vLLM / SGLang serving.
Hugging Face →
Base & variants
Nemotron 3 8B Base 4k
Great for on-device or low-latency apps; extend with the same recipes.
View model →
Deep dive
Architecture & training
Read the NVIDIA blog for design tradeoffs, RL infra, and data pipeline details.
Read blog →