
Nemotron

Next-gen open intelligent models

Nano · Super · Ultra

Nemotron 3 Nano 30B: lightweight, deep reasoning for high-throughput agents

Hybrid Mamba‑2 + Transformer with sparse MoE and a 1M-token context window, refined with SFT + RLVR + RLHF. 31.6B total parameters with ~3.6B active per token deliver up to 4× the inference efficiency of the previous generation.

Context

1M tokens

Throughput uplift

≈3.3×

vs Qwen3-30B @H200

Reasoning control

ON/OFF

Configurable thinking budget

License

NVIDIA Open Model License

Hybrid Mamba-Transformer

MoE · 6/128 expert routing

Agent Ready
Active params: ~3.6B / token
Throughput (8K→16K): 3.3× Qwen3-30B
Built for: Long reasoning · Tools · Multi-agents

Context training

512k CPT + 4k mix

Serving stack

vLLM / SGLang

Control

Configurable thinking budget

Endpoints

OpenRouter · build.nvidia.com

TL;DR

Highlights

Architecture

Mamba-2 + Transformer + sparse MoE

Hybrid sequence modeling with GQA attention; the MoE layer activates 6 of 128 experts per token for long-context, precise reasoning.
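
As a rough illustration of what 6-of-128 routing means in practice, here is a minimal, self-contained sketch of top-k expert selection (illustrative only, not the model's actual router; the expert count and top-k come from the spec above, while the hidden size and random router weights are made up):

  import numpy as np

  NUM_EXPERTS = 128   # total experts in the MoE layer (from the spec above)
  TOP_K = 6           # experts activated per token (the "6/128" routing)

  def route(token_hidden, router_weights):
      """Pick the top-k experts for one token and return normalized mixing weights."""
      logits = token_hidden @ router_weights            # shape: (NUM_EXPERTS,)
      top_idx = np.argsort(logits)[-TOP_K:][::-1]       # indices of the 6 highest-scoring experts
      top_logits = logits[top_idx]
      weights = np.exp(top_logits - top_logits.max())   # softmax over the selected experts only
      return top_idx, weights / weights.sum()

  # Toy demo: a 64-dim hidden state routed across 128 experts.
  rng = np.random.default_rng(0)
  hidden = rng.standard_normal(64)
  router = rng.standard_normal((64, NUM_EXPERTS))
  experts, mix = route(hidden, router)
  print(experts, mix.round(3))  # only 6 of the 128 experts run for this token

Only the selected experts' feed-forward blocks execute, which is how ~3.6B of the 31.6B parameters stay active per token.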

Efficiency

High throughput, low latency

On a single H200 (8K→16K), roughly 3.3× the throughput of Qwen3‑30B and 2.2× that of GPT‑OSS‑20B, with no slowdown under agent fan-out.

Reasoning control

Toggle + budget

Supports Reasoning ON/OFF with a configurable thinking-token budget for predictable cost.
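
For a sense of how the toggle and budget look from the client side, here is a hedged sketch against an OpenAI-compatible endpoint; the base URL, the model id, the enable_thinking chat-template flag, and using max_tokens as a crude cap are assumptions rather than the documented interface, so check the model card for the exact controls.

  from openai import OpenAI

  # Any OpenAI-compatible server (for example a local vLLM instance) works here.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
  MODEL = "nemotron-3-nano-30b"  # placeholder model id

  def ask(question, reasoning, budget):
      """Call the model with reasoning toggled and a rough output-token budget."""
      resp = client.chat.completions.create(
          model=MODEL,
          messages=[{"role": "user", "content": question}],
          max_tokens=budget,  # crude cap; the real thinking budget may be a separate knob
          extra_body={"chat_template_kwargs": {"enable_thinking": reasoning}},  # assumed flag
      )
      return resp.choices[0].message.content

  print(ask("Plan a three-step rollout for a new feature.", reasoning=True, budget=2048))
  print(ask("Say hi in one sentence.", reasoning=False, budget=64))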

Context

1,000,000 tokens

512k continued pretraining (CPT) plus a 4k mix extends the context for long-horizon workflows, retrieval, and persistent memory.

Openness

Fully open stack

Open weights, data, recipes, and code under NVIDIA Open Model License for integration and reproduction.

Use cases

Reasoning · Tools · Agents

Strong accuracy across math, code, tool use, and multi-step agent tasks; built for high-frequency calls.

Performance & architecture

Big-model reasoning with smaller active params

31.6B total params with ~3.6B active per token; MoE keeps inference light while boosting reasoning.

Active share: ~11% (≈3.6B of 31.6B)
Throughput vs Nano 2: ≈4×

Attention

GQA + thinking budget

MoE routing

6 / 128 experts

Context

1M window

Modes

Reasoning ON / OFF

Core scenarios

  • Multi-agent / high concurrency

    Light activation and high throughput cut costs for complex delegation and collaborative agents (see the fan-out sketch after this list).

  • Long-chain reasoning & tools

    Reasoning ON preserves chain-of-thought; OFF keeps chats concise. Budgets prevent token runaway.

  • Retrieval & memory

    1M context for multi-doc, multi-hop aggregation across RAG, legal, ops, and research workloads.
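
For the multi-agent / high-concurrency scenario, here is a minimal fan-out sketch using the async OpenAI client; the endpoint and model id are placeholders, and the point is simply that a light-activation model tolerates many parallel sub-agent calls.

  import asyncio
  from openai import AsyncOpenAI

  client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
  MODEL = "nemotron-3-nano-30b"  # placeholder model id

  async def sub_agent(task):
      """One delegated sub-task; many of these run concurrently."""
      resp = await client.chat.completions.create(
          model=MODEL,
          messages=[{"role": "user", "content": task}],
          max_tokens=256,
      )
      return resp.choices[0].message.content

  async def main():
      tasks = [
          "Summarize the incident timeline.",
          "Draft a rollback checklist.",
          "List affected services from the log excerpt.",
      ]
      # Fan out: every sub-agent call is in flight at once.
      answers = await asyncio.gather(*(sub_agent(t) for t in tasks))
      for task, answer in zip(tasks, answers):
          print(f"- {task}\n  {answer}\n")

  asyncio.run(main())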

Data & Training

An open pipeline from 25T-token pretraining to large-scale RL

Pre-training, long-context extension, SFT, RLVR, and RLHF are open for reproduction and customization.

Pre-training

25T tokens (including 2.5T of new Common Crawl data); broad coverage first, high-quality refinement second.

New open data

Extra 3T tokens

Denser synthetic data for code, math, and reasoning.

Long-context extension

512k CPT plus a 4k mix preserves short-text quality while unlocking the 1M window, aided by multi-hop synthetic sets.

Signals

Multi-doc · long memory

Prevents long-range decay while keeping short-context scores.

Post-training

SFT + RLVR + RLHF across math, code, tools, structured output; GenRM improves dialogue quality.

SFT data

13M samples

RL environments

10+ · 900k tasks

Safety traces

~11k agent traces

Reward model

GenRM (GRPO)
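
GenRM scores feed GRPO-style updates; as background, here is a tiny sketch of the group-relative advantage computation that gives GRPO its name (a generic illustration, not NVIDIA's training code):

  import numpy as np

  def group_relative_advantages(rewards, eps=1e-6):
      """GRPO-style advantages: each sampled response is scored relative to its own group.

      rewards has shape (num_prompts, group_size): several responses per prompt,
      each scored by a reward model (e.g. a GenRM) or a verifiable checker (RLVR).
      """
      mean = rewards.mean(axis=1, keepdims=True)
      std = rewards.std(axis=1, keepdims=True)
      return (rewards - mean) / (std + eps)   # no learned value function needed

  # Toy example: 2 prompts, 4 sampled responses each.
  rewards = np.array([[0.0, 1.0, 1.0, 0.0],
                      [0.2, 0.9, 0.4, 0.7]])
  print(group_relative_advantages(rewards).round(2))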

NeMo Gym

An open gym built for RL at scale

Tackles hard multi-step rollout orchestration, brittle tool hooks, and slow data collection with unified environments, data, and code.

Problems solved

  • Multi-step rollouts are hard → synchronous GRPO pipelines and cross-environment scheduling.

  • Tool integrations are brittle → standardized interfaces between tools and training loops (sketched after this list).

  • Quality envs are closed → 10+ open environments and 900k tasks (math, code, calendaring, etc.).
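
To make "standardized interfaces between tools and training loops" concrete, here is a hypothetical Python contract in the spirit of such a gym; names like AgentEnv, reset, and step are illustrative, not the actual NeMo Gym API.

  from dataclasses import dataclass
  from typing import Callable, Protocol

  @dataclass
  class StepResult:
      observation: str   # what the agent sees next (tool output, env state, ...)
      reward: float      # verifiable or model-based reward signal
      done: bool         # whether the rollout is finished

  class AgentEnv(Protocol):
      """Hypothetical contract an RL environment exposes to the training loop."""
      def reset(self, task_id: str) -> str: ...       # returns the initial prompt
      def step(self, action: str) -> StepResult: ...  # action = model output / tool call

  def rollout(env: AgentEnv, policy: Callable[[str], str], task_id: str, max_steps: int = 8) -> float:
      """Generic multi-step rollout against any environment that follows the contract."""
      obs, total = env.reset(task_id), 0.0
      for _ in range(max_steps):
          result = env.step(policy(obs))
          total, obs = total + result.reward, result.observation
          if result.done:
              break
      return total

A single rollout function of this shape is what lets one training pipeline schedule work across many environments.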

Developer wins

  • Ready-to-use RL envs to reproduce Nemotron 3 RLVR and RLHF recipes quickly.

  • Open traces and safety data to surface risks and drift before production.

  • Compatible with vLLM / SGLang for smooth train-to-serve delivery.

Deep dives

Browse core topics

Open resources

Models, data, and endpoints

Continuously updated

Model weights

Nemotron 3 Nano 30B A3B

BF16 weights, MoE architecture, ready for vLLM / SGLang serving (loading sketch below).

Hugging Face →
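
A minimal offline-inference sketch with vLLM; the Hugging Face repo id below is a placeholder, and whether a given vLLM release already supports this hybrid Mamba + MoE checkpoint is something to confirm on the model card.

  from vllm import LLM, SamplingParams

  # Placeholder repo id; substitute the actual Nemotron 3 Nano 30B A3B checkpoint name.
  llm = LLM(model="nvidia/Nemotron-3-Nano-30B-A3B", dtype="bfloat16")

  params = SamplingParams(temperature=0.6, max_tokens=512)
  outputs = llm.generate(["Explain GQA attention in two sentences."], params)
  print(outputs[0].outputs[0].text)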

Base & variants

Nemotron 3 8B Base 4k

Great for on-device or low-latency apps; extend with the same recipes.

View model →

Deep dive

Architecture & training

Read the NVIDIA blog for design tradeoffs, RL infra, and data pipeline details.

Read blog →