How to Run Nemotron 3: Local and Cloud Deployment

Start Nemotron 3 with vLLM/SGLang: commands, hardware tips, and concurrency tuning for local or cloud deployments.


Quick start (vLLM)

  • Hardware: H100/H200, 80GB+ VRAM recommended.
  • Download: `huggingface-cli download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`
  • Serve: `python -m vllm.entrypoints.api_server --model <path> --max-model-len 1024000 --enforce-eager` (for an OpenAI-compatible endpoint, use `vllm serve <path> --max-model-len 1024000` instead).
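
Once the server reports it is ready, a quick request confirms the model loads and responds. The sketch below is a minimal smoke test assuming the demo `api_server` defaults (port 8000, a `/generate` endpoint); the prompt and sampling values are placeholders, and the exact response schema can vary across vLLM versions.

```python
# Minimal smoke test for the vLLM demo server started above.
# Assumes the default port 8000 and the /generate endpoint of
# vllm.entrypoints.api_server; adjust if you changed host/port or
# are serving the OpenAI-compatible API instead.
import requests

payload = {
    "prompt": "Summarize Nemotron 3 Nano in one sentence.",
    "max_tokens": 128,
    "temperature": 0.2,
}
resp = requests.post("http://localhost:8000/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())  # schema varies by vLLM version; the generated text is inside
```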

SGLang serving

  • `python -m sglang.launch_server --model-path <path> --context-length 1024000 --tp 2`
  • Use tensor parallelism (`--tp`) across multiple GPUs and SGLang's KV-cache reuse (RadixAttention) for throughput.
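
With the server up, clients can talk to SGLang through its OpenAI-compatible endpoint. A minimal sketch, assuming the default port 30000 and the `openai` Python package; the model name is just a label for the single loaded model, and the prompt is illustrative.

```python
# Query the SGLang server through its OpenAI-compatible API.
# Assumes the default port 30000; adjust base_url for your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")
chat = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    messages=[{"role": "user", "content": "Name three uses for a 1M-token context."}],
    max_tokens=256,
)
print(chat.choices[0].message.content)
```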

High concurrency tips

  • Default Reasoning OFF for lightweight chats; toggle ON per task.
  • Tune `--max-num-batched-tokens` to keep latency predictable at 1M context.
  • Profile latency before scaling out; trim prompts and thinking budgets first (see the sketch below).
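
Before raising concurrency limits, measure latency under a realistic parallel load. The sketch below fires a batch of identical requests at an OpenAI-compatible endpoint and reports p50/p95 wall-clock latency; the URL, model name, prompt, and concurrency level are assumptions to adapt to your deployment.

```python
# Rough concurrency probe: send CONCURRENCY identical requests in parallel
# and report p50/p95 latency. Endpoint, model name, and prompt are placeholders.
import asyncio
import statistics
import time

import httpx

URL = "http://localhost:8000/v1/chat/completions"  # your vLLM/SGLang endpoint
MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
CONCURRENCY = 16

async def one_request(client: httpx.AsyncClient) -> float:
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Reply with a short greeting."}],
        "max_tokens": 64,
    }
    start = time.perf_counter()
    resp = await client.post(URL, json=body, timeout=300)
    resp.raise_for_status()
    return time.perf_counter() - start

async def main() -> None:
    async with httpx.AsyncClient() as client:
        latencies = sorted(await asyncio.gather(
            *(one_request(client) for _ in range(CONCURRENCY))))
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"p50={p50:.2f}s  p95={p95:.2f}s  n={len(latencies)}")

if __name__ == "__main__":
    asyncio.run(main())
```

Run it while watching GPU utilization; if p95 climbs much faster than p50 as you raise CONCURRENCY, lower `--max-num-batched-tokens` or shorten prompts before adding hardware.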

Troubleshooting

  • Out of memory (OOM): lower `--max-model-len` or use fp8/AWQ quantization.
  • Low throughput: check GPU utilization and storage bandwidth.
  • Truncated responses: verify `max_tokens` and the thinking budget settings.
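
For the OOM case, a quantized offline load is a quick way to check whether the model fits before re-launching the server. A sketch using vLLM's Python API with on-the-fly fp8 quantization and a shorter context window; whether fp8 is supported for this checkpoint on your GPU is an assumption to verify, and the values shown are illustrative.

```python
# Offline sanity check with a reduced memory footprint: load the model via
# vLLM's Python API with fp8 quantization and a smaller context window.
# fp8 support for this checkpoint/GPU is assumed -- validate output quality.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    quantization="fp8",      # on-the-fly weight quantization to cut VRAM
    max_model_len=131072,    # shrink the window if the 1M setting does not fit
    enforce_eager=True,
)
out = llm.generate(["Hello, Nemotron."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```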

FAQ

Minimum hardware?

Ideally an H100/H200 with 80 GB VRAM; multi-GPU tensor parallelism also works.

Is quantization supported?

fp8/AWQ quantization can reduce VRAM usage; validate output quality for your workload.

How to set max context?

Pass `--max-model-len 1024000` to vLLM (or `--context-length 1024000` to SGLang), provided there is sufficient VRAM.