How to Run Nemotron 3: Local and Cloud Deployment

Start Nemotron 3 with vLLM/SGLang: commands, hardware tips, and concurrency tuning for local or cloud deployments.


Quick start (vLLM)

  • Hardware: H100/H200, 80GB+ VRAM recommended.
  • Download: `huggingface-cli download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`
  • Serve: `python -m vllm.entrypoints.api_server --model <path> --max-model-len 1024000 --enforce-eager` (for an OpenAI-compatible endpoint, use `vllm serve <path> --max-model-len 1024000` instead).
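
Once the server reports it is ready, a quick request confirms the model loads and responds. The sketch below is a minimal smoke test assuming the demo `api_server` defaults (port 8000, a `/generate` endpoint); the prompt and sampling values are placeholders, and the exact response schema can vary across vLLM versions.

```python
# Minimal smoke test for the vLLM demo server started above.
# Assumes the default port 8000 and the /generate endpoint of
# vllm.entrypoints.api_server; adjust if you changed host/port or
# are serving the OpenAI-compatible API instead.
import requests

payload = {
    "prompt": "Summarize Nemotron 3 Nano in one sentence.",
    "max_tokens": 128,
    "temperature": 0.2,
}
resp = requests.post("http://localhost:8000/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())  # schema varies by vLLM version; the generated text is inside
```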

SGLang serving

  • `python -m sglang.launch_server --model-path <path> --context-length 1024000 --tp 2`
  • Use tensor parallelism (`--tp`) across multiple GPUs and SGLang's KV-cache reuse (RadixAttention) for throughput.
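
With the server up, clients can talk to SGLang through its OpenAI-compatible endpoint. A minimal sketch, assuming the default port 30000 and the `openai` Python package; the model name is just a label for the single loaded model, and the prompt is illustrative.

```python
# Query the SGLang server through its OpenAI-compatible API.
# Assumes the default port 30000; adjust base_url for your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")
chat = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    messages=[{"role": "user", "content": "Name three uses for a 1M-token context."}],
    max_tokens=256,
)
print(chat.choices[0].message.content)
```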

High concurrency tips

  • Default Reasoning OFF for lightweight chats; toggle ON per task.
  • Tune `--max-num-batched-tokens` to keep latency predictable at 1M context.
  • Profile latency before scaling out; trim prompts and thinking budgets first (see the sketch below).
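
Before raising concurrency limits, measure latency under a realistic parallel load. The sketch below fires a batch of identical requests at an OpenAI-compatible endpoint and reports p50/p95 wall-clock latency; the URL, model name, prompt, and concurrency level are assumptions to adapt to your deployment.

```python
# Rough concurrency probe: send CONCURRENCY identical requests in parallel
# and report p50/p95 latency. Endpoint, model name, and prompt are placeholders.
import asyncio
import statistics
import time

import httpx

URL = "http://localhost:8000/v1/chat/completions"  # your vLLM/SGLang endpoint
MODEL = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
CONCURRENCY = 16

async def one_request(client: httpx.AsyncClient) -> float:
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Reply with a short greeting."}],
        "max_tokens": 64,
    }
    start = time.perf_counter()
    resp = await client.post(URL, json=body, timeout=300)
    resp.raise_for_status()
    return time.perf_counter() - start

async def main() -> None:
    async with httpx.AsyncClient() as client:
        latencies = sorted(await asyncio.gather(
            *(one_request(client) for _ in range(CONCURRENCY))))
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"p50={p50:.2f}s  p95={p95:.2f}s  n={len(latencies)}")

if __name__ == "__main__":
    asyncio.run(main())
```

Run it while watching GPU utilization; if p95 climbs much faster than p50 as you raise CONCURRENCY, lower `--max-num-batched-tokens` or shorten prompts before adding hardware.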

Troubleshooting

  • Out of memory (OOM): lower `--max-model-len` or use fp8/AWQ quantization.
  • Low throughput: check GPU utilization and storage bandwidth.
  • Truncated responses: verify `max_tokens` and the thinking budget settings.
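
For the OOM case, a quantized offline load is a quick way to check whether the model fits before re-launching the server. A sketch using vLLM's Python API with on-the-fly fp8 quantization and a shorter context window; whether fp8 is supported for this checkpoint on your GPU is an assumption to verify, and the values shown are illustrative.

```python
# Offline sanity check with a reduced memory footprint: load the model via
# vLLM's Python API with fp8 quantization and a smaller context window.
# fp8 support for this checkpoint/GPU is assumed -- validate output quality.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    quantization="fp8",      # on-the-fly weight quantization to cut VRAM
    max_model_len=131072,    # shrink the window if the 1M setting does not fit
    enforce_eager=True,
)
out = llm.generate(["Hello, Nemotron."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```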

FAQ

Minimum hardware?

Ideally an H100/H200 with 80 GB VRAM; multi-GPU tensor parallelism also works.

Is quantization supported?

fp8/AWQ quantization can reduce VRAM usage; validate output quality for your workload.

How to set max context?

Pass `--max-model-len 1024000` to vLLM (or `--context-length 1024000` to SGLang), provided there is sufficient VRAM.