Quick start (vLLM)
- Hardware: H100/H200 with 80 GB+ VRAM recommended.
- Download: `huggingface-cli download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`
- Serve: `vllm serve <path> --max-model-len 1024000 --enforce-eager` (starts the OpenAI-compatible server, default port 8000)
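Once the server is up, a minimal smoke test against the OpenAI-compatible endpoint, assuming the default port 8000; the `model` field must match the name or path you passed to `vllm serve`:

```bash
# Send one chat completion request to the local server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<path>", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'
```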
Run guide
Start Nemotron 3 with vLLM/SGLang: commands, hardware tips, and concurrency tuning for local or cloud deployments.
- Hardware: ideally a single H100/H200 with 80 GB of VRAM; multi-GPU tensor parallelism also works (see the launch sketch after this list).
- Quantization: FP8 or AWQ variants can reduce VRAM use, but validate output quality on your own workload (example below).
- Context length: pass `--max-model-len 1024000` to vLLM, or the equivalent `--context-length` flag to SGLang, when you have enough VRAM for the KV cache (SGLang sketch below).
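A multi-GPU launch sketch for the tensor-parallel case above, assuming two GPUs; `<path>` is a placeholder for the downloaded checkpoint:

```bash
# Shard the model across two GPUs with tensor parallelism.
# --enforce-eager skips CUDA graph capture, trading some throughput for lower memory.
vllm serve <path> \
  --tensor-parallel-size 2 \
  --max-model-len 1024000 \
  --enforce-eager
```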
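If VRAM is tight, a quantized-serving sketch; `<fp8-checkpoint-path>` is a placeholder, and vLLM usually auto-detects the quantization method from the checkpoint config, so `--quantization fp8` is only needed to force it:

```bash
# Serve a pre-quantized FP8 checkpoint (placeholder path).
vllm serve <fp8-checkpoint-path> --quantization fp8 --max-model-len 1024000
```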
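And the SGLang equivalent, assuming a standard SGLang install; note that SGLang names the context flag `--context-length` rather than `--max-model-len`:

```bash
# SGLang server (default port 30000) with the long context window enabled.
python -m sglang.launch_server \
  --model-path <path> \
  --context-length 1024000
```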