Nemotron 3 Benchmarks: Throughput, Latency, and Context

On H200, Nemotron 3 Nano 30B delivers roughly 3.3× the throughput of Qwen3-30B at 8K-input / 16K-output sequence lengths (8K→16K), with only ~3.6B active parameters and a 1M-token context window.

Tags: nemotron 3 benchmarks, nemotron throughput, qwen3 comparison, h200 inference, nemotron latency

Key metrics

  • ~3.3× Qwen3-30B throughput on H200 (8K→16K)
  • ~3.6B active params (~11% active share)
  • 1M context with stable long-context performance

Test setup

  • Hardware: NVIDIA H200, FP16/BF16 precision, GQA and KV caching enabled
  • Software: vLLM or SGLang with batching tuned for the target concurrency (a minimal engine setup is sketched below)
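The sketch below shows one way to stand up such a configuration with vLLM's offline Python API. The model ID, max_model_len, and memory settings are illustrative assumptions, not official values; adjust them to your checkpoint and hardware.

```python
# Minimal vLLM engine setup sketch for the configuration above.
# Model ID and sizing values are placeholders (assumptions), not official settings.
from vllm import LLM

llm = LLM(
    model="nvidia/nemotron-3-nano-30b",  # placeholder model ID
    dtype="bfloat16",                    # BF16, as in the test setup
    max_model_len=32768,                 # covers an 8K-input / 16K-output profile; raise toward 1M only if KV-cache memory allows
    gpu_memory_utilization=0.90,         # leave headroom for activations
    tensor_parallel_size=1,              # single H200 in this sketch
)
```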

Best practices

  • Use Reasoning OFF for short chats to avoid spending thinking tokens
  • Set a thinking-token budget for long reasoning chains to prevent runaway generation
  • Tune batch size and parallelism, and watch p99 latency before scaling out (see the latency probe sketched below)
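As referenced above, a simple client-side probe can track p99 latency against an OpenAI-compatible vLLM/SGLang endpoint before scaling out. The endpoint URL, model name, request count, and prompt below are assumptions for illustration.

```python
# Client-side p99 latency probe against an OpenAI-compatible endpoint (sketch).
# URL, model ID, request count, and prompt are illustrative assumptions.
import statistics
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
latencies = []

for _ in range(50):  # small probe run
    t0 = time.perf_counter()
    client.chat.completions.create(
        model="nvidia/nemotron-3-nano-30b",  # placeholder model ID
        messages=[{"role": "user", "content": "Give a one-sentence status summary."}],
        max_tokens=128,  # cap output so short chats cannot run away
    )
    latencies.append(time.perf_counter() - t0)

p99 = statistics.quantiles(latencies, n=100)[98]  # 99th-percentile cut point
print(f"p99 latency: {p99:.2f}s over {len(latencies)} requests")
```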

FAQ

Why is throughput higher?

Sparse MoE routing (6 of 128 experts active per token) and a low active-parameter count reduce per-token compute.
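A back-of-the-envelope check, using the approximate figures quoted on this page (rounded, illustrative values only), shows how small the per-token active share is.

```python
# Rough arithmetic behind the "low active params" argument (illustrative values).
total_params = 30e9      # ~30B total parameters (rounded)
active_params = 3.6e9    # ~3.6B active per token
experts_total = 128      # experts per MoE layer
experts_active = 6       # experts routed per token

# With these rounded inputs the share comes out near the ~11% quoted above.
print(f"active parameter share: {active_params / total_params:.0%}")
print(f"experts used per token: {experts_active / experts_total:.1%} of each MoE layer")
```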

Does 1M context hurt speed?

With appropriate max_tokens limits and well-tuned batching, throughput remains stable even at long context lengths.

Are benchmark scripts available?

Yes: reuse the standard vLLM/SGLang benchmark scripts, updating the maximum sequence length settings to match the model. A minimal offline measurement sketch follows.
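As a concrete starting point, the sketch below measures offline throughput with vLLM's Python API at a long output length, rather than the repo benchmark scripts; the model ID, batch size, and prompt contents are placeholders for real 8K-token inputs.

```python
# Offline throughput measurement sketch (placeholder model ID, prompts, and batch size).
import time
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/nemotron-3-nano-30b", dtype="bfloat16", max_model_len=32768)
params = SamplingParams(temperature=0.0, max_tokens=16384)  # up to ~16K output tokens
prompts = ["Summarize the following report: ..."] * 64      # stand-in for real 8K-token inputs

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s over {elapsed:.1f}s")
```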