Nemotron 3 API: Calls, Examples, and Budgets

Call Nemotron 3 via vLLM, SGLang, or OpenRouter with examples for streaming, tool use, and thinking-budget control.


REST example (vLLM)

  • POST `/generate`: `{ "prompt": "<text>", "max_tokens": 256 }` against a running vLLM server.
  • Budget: cap thinking tokens via the prompt or request parameters (see the budget FAQ below).
  • Streaming: set `"stream": true` in the request body; tokens arrive as server-sent events (SSE). A minimal request sketch follows this list.
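A minimal non-streaming sketch, assuming a vLLM server on `localhost:8000` exposing its native `/generate` route; the exact response schema varies across vLLM versions, so treat this as a starting point rather than a contract.

```python
import requests

# Minimal sketch: POST to a vLLM server's native /generate route.
# Assumes vLLM is serving on localhost:8000; adjust host/port to your
# deployment. The response schema varies by vLLM version.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Explain thinking budgets in one sentence.", "max_tokens": 256},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```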

SGLang

  • Python: SGLang exposes an OpenAI-compatible endpoint, so the standard `openai` client works against it (see the sketch after this list).
  • Tools: declare functions in the request's tool schema; with Reasoning ON, the model keeps its chain-of-thought while calling tools.
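A minimal client sketch, assuming SGLang was launched with something like `python -m sglang.launch_server --model-path <nemotron-3-checkpoint> --port 30000`; the served model name below is a placeholder.

```python
from openai import OpenAI

# Point the standard OpenAI client at the SGLang server's
# OpenAI-compatible endpoint (no real key is needed for a local server).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="nemotron3",  # placeholder; use the name the server reports
    messages=[{"role": "user", "content": "What is a thinking budget?"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```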

OpenRouter / build.nvidia.com

  • Model: `nvidia/nemotron-3-nano-30b-instruct`; thinking budgets are supported (a streaming client sketch follows this list).
  • Enterprise: use build.nvidia.com for hosted inference and SLAs.
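A streaming sketch against OpenRouter, which speaks the OpenAI chat API; the model slug is the one listed above, but check openrouter.ai for the current listing.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<OPENROUTER_API_KEY>",  # your OpenRouter key
)

# stream=True delivers SSE chunks through the client's iterator interface.
stream = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-30b-instruct",
    messages=[{"role": "user", "content": "Summarize SSE in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```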

Multi-agent guidance

  • Default Reasoning OFF for fan-out agents; switch it ON selectively (e.g., for a planner).
  • Align tool schemas across agents to avoid parsing conflicts.
  • Log token usage per task type to tune budgets (see the sketch after this list).
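A sketch of the logging point, assuming an OpenAI-compatible endpoint as above. The system-prompt reasoning toggle is an assumption, not a documented switch; check the model card for the exact control.

```python
from collections import defaultdict
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
usage_by_task: dict[str, list[int]] = defaultdict(list)

def run_agent(task_type: str, prompt: str, reasoning: bool = False) -> str:
    # Assumed toggle: a system-prompt switch; verify against the model card.
    system = "Reasoning: on" if reasoning else "Reasoning: off"
    resp = client.chat.completions.create(
        model="nemotron3",  # placeholder served-model name
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        max_tokens=512,
    )
    # Record completion tokens per task type to tune budgets later.
    usage_by_task[task_type].append(resp.usage.completion_tokens)
    return resp.choices[0].message.content
```

Reviewing a percentile (e.g., p95) of `usage_by_task` for each task type gives a data-driven starting point for per-task budgets.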

FAQ

Is streaming supported?

Yes, vLLM and OpenRouter support SSE streaming responses.

How do I cap thinking tokens?

Declare a maximum thinking-token budget in the prompt or request payload, and combine it with the Reasoning ON/OFF toggles. A hedged payload sketch follows.
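In the sketch below, `max_thinking_tokens` is an illustrative, hypothetical field name: the real budget parameter (if any) depends on the serving stack, and `extra_body` simply passes it through unmodified.

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="<OPENROUTER_API_KEY>")

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-30b-instruct",
    messages=[{"role": "user", "content": "Plan a three-step migration."}],
    max_tokens=512,
    # Hypothetical field name: the actual budget parameter depends on the
    # serving stack; extra_body forwards it to the server untouched.
    extra_body={"max_thinking_tokens": 128},
)
print(resp.choices[0].message.content)
```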

Does tool use work out of the box?

Yes. Declare function/tool schemas in the request; keep the tool count small to control prompt cost. A schema sketch is shown below.
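A minimal tool-schema sketch in the OpenAI tools format, which the OpenAI-compatible endpoints above accept; the tool itself is illustrative, not part of any Nemotron API.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# One small, well-scoped tool; keeping the tool list short limits prompt cost.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, not a real API
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nemotron3",  # placeholder served-model name
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```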