Key specs
- Architecture: Mamba‑2 + Transformer hybrid with sparse MoE (6 of 128 experts active per token)
- Params: 31.6B total, ~3.6B active per token
- Context: 1,000,000 tokens (512K continued pre‑training + 4K mixed training)
- Attention: grouped‑query attention (GQA)
- Reasoning: ON/OFF toggle with configurable thinking budgets
- Weights: BF16, servable with vLLM / SGLang (see the sketch below)
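
A minimal vLLM serving sketch for the specs above. The checkpoint ID, context length, parallelism degree, and the `enable_thinking` chat-template key are placeholders or assumptions, not confirmed details of this release; adjust them to the published model card and your hardware.

```python
# Hedged sketch: load the BF16 weights with vLLM and run one chat turn.
from vllm import LLM, SamplingParams

llm = LLM(
    model="org/model-name",   # placeholder checkpoint ID
    dtype="bfloat16",         # weights are released in BF16
    max_model_len=131072,     # raise toward 1M only if KV/state memory allows
    tensor_parallel_size=2,   # adjust to your GPU count
)

sampling = SamplingParams(temperature=0.6, max_tokens=1024)
messages = [{"role": "user", "content": "Summarize the key specs of this model."}]

# Assumption: the reasoning ON/OFF toggle is exposed through the chat template,
# as with other hybrid-reasoning models; the exact key may differ here.
out = llm.chat(messages, sampling, chat_template_kwargs={"enable_thinking": True})
print(out[0].outputs[0].text)
```

SGLang serving is analogous; point its launch command at the same BF16 checkpoint and cap the context length to what your memory budget supports.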