
How to Fine-Tune Nemotron-3-Nano on Multiple GPUs

Carolyn Weitz
Last Updated: Feb 3, 2026
11 Minute Read

To fine-tune Nemotron-3-Nano on multiple GPUs, you need a workflow that’s simple enough to run locally without hitting VRAM limits, NCCL hiccups or template quirks.

NVIDIA’s Nemotron 3 Nano delivers big-model capability in a practical form factor: a ~30B-parameter Mamba–Transformer Mixture-of-Experts model with roughly 3.6B active parameters per token, built for fast, accurate coding, math and agentic tasks.

It supports up to a 1M-token context window (with a default configuration around 256k) and ranks strongly across SWE-Bench, GPQA Diamond, reasoning, chat and throughput for its size class. In practice, it can run on ~24GB VRAM (or unified memory) for inference with quantization or offloading; full BF16 training typically requires 80GB-class GPUs.

According to a McKinsey report, 88 percent of organizations surveyed report regular AI use in at least one business function.

This guide walks you step-by-step from setup to training, debugging and serving.

Prerequisites

These prerequisites set the constraints for everything that follows, including VRAM, context length, stability and serving options.

Prerequisite A: Pick a base checkpoint that matches your goal

Use one of these exact checkpoints to avoid pulling the wrong variant:

  • Base BF16: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 is the safer starting point for instruction tuning, because your dataset defines behavior from a less biased baseline.
  • Chat BF16: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 is the better choice when you want to preserve an assistant style and only adapt domain knowledge or formatting.

This choice matters because instruction-tuned checkpoints can resist style changes unless your data is consistent and high-signal.

Prerequisite B: Choose LoRA vs QLoRA based on your VRAM margin

  • QLoRA (4-bit) is the default for most setups because it reduces weight memory and lowers the chance of OOM during evaluation and checkpoint peaks.
  • 16-bit LoRA can work on 80GB-class GPUs, although Unsloth notes Nemotron 3 Nano 16-bit LoRA fine-tuning can use around 60GB VRAM at moderate sequence lengths and batch sizes. Longer contexts or larger effective batches push that number higher, which is why most teams start with QLoRA.
  • This choice matters because DDP replicates the model per GPU, so each GPU must carry the full per-rank footprint; the rough estimate sketched below shows why.
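For intuition about that per-rank footprint, here is a rough back-of-envelope sketch in Python. The numbers are approximations only: they cover weight memory alone, ignoring optimizer state, activations and KV memory, and the 30e9 parameter count is simply the model-card figure used above.

PARAMS = 30e9  # ~30B total parameters (model card figure)

# Weight memory per DDP rank, since every GPU holds a full replica.
# Real usage is higher once activations, optimizer state and KV enter.
for label, bytes_per_param in [("BF16", 2.0), ("8-bit", 1.0), ("4-bit (QLoRA)", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{label:>14}: ~{gib:.0f} GiB of weights per GPU")

At BF16 this lands near 56 GiB of weights alone, which is consistent with the ~60GB training figure above once overhead is added.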

Prerequisite C: Set a realistic context-length plan

  • Start with a modest MAX_SEQ_LEN like 2K to 8K, then increase after the run is stable and checkpointing is proven.
  • The model supports up to a 1M-token context, but the published Hugging Face config defaults to 256k, with a recommended max input length of 128k tokens. Longer contexts require custom configuration and significantly more VRAM.
  • Context length drives attention and KV-related memory, which often dominates memory well before adapter weights matter.

Prerequisite D: Decide how you will handle thinking traces

  • Nemotron can emit <think> traces through template controls, which changes training text and expected outputs.
  • Pick one policy early, then keep it consistent across dataset formatting, evaluation prompts and serving settings; the rendering check below shows what the flag changes.
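To see concretely what the policy changes, render one conversation with the flag toggled. This sketch assumes the checkpoint’s chat template accepts the enable_thinking keyword, which is the same flag the training script in Step 4 uses; confirm against the model card for your exact checkpoint.

from transformers import AutoTokenizer

# Render the same conversation with thinking traces on and off
tok = AutoTokenizer.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16", trust_remote_code=True
)
messages = [{"role": "user", "content": "What is 2 + 2?"}]
for thinking in (True, False):
    text = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=thinking,
    )
    print(f"--- enable_thinking={thinking} ---\n{text}")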

Step 1: Verify Multi-GPU is Visible

This step proves the host and driver layer can see the exact GPUs you intend to train on.

Run:

nvidia-smi

What to validate:

  • All expected GPUs appear and each shows the same driver branch and healthy temperatures.
  • No unexpected processes occupy VRAM, because small allocations can trigger early fragmentation and OOM.
  • The GPU order is stable, because device mapping issues become harder to debug after distributed launch.
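As a complementary check from the framework side, confirm PyTorch enumerates the same devices the driver reports:

import torch

# Each GPU should appear with the expected name and memory size
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")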

Step 2: Create the Environment and Install Dependencies

This step removes version drift, which is one of the most common causes of distributed runtime failures.

Recommended approach:

  • Use a clean virtual environment, then pin key packages like PyTorch, Transformers, Accelerate and bitsandbytes.
  • Install Unsloth from source when you want the training examples and internal loader behavior to match your environment.
  • This approach reduces “import works on rank 0 but fails on rank 3” issues caused by mismatched wheels or caches.

A practical install flow:

python -m venv .venv
source .venv/bin/activate
pip install -U pip
git clone https://github.com/unslothai/unsloth.git
cd unsloth
pip install .
cd ..
pip install -U "torch" "transformers" "datasets" "trl" "accelerate" "peft" "bitsandbytes"

Nemotron-specific dependency callout

Nemotron includes Mamba components, so fast Mamba kernels require mamba-ssm and causal-conv1d on CUDA.

Install them explicitly to avoid missing op and import errors:

pip install "mamba-ssm[causal-conv1d]"

Quick stack verification

Run this once to confirm CUDA is usable from PyTorch:

python -c "import torch; print(torch.version, torch.version.cuda, torch.cuda.device_count())"

Step 3: Prepare Your Dataset in Chat Format

This step ensures the model learns the behavior you want, rather than learning prompt artifacts or broken role structure.

Use one JSONL row per conversation, for example:

{"messages":[
{"role":"user","content":"Explain gradient accumulation in simple terms."},
{"role":"assistant","content":"Gradient accumulation lets you simulate bigger batches by..."}
]}

Add one validation pass before training

Render five samples with apply_chat_template and inspect the final prompt text, because template mistakes often look like “bad tuning.”
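A minimal sketch of that validation pass, assuming train.jsonl and the base checkpoint from Prerequisite A:

import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16", trust_remote_code=True
)
# Print the first five conversations exactly as the trainer will see them
with open("train.jsonl") as f:
    for _, line in zip(range(5), f):
        messages = json.loads(line)["messages"]
        print(tok.apply_chat_template(messages, tokenize=False,
                                      add_generation_prompt=False))
        print("-" * 60)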

Thinking-trace consistency rule

If thinking traces are enabled in formatting, your eval prompts should also expect that behavior, otherwise scoring and reviews become inconsistent.

Step 4: Write a Minimal Training Script That Can Scale

This step creates a single source of truth for loading, formatting, adapter attachment and training arguments.

Runnable minimal train.py

This script loads the base model, applies chat templating, attaches QLoRA or LoRA and saves adapter artifacts.

train.py
import os

import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from unsloth import FastLanguageModel

os.environ["TOKENIZERS_PARALLELISM"] = "false"

MODEL_NAME = os.environ.get("MODEL_NAME", "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16")
DATA_PATH = os.environ.get("DATA_PATH", "train.jsonl")
OUT_DIR = os.environ.get("OUT_DIR", "outputs")
MAX_SEQ_LEN = int(os.environ.get("MAX_SEQ_LEN", "4096"))

# QLoRA default
LOAD_IN_4BIT = os.environ.get("LOAD_IN_4BIT", "1") == "1"
ENABLE_THINKING = os.environ.get("ENABLE_THINKING", "true").lower() == "true"


def formatting_func(example, tokenizer):
    return tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
        enable_thinking=ENABLE_THINKING,
    )


def main():
    torch.backends.cuda.matmul.allow_tf32 = True

    dataset = load_dataset("json", data_files={"train": DATA_PATH}, split="train")

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_NAME,
        max_seq_length=MAX_SEQ_LEN,
        load_in_4bit=LOAD_IN_4BIT,
        trust_remote_code=True,
    )

    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
        lora_dropout=0.0,
        bias="none",
        use_gradient_checkpointing="unsloth",
        max_seq_length=MAX_SEQ_LEN,
        random_state=3407,
    )

    args = SFTConfig(
        output_dir=OUT_DIR,
        max_seq_length=MAX_SEQ_LEN,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        warmup_steps=50,
        num_train_epochs=1,
        logging_steps=10,
        save_steps=200,
        optim="adamw_8bit",
        bf16=True,
        seed=3407,
    )

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        args=args,
        formatting_func=lambda ex: formatting_func(ex, tokenizer),
    )

    trainer.train()

    # Save adapters and tokenizer for reproducible serving
    model.save_pretrained(f"{OUT_DIR}/lora_adapters")
    tokenizer.save_pretrained(f"{OUT_DIR}/lora_adapters")


if __name__ == "__main__":
    main()

Adapter injection warning

If a target module name fails, print the model’s module names and adjust the list, because layer naming varies across releases; a quick inspection sketch follows below.
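A minimal inspection sketch, assuming model is the object returned by FastLanguageModel.from_pretrained in train.py:

import torch.nn as nn

# Collect the unique suffixes of Linear-like layers so target_modules
# can be matched against what this release actually names them
suffixes = {name.split(".")[-1] for name, m in model.named_modules()
            if isinstance(m, nn.Linear)}
print(sorted(suffixes))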

Step 5: Run a Single-GPU Sanity Check First

This step confirms that the training loop is valid before adding distributed coordination and additional failure modes.

Run:

CUDA_VISIBLE_DEVICES=0 python train.py

What “sanity” means in practice:

  • The job should complete at least 50 to 200 steps without NaNs, hangs or repeated OOM.
  • A checkpoint should save successfully, then reload without shape mismatches or missing tokenizer files.
  • Loss should move in the right direction, even if the early curve is noisy on small batches.
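A lightweight reload check, assuming adapters were saved to outputs/lora_adapters as in train.py:

import json
from transformers import AutoTokenizer

# Confirm the adapter config and tokenizer round-trip from disk
with open("outputs/lora_adapters/adapter_config.json") as f:
    cfg = json.load(f)
print("rank:", cfg.get("r"), "targets:", cfg.get("target_modules"))

tok = AutoTokenizer.from_pretrained("outputs/lora_adapters")
print("tokenizer loaded:", tok.__class__.__name__)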

Step 6: Configure Accelerate for Multi-GPU

This step standardizes how ranks are created, how mixed precision is applied and how devices are assigned.

Run:

accelerate config

What to aim for:

  • Local machine, multi-GPU, then bf16 where supported for stable training on modern NVIDIA GPUs.
  • A saved config that can be reused across runs, which reduces accidental launcher differences.
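For reference, a saved config for this setup looks roughly like the YAML below, typically written to ~/.cache/huggingface/accelerate/default_config.yaml. Treat the exact keys as illustrative, since they vary slightly across Accelerate versions:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 4
gpu_ids: all
machine_rank: 0
main_training_function: main
use_cpu: false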

Step 7: Launch Multi-GPU Fine-tuning

This step activates DDP behavior, where each GPU runs a full replica and gradients sync every step.

Option A, Accelerate:

accelerate launch --num_processes 4 train.py

Option B, torchrun:

torchrun --standalone --nproc_per_node=4 train.py

One output directory rule

Only one process should write checkpoints and adapters to the shared output directory, otherwise artifacts can collide.
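One way to enforce this in train.py is to gate the save on the main process, using the Trainer API the script already has in scope:

# In train.py, replace the unconditional save with a rank-0 gate
if trainer.is_world_process_zero():
    model.save_pretrained(f"{OUT_DIR}/lora_adapters")
    tokenizer.save_pretrained(f"{OUT_DIR}/lora_adapters")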

Step 8: Monitor Stability and Intervene Early

This step prevents long, expensive runs that fail late due to predictable scaling bottlenecks.

During training, watch:

watch -n 1 nvidia-smi

What to monitor and why:

  • VRAM per GPU should be similar, because imbalance often indicates padding waste or data sharding issues.
  • Step time variance often predicts upcoming hangs, especially when checkpointing and storage are slow.
  • Tokens per second reveals dataloader bottlenecks, which frequently erase the benefit of extra GPUs.

If you hit OOM, change in this order:

  • Reduce MAX_SEQ_LEN, because sequence length multiplies several memory costs.
  • Reduce per_device_train_batch_size, because microbatch peaks trigger allocator fragmentation.
  • Increase gradient_accumulation_steps, because it preserves effective batch size without raising peaks.
  • Keep QLoRA enabled during stabilization, because it preserves headroom for long-context experiments.
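If fragmentation-style OOMs persist after those changes, PyTorch’s caching allocator can also be switched to expandable segments before launch:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True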

NCCL quick triage checklist

Use these checks when the job hangs at startup or stalls mid-run:

  • Set NCCL_DEBUG=INFO and re-run, because logs identify which rank is stuck.
  • Set NCCL_SOCKET_IFNAME to the correct interface when multiple NICs exist.
  • Reduce to two GPUs and re-run, because it isolates topology and bandwidth problems quickly.
  • Move checkpoint writes to local SSD, because slow network storage can mimic NCCL timeouts.
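A typical triage launch combining those flags looks like this, where eth0 is a placeholder for your actual network interface:

NCCL_DEBUG=INFO NCCL_SOCKET_IFNAME=eth0 torchrun --standalone --nproc_per_node=2 train.py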

Step 9: If DDP Does Not Fit, Switch to Sharding

This step gives a clear decision path when the model footprint does not fit per GPU under DDP.

When to stay on DDP

Stay on DDP when the model, adapters and activations fit on each GPU with stable headroom.

When to switch to FSDP or ZeRO-style sharding

Switch when you see OOM during forward pass, evaluation or save, even after reducing sequence length and microbatch size.

Practical rule:

  • If the model cannot fit per GPU, sharding is the next step, not more GPUs under DDP.
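With Accelerate, the switch is a config change rather than a script change: re-run accelerate config and choose FSDP. The saved config gains a section roughly like the sketch below; key names vary across Accelerate versions, so verify against your installed release:

distributed_type: FSDP
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP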

Step 10: Export a Merged Model (optional)

This step converts “base + adapters” into one artifact for simpler serving pipelines.

When merging helps:

  • A single directory is easier to deploy when tooling cannot load adapters.
  • Merged weights reduce runtime configuration complexity.

When merging hurts:

  • Merging can require significant CPU RAM and time, which slows iteration.
  • Keeping adapters separate supports rapid testing and rollback.
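If you do merge, a minimal sketch using PEFT’s merge_and_unload follows, assuming the adapters saved by train.py. The BF16 base is loaded on CPU, which is exactly where the high-RAM cost mentioned above comes from:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16"

# Load the BF16 base, attach the adapters, then fold them into the weights
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, trust_remote_code=True
)
merged = PeftModel.from_pretrained(base, "outputs/lora_adapters").merge_and_unload()

merged.save_pretrained("outputs/merged")
AutoTokenizer.from_pretrained("outputs/lora_adapters").save_pretrained("outputs/merged")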

Step 11: Serve Across Multiple GPUs with vLLM Tensor Parallel

This step validates inference behavior, including template handling and multi-GPU sharding.

Recipe-aligned serving baseline

At the time of writing, Nemotron examples run against vLLM 0.12.0; always check the latest Nemotron + vLLM docs and pin to a compatible version, because APIs change quickly. The recipe also provides a baseline with --max-model-len 262144 plus the reasoning parser plugin.

Install and download the parser:

pip install -U "vllm==0.12.0"
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py

Serve a merged model across four GPUs:

vllm serve outputs/merged \
--trust-remote-code \
--tensor-parallel-size 4 \
--max-model-len 262144 \
--port 8000 \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3

Serve without merging, using adapters

vLLM supports LoRA adapters, therefore you can serve the base model and load adapters without producing a merged checkpoint.

Start the server with LoRA enabled:

vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--trust-remote-code \
--tensor-parallel-size 4 \
--max-model-len 262144 \
--port 8000 \
--enable-lora \
--max-loras 1 \
--max-lora-rank 16 \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--reasoning-parser nano_v3

If you want runtime adapter loading by name, vLLM documents a resolver plugin workflow using environment variables and a local adapter cache directory.
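A simpler static alternative, if your vLLM version supports the --lora-modules flag, is to register the adapter at startup under a name clients can request (nemotron-sft below is an arbitrary label):

vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--trust-remote-code \
--tensor-parallel-size 4 \
--enable-lora \
--lora-modules nemotron-sft=outputs/lora_adapters

Requests then pass "model": "nemotron-sft" to route through the adapter.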

Step 12: Send a Quick Smoke Test Request

This step confirms end-to-end behavior, including routing, tokenization, generation and response format.

Example request:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [{"role": "user", "content": "Write a haiku about multi-GPU training."}],
    "max_tokens": 512
  }'

What to validate in the response:

  • The server returns a well-formed JSON payload without template errors.
  • The content matches your thinking policy and chat formatting expectations.
  • Latency and GPU utilization look reasonable for your chosen tensor parallel size.
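The same smoke test through the OpenAI-compatible Python client, assuming the server from Step 11 is listening on port 8000:

from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the key is unused locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="model",  # or the adapter name if serving with --lora-modules
    messages=[{"role": "user", "content": "Write a haiku about multi-GPU training."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)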

Ready to Fine-Tune Faster on Real GPU Clusters?

Now you’ve got the runbook to fine-tune Nemotron-3-Nano across multiple GPUs, validate outputs and serve with vLLM.

When you’re ready to scale beyond a local box, AceCloud helps you spin up NVIDIA H200, H100, A100, RTX Pro 6000 or L40S GPUs in minutes with pay-as-you-go pricing and a 99.99%* uptime SLA.

Want hands-on guidance? Book a demo or talk to an expert, and use the free credits (₹20,000) to benchmark your configs, throughput and cost before committing.

Need a production landing zone? Pair GPUs with AceCloud’s managed Kubernetes and VPC networking for secure, repeatable training and serving pipelines.

Frequently Asked Questions

What GPU setup do I need to fine-tune Nemotron-3-Nano?
Start with LoRA or QLoRA, then size GPUs around sequence length and batch needs. Unsloth notes 16-bit LoRA fine-tuning can use around 60GB VRAM, which often pushes teams to QLoRA first.

Should I use LoRA, QLoRA or full fine-tuning?
Use LoRA or QLoRA for most teams, because they reduce VRAM pressure and make failures easier to recover. Choose full tuning only with strong data, rigorous eval and enough memory headroom.

How do I run training across multiple GPUs?
Use accelerate launch or torchrun for DDP once your single-GPU script is stable. DDP replicates the model per GPU and synchronizes gradients each step.

Should I train in BF16 or FP8?
Start with BF16 for training stability, then evaluate FP8 serving when hardware and stack support it. vLLM’s recipe notes BF16 and FP8 model variants for Nemotron-3-Nano.

Carolyn Weitz
Carolyn began her cloud career at a fast-growing SaaS company, where she led the migration from on-prem infrastructure to a fully containerized, cloud-native architecture using Kubernetes. Since then, she has worked with companies ranging from early-stage startups to global enterprises, helping them implement best practices in cloud operations, infrastructure automation and container orchestration. Her technical expertise spans AWS, Azure and GCP, with a focus on building scalable IaaS environments and streamlining CI/CD pipelines. Carolyn is also a frequent contributor to cloud-native open-source communities and enjoys mentoring aspiring engineers in the Kubernetes ecosystem.
