Multi-GPU LLM Training on Kubernetes: Build a Production-Ready Pipeline

Carolyn Weitz

Last Updated: May 20, 2026


Multi-GPU LLM training has become essential as teams move from small model experiments to larger fine-tuning and pretraining workloads. A single GPU often becomes a bottleneck because of memory limits, long training cycles, and slow iteration speed.

However, production-grade multi-GPU training is not just about requesting more GPUs. A reliable pipeline needs GPU-aware scheduling, reproducible containers, distributed launch coordination, checkpoint recovery, NCCL-ready networking, monitoring, cost controls, and security guardrails.

Kubernetes provides the operational foundation for this. It gives platform teams a consistent way to schedule GPU workloads, isolate teams, enforce quotas, connect with CI/CD, and standardize training jobs across environments.

In this guide, we will build a practical production pipeline for multi-GPU LLM training on Kubernetes using NVIDIA GPU Operator, PyTorch torchrun, Kubeflow Trainer, checkpoint storage, DCGM-based GPU monitoring, and scheduling policies.

Reference Architecture: What You Are Building

Use this as the target architecture for a production-ready training platform.

At a high level, the pipeline contains:

Layer	What it does
GPU infrastructure	Dedicated GPU node pools with labels, taints, and GPU drivers
GPU enablement	NVIDIA GPU Operator or NVIDIA device plugin exposes GPUs to Kubernetes
Training runtime	PyTorch, Hugging Face Accelerate, Kubeflow Trainer, or another distributed launcher
Scheduling control	Taints, tolerations, affinity, queues, quotas, and gang scheduling where needed
Storage	Shared filesystem or object storage for datasets, checkpoints, and artifacts
Observability	GPU telemetry, training metrics, logs, traces, and cluster events
Governance	RBAC, Secrets, image scanning, namespace isolation, and cost policies

The NVIDIA GPU Operator is useful because it can automate several GPU software components, including NVIDIA drivers, the Kubernetes device plugin, NVIDIA Container Toolkit, GPU Feature Discovery, and DCGM-based monitoring.

Prerequisites to Consider for Multi-GPU LLM Training

Before building the pipeline, align on these platform requirements.

Requirement	Recommendation
Kubernetes cluster	Use a supported Kubernetes version with GPU node pools
Container runtime	Use containerd or another supported runtime configured for NVIDIA containers
GPU enablement	Use NVIDIA GPU Operator for production lifecycle management
Training framework	Start with PyTorch torchrun; standardize later on Kubeflow Trainer if needed
Storage	Use object storage or a shared filesystem for checkpoints
Networking	Validate low-latency east-west traffic before multi-node training
Observability	Deploy Prometheus, Grafana, DCGM Exporter, and centralized logs
Security	Use namespace RBAC, Secrets, image scanning, and least-privilege service accounts

Single-Node vs Multi-Node Multi-GPU Training

Before implementing anything, decide which training pattern you actually need.

Pattern	Use when	Trade-off
Single-node multi-GPU	The model fits on GPUs within one server	Simpler networking and easier debugging, but capped at one node's GPU count and memory
Multi-node DDP	You need more GPUs than one node provides	Requires stronger network and launch coordination
FSDP / ZeRO	Model or optimizer state does not fit in GPU memory	More complex checkpointing and tuning
Tensor parallelism	Individual layers are too large for one GPU	Requires model-specific distributed strategy
Pipeline parallelism	Model can be split into sequential stages	Can introduce pipeline bubbles
Hybrid parallelism	Very large pretraining or frontier-scale workloads	Highest operational complexity

Kubeflow Trainer supports PyTorch distributed training and can run DDP, FSDP, FSDP2, and other PyTorch-supported distributed algorithms.

A practical rollout path is:

1. Validate one GPU.
2. Validate one node with multiple GPUs.
3. Validate multi-node distributed training.
4. Add checkpoint restore testing.
5. Add queueing, quotas, monitoring, and cost controls.

Step 1: Provision GPU Nodes and Isolate Them

Create a dedicated GPU node pool instead of mixing GPU and CPU workloads on the same nodes. Then label and taint GPU nodes so only approved workloads can land there.

Example labels:

kubectl label node gpu-node-1 accelerator=nvidia
kubectl label node gpu-node-1 gpu.model=a100
kubectl label node gpu-node-1 workload-type=gpu-training
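# topology.kubernetes.io/zone is usually set by the cloud provider; set it manually only if it is missing.
kubectl label node gpu-node-1 topology.kubernetes.io/zone=zone-a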

Example taint:

kubectl taint nodes gpu-node-1 dedicated=gpu-training:NoSchedule

Training workloads must include a matching toleration:

tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "gpu-training"
  effect: "NoSchedule"

Use node affinity to keep training pods on the intended GPU pool:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: workload-type
          operator: In
          values:
          - gpu-training
        - key: gpu.model
          operator: In
          values:
          - a100

This avoids accidental placement on the wrong nodes and makes scheduling behavior easier to debug.
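
To make the taint and toleration rule concrete, here is a minimal Python sketch of how a toleration is matched against a taint. It is a simplified model of the scheduler's behavior, not the scheduler's code, and the function names `toleration_matches` and `pod_can_schedule` are our own.

```python
def toleration_matches(taint: dict, toleration: dict) -> bool:
    """Return True if a single toleration tolerates a single taint (simplified rules)."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if key == "":
        # An empty key is only meaningful with Exists, where it tolerates every key.
        if operator != "Exists":
            return False
    elif key != taint["key"]:
        return False
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def pod_can_schedule(node_taints: list, pod_tolerations: list) -> bool:
    """Every NoSchedule or NoExecute taint must be tolerated by at least one toleration."""
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(
        any(toleration_matches(taint, tol) for tol in pod_tolerations)
        for taint in blocking
    )


gpu_taint = {"key": "dedicated", "value": "gpu-training", "effect": "NoSchedule"}
training_toleration = {
    "key": "dedicated",
    "operator": "Equal",
    "value": "gpu-training",
    "effect": "NoSchedule",
}
print(pod_can_schedule([gpu_taint], [training_toleration]))
```

A pod without the toleration is rejected by the taint. Node affinity then narrows the tolerated pods to the intended GPU pool, which is why the two are used together.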

Step 2: Enable GPUs for Kubernetes Scheduling

Kubernetes uses device plugins to expose hardware resources such as GPUs to the kubelet. After the GPU plugin is installed, nodes advertise resources such as nvidia.com/gpu, and pods can request them in their container limits.

For production, prefer NVIDIA GPU Operator because it manages the GPU software lifecycle more completely. For a minimal setup, you can use the NVIDIA device plugin.

Validate GPU capacity

# Use -A10: nvidia.com/gpu is listed after cpu, ephemeral-storage, hugepages, and memory.
kubectl describe node gpu-node-1 | grep -A10 "Capacity"
kubectl describe node gpu-node-1 | grep -A10 "Allocatable"

You should see something similar to:

nvidia.com/gpu: 8

Run a GPU smoke test

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu-training"
    effect: "NoSchedule"
  nodeSelector:
    workload-type: gpu-training
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["bash", "-lc", "nvidia-smi && sleep 10"]
    resources:
      limits:
        nvidia.com/gpu: 1

Apply and inspect logs:

kubectl apply -f gpu-smoke-test.yaml
kubectl logs gpu-smoke-test

NOTE: Kubernetes GPU resources should be specified in limits. If requests are also specified, the request and limit values must match.
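
The rule in the note above is easy to check before a manifest reaches the cluster. The following is a small sketch of such a check, assuming the container's `resources` block has already been loaded into a Python dict; `check_gpu_resources` is a hypothetical helper, not part of any Kubernetes client.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def check_gpu_resources(resources: dict) -> list:
    """Return a list of problems found in a container's GPU requests and limits."""
    limits = resources.get("limits") or {}
    requests = resources.get("requests") or {}
    problems = []
    if GPU_RESOURCE in requests:
        if GPU_RESOURCE not in limits:
            # A GPU request is only valid when a GPU limit is also set.
            problems.append(f"{GPU_RESOURCE} is requested but has no limit")
        elif int(requests[GPU_RESOURCE]) != int(limits[GPU_RESOURCE]):
            # When both are set, the values must be equal.
            problems.append(f"{GPU_RESOURCE} request and limit differ")
    return problems


# Limits only, as in the smoke test above: no problems reported.
print(check_gpu_resources({"limits": {"nvidia.com/gpu": 1}}))
```

A check like this fits naturally into CI, so that a malformed GPU request is caught in review rather than at admission time.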

Step 3: Containerize Training for Reproducibility

Your training image should be treated as an immutable release artifact.

A basic Dockerfile can look like this:

FROM pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime

WORKDIR /workspace

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY train.py .
COPY configs/ ./configs/

ENV PYTHONUNBUFFERED=1
ENV TOKENIZERS_PARALLELISM=false

CMD ["python", "train.py"]

Recommended image practices:

Practice	Why it matters
Pin PyTorch, CUDA, Transformers, Accelerate, and NCCL-compatible versions	Reduces runtime drift
Tag images with git SHA	Makes every run traceable
Log image digest per run	Supports auditability
Externalize hyperparameters	Prevents hidden config drift
Store training config in Git	Enables review and rollback

Example image tag:

docker build -t registry.example.com/llm-trainer:${GIT_SHA} .
docker push registry.example.com/llm-trainer:${GIT_SHA}

For every training run, log:

run_id
git_commit
image_digest
dataset_version
config_hash
world_size
checkpoint_path
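
One way to produce this record is sketched below. It assumes the training config is available as a plain dict, and it derives `config_hash` from a canonical JSON dump so that key order does not change the hash. The helper names and the sample values are illustrative only.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a config dict in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def build_run_record(run_id, git_commit, image_digest, dataset_version,
                     config, world_size, checkpoint_path) -> dict:
    """Collect the per-run fields listed above into one loggable record."""
    return {
        "run_id": run_id,
        "git_commit": git_commit,
        "image_digest": image_digest,
        "dataset_version": dataset_version,
        "config_hash": config_hash(config),
        "world_size": world_size,
        "checkpoint_path": checkpoint_path,
    }


record = build_run_record(
    run_id="run-2026-01-20-a",
    git_commit="REPLACE_WITH_GIT_SHA",
    image_digest="REPLACE_WITH_IMAGE_DIGEST",
    dataset_version="REPLACE_WITH_DATASET_VERSION",
    config={"learning_rate": 2e-5, "global_batch_size": 256},  # placeholder values
    world_size=8,
    checkpoint_path="/checkpoints/run-2026-01-20-a",
)
print(json.dumps(record, indent=2))
```

Emitting the record as a single JSON line at job start makes it easy to find in centralized logs.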

Step 4: Run a Single-Node Multi-GPU Training Job

Start with single-node multi-GPU training before moving to multi-node. It is easier to debug and avoids distributed networking issues.

Example Kubernetes Job using PyTorch torchrun:

apiVersion: batch/v1
kind: Job
metadata:
  name: llm-train-single-node
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: llm-training
        run-id: run-2026-01-20-a
    spec:
      restartPolicy: Never
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu-training"
        effect: "NoSchedule"
      nodeSelector:
        workload-type: gpu-training
        gpu.model: a100
      containers:
      - name: trainer
        image: registry.example.com/llm-trainer:REPLACE_WITH_GIT_SHA
        command:
        - bash
        - -lc
        - |
          torchrun \
            --standalone \
            --nproc_per_node=8 \
            train.py \
            --config configs/train.yaml \
            --output_dir /checkpoints/run-2026-01-20-a
        env:
        - name: NCCL_DEBUG
          value: "WARN"
        - name: CUDA_DEVICE_MAX_CONNECTIONS
          value: "1"
        resources:
          requests:
            cpu: "32"
            memory: "256Gi"
          limits:
            nvidia.com/gpu: 8
            memory: "256Gi"
        volumeMounts:
        - name: checkpoints
          mountPath: /checkpoints
      volumes:
      - name: checkpoints
        persistentVolumeClaim:
          claimName: llm-checkpoints-pvc

This pattern is useful when one node has enough GPU memory and compute for the workload. For example, one 8-GPU node is often easier to operate than an 8-GPU job spread across several nodes.

Step 5: Move to Multi-Node Training

Move to multi-node training only after single-node training is stable. Multi-node jobs add coordination, networking, rank assignment, failure handling, and NCCL tuning.

For production teams, use a controller such as Kubeflow Trainer instead of hand-rolling worker pods. Kubeflow Trainer automatically configures PyTorch distributed environment variables such as WORLD_SIZE, RANK, and LOCAL_RANK, and it uses torchrun for PyTorch distributed jobs.

Example Kubeflow Trainer-style Python workflow:

from kubeflow.trainer import TrainerClient, CustomTrainer

def train_func():
    import os
    import torch
    import torch.distributed as dist

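    # NCCL needs GPUs; fall back to gloo so the function also runs on CPU-only nodes.
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")

    # torchrun sets LOCAL_RANK; bind this process to its own GPU.
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    if use_cuda:
        torch.cuda.set_device(local_rank)

    print({
        "world_size": dist.get_world_size(),
        "rank": dist.get_rank(),
        "local_rank": local_rank,
    })

    # Import and call your actual training loop here.
    # train_model(config_path="/workspace/configs/train.yaml")

    # Release communicator resources before the process exits.
    dist.destroy_process_group()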

client = TrainerClient()

client.train(
    trainer=CustomTrainer(
        func=train_func,
        resources_per_node={
            "gpu": 8,
            "cpu": 32,
            "memory": "256Gi",
        },
        num_nodes=2,
    )
)
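
It helps to know what values to expect from the printed `world_size`, `rank`, and `local_rank`. The sketch below shows the usual arithmetic, assuming one process per GPU and the same GPU count on every node. The mapping from node to node rank is decided by the launcher, so treat the layout as illustrative.

```python
def world_size(num_nodes: int, gpus_per_node: int) -> int:
    """Total number of training processes, one per GPU."""
    return num_nodes * gpus_per_node


def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Global rank of a process, given its node rank and its rank within the node."""
    return node_rank * gpus_per_node + local_rank


def rank_layout(num_nodes: int, gpus_per_node: int) -> list:
    """List (global_rank, node_rank, local_rank) for every process in the job."""
    return [
        (global_rank(node, local, gpus_per_node), node, local)
        for node in range(num_nodes)
        for local in range(gpus_per_node)
    ]


# The example above uses num_nodes=2 with 8 GPUs per node.
print(world_size(2, 8))
for entry in rank_layout(2, 8)[:3]:
    print(entry)
```

If a worker reports a world size smaller than `num_nodes × GPUs per node`, some workers did not join, which is worth checking before looking at the training code itself.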

Use multi-node training when:

Signal	Interpretation
One node cannot fit the model	Use FSDP, ZeRO, tensor parallelism, or larger GPU memory
Single-node training is too slow	Scale workers across nodes
GPU utilization is high but wall-clock time is too long	Multi-node may help
GPU utilization is low	Fix data pipeline before adding nodes
NCCL tests are unstable	Do not scale yet

Step 6: Add Gang Scheduling or Queueing for Large Jobs

Distributed training jobs often require all workers to start together. If only half the workers start, the job can hang while still holding expensive GPUs.

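This is where gang scheduling or queue-based admission helps. Kubernetes v1.35 includes alpha gang scheduling support that schedules a group of pods on an all-or-nothing basis when the configured minimum pod count can be satisfied. Because the feature is alpha, it is disabled by default, must be enabled through feature gates, and may change between releases.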

Use gang scheduling, Volcano, Kueue, or another batch-aware scheduler when:

Problem	Why it matters
Partial worker placement	Wastes GPUs and blocks other jobs
Multiple teams submit large jobs	Requires fairness and queueing
Cluster has mixed GPU types	Needs policy-based placement
Spot capacity is used	Requires retry and checkpoint-aware scheduling
Long-running pretraining jobs	Needs predictable admission and preemption behavior

For most teams, this rollout works well:

1. Start with native Kubernetes Jobs.
2. Move distributed jobs to Kubeflow Trainer.
3. Add Kueue, Volcano, or gang scheduling when large multi-pod jobs create contention.
4. Add priority classes and quotas for team-level fairness.
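
The all-or-nothing idea behind gang scheduling can be illustrated with a few lines of Python. This is a toy model, assuming identical worker pods and a simple map of free GPUs per node; it is not how Volcano, Kueue, or the Kubernetes scheduler are implemented, and `gang_place` is a hypothetical name.

```python
def gang_place(free_gpus_per_node: dict, num_pods: int, gpus_per_pod: int):
    """Place all pods or none.

    Returns {pod_index: node_name} when every pod fits, otherwise None,
    so a partially placed job never holds GPUs.
    """
    remaining = dict(free_gpus_per_node)  # work on a copy; commit only on success
    placement = {}
    for pod in range(num_pods):
        for node, free in remaining.items():
            if free >= gpus_per_pod:
                remaining[node] = free - gpus_per_pod
                placement[pod] = node
                break
        else:
            # At least one pod cannot be placed: admit nothing.
            return None
    return placement


# Two 8-GPU workers fit on two free nodes.
print(gang_place({"gpu-node-1": 8, "gpu-node-2": 8}, num_pods=2, gpus_per_pod=8))
# Only one worker would fit, so the whole job waits instead of starting half of it.
print(gang_place({"gpu-node-1": 8, "gpu-node-2": 4}, num_pods=2, gpus_per_pod=8))
```

Without this behavior, the second case would start one worker that then waits indefinitely for its peer while holding eight GPUs.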

Step 7: Configure Checkpoint Storage and Restore

Checkpointing is not optional for production LLM training. Node failures, preemptions, networking issues, and job evictions are normal at scale.

A production checkpoint should include:

Checkpoint item	Why it matters
Model weights	Required to resume training
Optimizer state	Prevents training instability after resume
Scheduler state	Keeps learning rate progression correct
Global step	Avoids duplicate or skipped steps
RNG state	Improves reproducibility
Tokenizer and model config	Prevents mismatch during restore
Parallelism strategy	Needed for FSDP, ZeRO, or sharded checkpoints
Dataset cursor or consumed samples	Prevents data replay errors

Example PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-checkpoints-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Ti
  storageClassName: fast-shared-storage

Recommended checkpoint policy:

Policy	Recommendation
Time-based	Save every 30–60 minutes
Step-based	Save every N global steps
Milestone-based	Save at evaluation milestones
Retention	Keep latest, best, and last-known-good checkpoints
Restore drill	Test restore before production launch

A checkpoint is only useful if it can be restored. Add a staging restore drill to every major training pipeline release.
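
A restore drill can start small. The sketch below writes a checkpoint atomically (temporary file, then rename) and verifies on load that the expected items are present. It uses `pickle` and plain dicts so that it runs anywhere; a real job would typically save framework state with the framework's own serializer, and rename atomicity should be verified on your shared filesystem. The key names in `REQUIRED_KEYS` are assumptions based on the table above.

```python
import os
import pickle
import tempfile

REQUIRED_KEYS = {
    "model", "optimizer", "scheduler", "global_step", "rng_state", "consumed_samples",
}


def save_checkpoint_atomic(state: dict, path: str) -> None:
    """Write to a temp file in the target directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def restore_drill(path: str) -> dict:
    """Load a checkpoint and fail loudly if any required item is missing."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    missing = REQUIRED_KEYS - set(state)
    if missing:
        raise ValueError(f"checkpoint is missing: {sorted(missing)}")
    return state
```

Running `restore_drill` against the latest checkpoint in a staging job is a cheap way to catch missing optimizer or scheduler state before a real failure does.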

Step 8: Validate Networking and NCCL Readiness

Multi-node training often fails at the network layer before it fails at the model layer.

Before scaling, validate:

Check	Why it matters
Same GPU model across workers	Reduces performance variance
Same driver, CUDA, and NCCL-compatible stack	Prevents hard-to-debug runtime issues
Low-latency east-west traffic	Required for collective communication
MTU consistency	Prevents packet fragmentation and throughput drops
Zone placement	Cross-zone collectives can be slower
NCCL test	Confirms collective communication before expensive training

Useful NCCL environment variables:

env:
- name: NCCL_DEBUG
  value: "WARN"
- name: NCCL_IB_DISABLE
  value: "0"
- name: NCCL_SOCKET_IFNAME
  value: "eth0"
- name: TORCH_DISTRIBUTED_DEBUG
  value: "DETAIL"

Do not blindly copy these into every environment. Validate the correct network interface, transport, and driver behavior for your cluster.
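
Several of the checks in the table above come down to "every worker reports the same value". A minimal sketch of that comparison follows, assuming each worker has already reported its GPU model, driver, CUDA, NCCL, and MTU values as strings. The report format, the key names, and the version numbers are hypothetical.

```python
from collections import defaultdict

CHECKED_KEYS = ("gpu_model", "driver", "cuda", "nccl", "mtu")


def find_mismatches(reports: dict, keys=CHECKED_KEYS) -> dict:
    """Return {key: {value: [workers]}} for every key that differs across workers."""
    mismatches = {}
    for key in keys:
        workers_by_value = defaultdict(list)
        for worker, report in sorted(reports.items()):
            workers_by_value[str(report.get(key, "missing"))].append(worker)
        if len(workers_by_value) > 1:
            mismatches[key] = dict(workers_by_value)
    return mismatches


# Illustrative values only.
reports = {
    "worker-0": {"gpu_model": "a100", "driver": "550.54", "cuda": "12.4", "nccl": "2.21", "mtu": "9000"},
    "worker-1": {"gpu_model": "a100", "driver": "550.54", "cuda": "12.4", "nccl": "2.21", "mtu": "1500"},
}
print(find_mismatches(reports))
```

An empty result does not prove the network is healthy, so it complements the NCCL test rather than replacing it.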

Step 9: Build Observability into the Pipeline

Observability should be part of the training platform from day one.

Track three layers of metrics.

GPU metrics

Metric	Why it matters
GPU utilization	Shows whether GPUs are actually busy
GPU memory used	Helps detect OOM risk
GPU power draw	Useful for cost and throttling analysis
Temperature	Detects thermal throttling
ECC or hardware errors	Detects failing GPUs

Training metrics

Metric	Why it matters
Step time	Shows training speed
Tokens per second	Best throughput signal for LLMs
Data loader time	Detects input bottlenecks
Loss and eval metrics	Tracks model quality
Checkpoint duration	Detects slow storage
Resume success rate	Validates failure recovery

Kubernetes metrics

Metric	Why it matters
Pod restarts	Detects instability
Pending pods	Shows scheduling pressure
Evictions	Detects node pressure
Node allocatable GPUs	Confirms GPU availability
PVC latency	Detects storage bottlenecks
Queue wait time	Shows scheduling efficiency

Example Prometheus queries:

avg by (pod) (DCGM_FI_DEV_GPU_UTIL)
avg by (pod) (DCGM_FI_DEV_FB_USED)
increase(kube_pod_container_status_restarts_total[15m])
kube_pod_status_phase{phase="Pending"} == 1

Recommended alerts:

Alert	Suggested trigger
Low GPU utilization	GPU utilization below 40% for 15 minutes
Checkpoint slowdown	Checkpoint duration above expected threshold
Pod restart loop	Restart count increases repeatedly
Training pod pending	Pending for more than 10 minutes
NCCL timeout	Error pattern found in logs
Node pressure	Memory, disk, or PID pressure on GPU nodes
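
The "below 40% for 15 minutes" trigger is a sustained-condition rule. In Prometheus this is normally expressed with an alert rule's `for` clause; the sketch below shows the same logic in plain Python so that the semantics are clear. The function name and the sample format are our own.

```python
def sustained_below(samples: list, threshold: float, window_seconds: float) -> bool:
    """Return True if every sample in the trailing window is below the threshold.

    samples is a time-ordered list of (timestamp_seconds, value) pairs.
    The rule does not fire until the history covers the whole window.
    """
    if not samples:
        return False
    end = samples[-1][0]
    start = end - window_seconds
    if samples[0][0] > start:
        return False  # not enough history yet
    window = [value for timestamp, value in samples if timestamp >= start]
    return all(value < threshold for value in window)


# One GPU utilization sample per minute for 20 minutes, all at 30%.
idle = [(minute * 60, 30.0) for minute in range(21)]
print(sustained_below(idle, threshold=40.0, window_seconds=15 * 60))
```

Requiring the full window avoids paging on short dips such as checkpoint writes or evaluation pauses.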

Step 10: Optimize Dataset Throughput

Many teams add GPUs before fixing the input pipeline. That usually increases cost without improving throughput.

Common dataset bottlenecks:

Bottleneck	Fix
Slow object storage reads	Add caching or local staging
Too few data loader workers	Tune workers per GPU
Large unsharded files	Use sharded datasets
Cross-zone data access	Place data near GPU nodes
Checkpoint writes competing with dataset reads	Separate storage paths or backends

Recommended practices:

Use sharded datasets.
Warm hot data to local NVMe when possible.
Track data loader time separately from GPU compute time.
Keep checkpoint writes from blocking training.
Log samples/sec and tokens/sec for every run.

A simple rule: if GPU utilization is low, do not scale to more GPUs until the input pipeline is fixed.
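
To apply that rule, you need to know how much of each step is spent waiting for data. One lightweight approach, sketched here, is to wrap the data loader and time each `next()` call on the host. `LoaderTimer` and `wait_fraction` are hypothetical helpers, and the wrapper works with any iterable of batches.

```python
import time


class LoaderTimer:
    """Wrap an iterable of batches and record host time spent waiting for each batch."""

    def __init__(self, loader, clock=time.perf_counter):
        self.loader = loader
        self.clock = clock
        self.wait_seconds = 0.0
        self.batches = 0

    def __iter__(self):
        iterator = iter(self.loader)
        while True:
            start = self.clock()
            try:
                batch = next(iterator)
            except StopIteration:
                return
            # Only time spent inside next() counts as data wait.
            self.wait_seconds += self.clock() - start
            self.batches += 1
            yield batch


def wait_fraction(wait_seconds: float, total_seconds: float) -> float:
    """Share of wall-clock time spent waiting for input."""
    return wait_seconds / total_seconds if total_seconds > 0 else 0.0


timer = LoaderTimer([[1, 2], [3, 4]])  # stand-in for a real data loader
loop_start = time.perf_counter()
for batch in timer:
    pass  # the training step would run here
total = time.perf_counter() - loop_start
print(timer.batches, wait_fraction(timer.wait_seconds, total))
```

If the wait fraction is a large share of step time, adding GPUs is unlikely to help until the loader, storage, or CPU allocation is fixed.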

Step 11: Add Cost Controls

Multi-GPU training can become expensive quickly. Cost controls should be enforced at the platform level, not left to individual teams.

Basic cost formula:

training_cost = GPU_hour_price × GPU_count × wall_clock_hours

Example:

$3.00 per GPU-hour × 8 GPUs × 20 hours = $480
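
The same formula is useful for comparing scaling options. The sketch below reproduces the example and adds a simple what-if for more GPUs. The `scaling_efficiency` parameter is an assumption you would measure from your own runs (1.0 means perfect linear speedup), and the function names are our own.

```python
def training_cost(gpu_hour_price: float, gpu_count: int, wall_clock_hours: float) -> float:
    """training_cost = GPU_hour_price x GPU_count x wall_clock_hours"""
    return gpu_hour_price * gpu_count * wall_clock_hours


def scaled_hours(base_hours: float, base_gpus: int, new_gpus: int,
                 scaling_efficiency: float) -> float:
    """Estimated wall-clock hours after changing the GPU count."""
    return base_hours * base_gpus / (new_gpus * scaling_efficiency)


base = training_cost(3.00, 8, 20)
print(base)  # 480.0, matching the example above

# What if the same job runs on 16 GPUs at an assumed 80% scaling efficiency?
hours_16 = scaled_hours(base_hours=20, base_gpus=8, new_gpus=16, scaling_efficiency=0.8)
print(hours_16, training_cost(3.00, 16, hours_16))
```

Under these assumptions the job finishes sooner but costs more in total, which is the trade-off to weigh before adding nodes.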

Cost levers:

Lever	How it helps
ResourceQuota	Prevents teams from consuming unlimited GPUs
PriorityClass	Protects critical jobs
Queueing	Prevents partial starts and unfair usage
Checkpointing	Makes preemptible or spot capacity safer
Autoscaling	Reduces idle GPU nodes
Dataset caching	Improves utilization
Mixed precision	Reduces memory and improves speed
Right-sized CPU/memory	Prevents GPU starvation from CPU bottlenecks

Example namespace quota:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-training-quota
  namespace: team-llm
spec:
  hard:
    requests.cpu: "512"
    requests.memory: "4Ti"
    # Extended resources such as GPUs can only be limited with the requests. prefix.
    requests.nvidia.com/gpu: "32"

Example priority class:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-training
value: 100000
globalDefault: false
description: "Priority class for approved production LLM training jobs."

Use lower priority for experiments and higher priority for approved production runs.

Step 12: Use GPU Sharing Carefully

GPU sharing can improve utilization, but it is not always appropriate for LLM training.

NVIDIA GPU time-slicing allows workloads to share GPU time, but unlike MIG, time-slicing does not provide memory or fault isolation between replicas.

Mode	Best for	Avoid for
Full GPU	LLM training, fine-tuning, performance-sensitive jobs	Small dev tasks
MIG	Isolated inference, smaller training, multi-tenant workloads	Models needing full GPU memory
Time-slicing	Notebooks, CI tests, lightweight dev	Production LLM training
MPS	Some shared CUDA workloads	Strict isolation requirements

For serious multi-GPU LLM training, full GPU allocation is usually the safest default.

Step 13: Secure the Training Platform

Security controls matter because training jobs often access private datasets, model weights, credentials, and internal infrastructure.

Baseline controls:

Control	Recommendation
Namespace isolation	Separate teams and environments
RBAC	Use least-privilege service accounts
Secrets	Store credentials in Kubernetes Secrets or external secret managers
NetworkPolicy	Restrict unnecessary egress
Image scanning	Scan training images before deployment
Signed images	Enforce trusted images through admission policies
Audit logs	Track who launched jobs and with which config
Data access	Restrict dataset buckets by team and workload

Example service account:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: llm-trainer
  namespace: team-llm

Example pod usage:

spec:
  serviceAccountName: llm-trainer

Do not run training jobs with broad cluster-admin permissions.

Production Readiness Checklist

Use this checklist before declaring the pipeline production-ready.

Area	Check
GPU nodes	Dedicated GPU node pool exists
Labels and taints	GPU nodes are labeled and tainted
GPU scheduling	Pods request nvidia.com/gpu correctly
GPU validation	Smoke test passes with nvidia-smi
Container image	Image versions are pinned
Run metadata	Run ID, image digest, config hash, and dataset version are logged
Single-node training	Multi-GPU single-node run succeeds
Multi-node training	Distributed workers join successfully
NCCL readiness	Collective communication test passes
Checkpointing	Checkpoints are written and restored successfully
Observability	GPU, training, and Kubernetes metrics are visible
Alerts	Alerts exist for low utilization, restarts, pending pods, and checkpoint failures
Cost controls	Quotas, priority classes, and queueing are configured
Security	RBAC, Secrets, image scanning, and network policies are in place

Troubleshooting Multi-GPU LLM Training on Kubernetes

Symptom	Likely cause	Fix
Pod is stuck Pending	GPU quota exhausted or node affinity too strict	Check kubectl describe pod, quotas, and node labels
nvidia.com/gpu not visible	Device plugin or GPU Operator not running	Check GPU Operator pods and node capacity
Pod lands on CPU node	Missing nodeSelector, affinity, or taint/toleration	Add GPU node placement rules
nvidia-smi fails inside pod	Driver/runtime mismatch	Validate NVIDIA container runtime and driver stack
Only some distributed workers start	No gang scheduling or insufficient capacity	Use queueing or gang scheduling
NCCL timeout	Network interface, MTU, firewall, or cross-zone issue	Run NCCL test and validate network settings
GPU utilization is low	Slow dataset pipeline or CPU bottleneck	Tune dataloader, caching, CPU requests, and storage
OOM during optimizer step	Model, batch size, or optimizer state too large	Use gradient checkpointing, FSDP, ZeRO, or smaller batch
Checkpoint restore fails	Missing optimizer, scheduler, RNG, or sharding metadata	Standardize checkpoint contents and run restore drills
Training slows after scaling nodes	Network overhead exceeds compute gain	Profile communication and reduce cross-zone placement

Orchestration Options for Multi-GPU LLM Training

Option	Best when	Why it helps	Overhead
Native Kubernetes Job	Single-node or simple jobs	Easy to start	Low
PyTorch torchrun	PyTorch-native teams	Direct control over distributed launch	Low–Medium
Hugging Face Accelerate	Teams switching distributed backends	Simplifies launch configuration	Low–Medium
Kubeflow Trainer	Platform teams need repeatable distributed jobs	Provides Kubernetes-native training abstraction	Medium
Volcano	Batch scheduling and gang scheduling needs	Helps with queueing and all-or-nothing jobs	Medium–High
Kueue	Multi-team quota and queue management	Improves admission control and fairness	Medium
Horovod	Existing Horovod/HPC workflows	Supports multiple frameworks	Medium

Recommended default path:

Native Job → torchrun → Kubeflow Trainer → Kueue/Volcano for queueing and gang scheduling

Launch Your Multi-GPU LLM Training Faster with AceCloud

A production-ready multi-GPU LLM training pipeline is not just about spinning up GPUs. It is about repeatability, scheduling discipline, checkpoint resilience, observability, cost control, and secure operations.

AceCloud helps teams operationalize GPU-first AI infrastructure for training, fine-tuning, and inference workloads. Its AI/ML infrastructure page highlights multi-GPU and multi-node distributed training, checkpointing for long-running jobs, preconfigured PyTorch and Hugging Face environments, Kubernetes autoscaling, and MLOps support.

Whether you are moving from experimentation to production or standardizing GPU infrastructure for multiple teams, AceCloud can help you deploy reliable multi-GPU training environments faster.

Ready to scale beyond one GPU? Book your free consultation with our cloud GPU expert to deploy on AceCloud cloud GPUs and operationalize multi-GPU training on Kubernetes.

Frequently Asked Questions

How do I train LLMs with multiple GPUs on Kubernetes?

Enable GPUs so they are schedulable, then run distributed training using a launcher such as torchrun or Accelerate. Add scheduling rules (taints, tolerations, affinity) so jobs land on the right GPU nodes, and implement checkpointing for restart safety.

What is the best architecture for scalable LLM training?

Dedicated GPU node pools, GPU enablement (device plugin or GPU Operator), a standardized orchestrator (Jobs, Kubeflow Trainer and optionally Volcano for queues), shared checkpoint storage and end-to-end observability.

Should I use single-node or multi-node multi-GPU training?

Use single-node multi-GPU training when the model fits on one server and you want simpler operations. Use multi-node training when you need more GPUs, faster wall-clock training, or larger memory capacity than one node can provide.

What is the role of NCCL in multi-GPU training?

NCCL handles high-performance GPU communication for distributed training. Multi-node training depends heavily on stable networking, consistent drivers, compatible CUDA/NCCL versions, and correct network interface configuration.

Why is my Kubernetes GPU pod stuck Pending?

Common reasons include insufficient GPUs, namespace quota limits, missing tolerations, incorrect node affinity, unavailable GPU node pools, or partial placement issues for distributed jobs. Start with kubectl describe pod and check node allocatable GPU capacity.

Why is GPU utilization low during training?

Low GPU utilization usually means the GPUs are waiting on something else. Common causes include slow dataloaders, small batch sizes, CPU starvation, remote storage latency, unsharded datasets, or excessive checkpoint overhead.

Should I use Kubeflow Trainer or plain Kubernetes Jobs?

Use plain Kubernetes Jobs for simple single-node training. Use Kubeflow Trainer when you need repeatable distributed training, cleaner worker coordination, and a platform-native abstraction for PyTorch distributed jobs.

What is gang scheduling and why does it matter?

Gang scheduling starts a group of pods together only when enough resources are available. This matters for distributed training because partial worker startup can waste GPUs and cause jobs to hang. Kubernetes v1.35 includes alpha gang scheduling support, and tools such as Volcano or Kueue can also help with queueing and admission control.

How often should I checkpoint LLM training jobs?

Use both time-based and step-based checkpointing. For long-running jobs, a checkpoint every 30–60 minutes is a practical starting point, but the right interval depends on checkpoint size, storage speed, preemption risk, and acceptable recompute cost.

Can I share GPUs between training jobs?

You can share GPUs using mechanisms such as MIG or time-slicing, but full GPU allocation is usually better for production LLM training. Time-slicing can improve utilization for lightweight workloads, but it does not provide the same memory or fault isolation as MIG.

Carolyn Weitz

Carolyn began her cloud career at a fast-growing SaaS company, where she led the migration from on-prem infrastructure to a fully containerized, cloud-native architecture using Kubernetes. Since then, she has worked with a range of companies, from early-stage startups to global enterprises, helping them implement best practices in cloud operations, infrastructure automation, and container orchestration. Her technical expertise spans AWS, Azure, and GCP, with a focus on building scalable IaaS environments and streamlining CI/CD pipelines. Carolyn is also a frequent contributor to cloud-native open-source communities and enjoys mentoring aspiring engineers in the Kubernetes ecosystem.