Multi-GPU LLM training has become essential as teams move from small model experiments to larger fine-tuning and pretraining workloads. A single GPU often becomes a bottleneck because of memory limits, long training cycles, and slow iteration speed.
However, production-grade multi-GPU training is not just about requesting more GPUs. A reliable pipeline needs GPU-aware scheduling, reproducible containers, distributed launch coordination, checkpoint recovery, NCCL-ready networking, monitoring, cost controls, and security guardrails.
Kubernetes provides the operational foundation for this. It gives platform teams a consistent way to schedule GPU workloads, isolate teams, enforce quotas, connect with CI/CD, and standardize training jobs across environments.
In this guide, we will build a practical production pipeline for multi-GPU LLM training on Kubernetes using NVIDIA GPU Operator, PyTorch torchrun, Kubeflow Trainer, checkpoint storage, DCGM-based GPU monitoring, and scheduling policies.
Reference Architecture: What You Are Building
Use this as the target architecture for a production-ready training platform.
At a high level, the pipeline contains:
| Layer | What it does |
|---|---|
| GPU infrastructure | Dedicated GPU node pools with labels, taints, and GPU drivers |
| GPU enablement | NVIDIA GPU Operator or NVIDIA device plugin exposes GPUs to Kubernetes |
| Training runtime | PyTorch, Hugging Face Accelerate, Kubeflow Trainer, or another distributed launcher |
| Scheduling control | Taints, tolerations, affinity, queues, quotas, and gang scheduling where needed |
| Storage | Shared filesystem or object storage for datasets, checkpoints, and artifacts |
| Observability | GPU telemetry, training metrics, logs, traces, and cluster events |
| Governance | RBAC, Secrets, image scanning, namespace isolation, and cost policies |
The NVIDIA GPU Operator is useful because it can automate several GPU software components, including NVIDIA drivers, the Kubernetes device plugin, NVIDIA Container Toolkit, GPU Feature Discovery, and DCGM-based monitoring.
Prerequisites to Consider for Multi-GPU LLM Training
Before building the pipeline, align on these platform requirements.
| Requirement | Recommendation |
|---|---|
| Kubernetes cluster | Use a supported Kubernetes version with GPU node pools |
| Container runtime | Use containerd or another supported runtime configured for NVIDIA containers |
| GPU enablement | Use NVIDIA GPU Operator for production lifecycle management |
| Training framework | Start with PyTorch torchrun; standardize later on Kubeflow Trainer if needed |
| Storage | Use object storage or a shared filesystem for checkpoints |
| Networking | Validate low-latency east-west traffic before multi-node training |
| Observability | Deploy Prometheus, Grafana, DCGM Exporter, and centralized logs |
| Security | Use namespace RBAC, Secrets, image scanning, and least-privilege service accounts |
Single-Node vs Multi-Node Multi-GPU Training
Before implementing anything, decide which training pattern you actually need.
| Pattern | Use when | Trade-off |
|---|---|---|
| Single-node multi-GPU | The model fits on GPUs within one server | Simplest networking and debugging, but capped at one node's GPU count |
| Multi-node DDP | You need more GPUs than one node provides | Requires stronger network and launch coordination |
| FSDP / ZeRO | Model or optimizer state does not fit in GPU memory | More complex checkpointing and tuning |
| Tensor parallelism | Individual layers are too large for one GPU | Requires model-specific distributed strategy |
| Pipeline parallelism | Model can be split into sequential stages | Can introduce pipeline bubbles |
| Hybrid parallelism | Very large pretraining or frontier-scale workloads | Highest operational complexity |
Kubeflow Trainer supports PyTorch distributed training and can run DDP, FSDP, FSDP2, and other PyTorch-supported distributed algorithms.
A practical rollout path is:
- Validate one GPU.
- Validate one node with multiple GPUs.
- Validate multi-node distributed training.
- Add checkpoint restore testing.
- Add queueing, quotas, monitoring, and cost controls.
Step 1: Provision GPU Nodes and Isolate Them
Create a dedicated GPU node pool instead of mixing GPU and CPU workloads on the same nodes. Then label and taint GPU nodes so only approved workloads can land there.
Example labels:

    kubectl label node gpu-node-1 accelerator=nvidia
    kubectl label node gpu-node-1 gpu.model=a100
    kubectl label node gpu-node-1 workload-type=gpu-training
    kubectl label node gpu-node-1 topology.kubernetes.io/zone=zone-a

Example taint:

    kubectl taint nodes gpu-node-1 dedicated=gpu-training:NoSchedule

Training workloads must include a matching toleration:

    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu-training"
        effect: "NoSchedule"

Use node affinity to keep training pods on the intended GPU pool:

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: workload-type
                  operator: In
                  values:
                    - gpu-training
                - key: gpu.model
                  operator: In
                  values:
                    - a100

This avoids accidental placement on the wrong nodes and makes scheduling behavior easier to debug.
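To make the interaction between taints, tolerations, and required node labels concrete, here is a minimal Python sketch of the matching logic. It is a simplified model, not the scheduler's implementation: the `tolerates` and `can_schedule` helpers and the dictionary layout are hypothetical, and only hard taint effects and `In`-style label requirements are modeled.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint (simplified rules)."""
    # A toleration without an effect matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def can_schedule(pod: dict, node: dict) -> bool:
    """Check hard taints and required labels for one pod against one node."""
    for taint in node.get("taints", []):
        if taint["effect"] not in ("NoSchedule", "NoExecute"):
            continue  # PreferNoSchedule is only a soft preference
        if not any(tolerates(t, taint) for t in pod.get("tolerations", [])):
            return False
    # Every required label must be present with one of the allowed values.
    return all(node["labels"].get(key) in allowed
               for key, allowed in pod.get("required_labels", {}).items())


gpu_node = {
    "labels": {"workload-type": "gpu-training", "gpu.model": "a100"},
    "taints": [{"key": "dedicated", "value": "gpu-training", "effect": "NoSchedule"}],
}
cpu_node = {"labels": {"workload-type": "general"}, "taints": []}

training_pod = {
    "tolerations": [{"key": "dedicated", "operator": "Equal",
                     "value": "gpu-training", "effect": "NoSchedule"}],
    "required_labels": {"workload-type": ["gpu-training"], "gpu.model": ["a100"]},
}
web_pod = {"tolerations": [], "required_labels": {}}

print(can_schedule(training_pod, gpu_node))  # True
print(can_schedule(training_pod, cpu_node))  # False: required labels do not match
print(can_schedule(web_pod, gpu_node))       # False: taint is not tolerated
```

The taint keeps unrelated pods off GPU nodes, while the label requirement keeps training pods off everything else. Both are needed, because a toleration alone only permits placement and does not force it.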
Step 2: Enable GPUs for Kubernetes Scheduling
Kubernetes uses device plugins to expose hardware resources such as GPUs to the kubelet. After the GPU plugin is installed, nodes advertise resources such as nvidia.com/gpu, and pods can request them in their container limits.
For production, prefer NVIDIA GPU Operator because it manages the GPU software lifecycle more completely. For a minimal setup, you can use the NVIDIA device plugin.
Validate GPU capacity

    kubectl describe node gpu-node-1 | grep -A10 "Capacity"
    kubectl describe node gpu-node-1 | grep -A10 "Allocatable"

You should see something similar to:

    nvidia.com/gpu: 8

Run a GPU smoke test

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-smoke-test
    spec:
      restartPolicy: Never
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "gpu-training"
          effect: "NoSchedule"
      nodeSelector:
        workload-type: gpu-training
      containers:
        - name: cuda
          image: nvidia/cuda:12.4.1-base-ubuntu22.04
          command: ["bash", "-lc", "nvidia-smi && sleep 10"]
          resources:
            limits:
              nvidia.com/gpu: 1

Apply and inspect logs:

    kubectl apply -f gpu-smoke-test.yaml
    kubectl logs gpu-smoke-test

NOTE: Kubernetes GPU resources should be specified in limits. If requests are also specified, the request and limit values must match.
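The request/limit rule in the note is easy to get wrong in hand-written manifests, so a small pre-submit check can help. The following is a sketch only: `gpu_resource_errors` is a hypothetical helper, and it assumes the container's `resources` section has already been parsed into a Python dictionary.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_resource_errors(resources: dict) -> list[str]:
    """Return problems with a container's GPU settings (empty list if fine)."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    if GPU_RESOURCE not in requests:
        return []  # limits only (or no GPU at all) is acceptable
    if GPU_RESOURCE not in limits:
        return ["GPU is set in requests but missing from limits"]
    if int(requests[GPU_RESOURCE]) != int(limits[GPU_RESOURCE]):
        return ["GPU request and limit must be equal"]
    return []


limits_only = {"limits": {GPU_RESOURCE: 8}}
mismatch = {"requests": {GPU_RESOURCE: 4}, "limits": {GPU_RESOURCE: 8}}
requests_only = {"requests": {GPU_RESOURCE: 8}}

print(gpu_resource_errors(limits_only))    # []
print(gpu_resource_errors(mismatch))       # ['GPU request and limit must be equal']
print(gpu_resource_errors(requests_only))  # ['GPU is set in requests but missing from limits']
```

A check like this could run in CI against rendered manifests, before the API server rejects them.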
Step 3: Containerize Training for Reproducibility
Your training image should be treated as an immutable release artifact.
A basic Dockerfile can look like this:

    FROM pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime
    WORKDIR /workspace
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt
    COPY train.py .
    COPY configs/ ./configs/
    ENV PYTHONUNBUFFERED=1
    ENV TOKENIZERS_PARALLELISM=false
    CMD ["python", "train.py"]

Recommended image practices:
| Practice | Why it matters |
|---|---|
| Pin PyTorch, CUDA, Transformers, Accelerate, and NCCL-compatible versions | Reduces runtime drift |
| Tag images with git SHA | Makes every run traceable |
| Log image digest per run | Supports auditability |
| Externalize hyperparameters | Prevents hidden config drift |
| Store training config in Git | Enables review and rollback |
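The traceability practices in the table can be sketched as a small run record. This is an illustrative sketch, not a prescribed format: `config_hash` and `run_record` are hypothetical helpers, and the commit and digest values are placeholders. The hash is computed over a canonical JSON form so that two configs with the same settings always produce the same value.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Hash a config so that key order does not change the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_record(run_id: str, git_commit: str, image_digest: str, config: dict) -> str:
    """Build one JSON log line that ties a run to its code, image, and config."""
    return json.dumps({
        "run_id": run_id,
        "git_commit": git_commit,
        "image_digest": image_digest,
        "config_hash": config_hash(config),
    }, sort_keys=True)


# Hypothetical values for illustration only.
config_a = {"lr": 3e-4, "batch_size": 32}
config_b = {"batch_size": 32, "lr": 3e-4}  # same settings, different key order

print(config_hash(config_a) == config_hash(config_b))  # True
print(run_record("run-2026-01-20-a", "abc1234", "sha256:<digest>", config_a))
```

Emitting this record once at startup makes it possible to answer "which image and config produced this checkpoint?" from logs alone.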
Example image tag:

    docker build -t registry.example.com/llm-trainer:${GIT_SHA} .
    docker push registry.example.com/llm-trainer:${GIT_SHA}

For every training run, log:
- run_id
- git_commit
- image_digest
- dataset_version
- config_hash
- world_size
- checkpoint_path
Step 4: Run a Single-Node Multi-GPU Training Job
Start with single-node multi-GPU training before moving to multi-node. It is easier to debug and avoids distributed networking issues.
Example Kubernetes Job using PyTorch torchrun:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: llm-train-single-node
    spec:
      backoffLimit: 0
      template:
        metadata:
          labels:
            app: llm-training
            run-id: run-2026-01-20-a
        spec:
          restartPolicy: Never
          tolerations:
            - key: "dedicated"
              operator: "Equal"
              value: "gpu-training"
              effect: "NoSchedule"
          nodeSelector:
            workload-type: gpu-training
            gpu.model: a100
          containers:
            - name: trainer
              image: registry.example.com/llm-trainer:REPLACE_WITH_GIT_SHA
              command:
                - bash
                - -lc
                - |
                  torchrun \
                    --standalone \
                    --nproc_per_node=8 \
                    train.py \
                    --config configs/train.yaml \
                    --output_dir /checkpoints/run-2026-01-20-a
              env:
                - name: NCCL_DEBUG
                  value: "WARN"
                - name: CUDA_DEVICE_MAX_CONNECTIONS
                  value: "1"
              resources:
                requests:
                  cpu: "32"
                  memory: "256Gi"
                limits:
                  nvidia.com/gpu: 8
                  memory: "256Gi"
              volumeMounts:
                - name: checkpoints
                  mountPath: /checkpoints
          volumes:
            - name: checkpoints
              persistentVolumeClaim:
                claimName: llm-checkpoints-pvc

This pattern is useful when one node has enough GPU memory and compute for the workload. For example, one 8-GPU node is often easier to operate than an 8-GPU job spread across several nodes.
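Inside the container, torchrun starts one process per GPU and exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each of them. The sketch below shows one way a training script might read those variables and split sample indices across workers. `DistEnv`, `read_dist_env`, and `shard_indices` are hypothetical names, and the strided split is a simplified stand-in for what a distributed sampler does.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int
    local_rank: int
    world_size: int


def read_dist_env(environ=os.environ) -> DistEnv:
    """Read the variables torchrun exports, defaulting to a single process."""
    return DistEnv(
        rank=int(environ.get("RANK", "0")),
        local_rank=int(environ.get("LOCAL_RANK", "0")),
        world_size=int(environ.get("WORLD_SIZE", "1")),
    )


def shard_indices(num_samples: int, env: DistEnv) -> list[int]:
    """Give each rank a strided, non-overlapping slice of the sample indices."""
    return list(range(env.rank, num_samples, env.world_size))


# Simulate what worker 3 of 8 would see under `torchrun --nproc_per_node=8`.
env = read_dist_env({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "8"})
print(env)                     # DistEnv(rank=3, local_rank=3, world_size=8)
print(shard_indices(20, env))  # [3, 11, 19]
```

Defaulting to a single process keeps the same script runnable on a laptop without torchrun, which helps when debugging outside the cluster.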
Step 5: Move to Multi-Node Training
Move to multi-node training only after single-node training is stable. Multi-node jobs add coordination, networking, rank assignment, failure handling, and NCCL tuning.
For production teams, use a controller such as Kubeflow Trainer instead of hand-rolling worker pods. Kubeflow Trainer automatically configures PyTorch distributed environment variables such as WORLD_SIZE, RANK, and LOCAL_RANK, and it uses torchrun for PyTorch distributed jobs.
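As an illustration of what those variables look like across a job, the sketch below enumerates the conventional contiguous layout for 2 nodes with 8 GPUs each. It is a model of the convention, not of any launcher's internals: `rank_layout` is a hypothetical helper, and which physical node receives which node rank is decided by the launcher.

```python
def rank_layout(num_nodes: int, gpus_per_node: int) -> list[dict]:
    """List the rank variables for every worker process in the job."""
    world_size = num_nodes * gpus_per_node
    layout = []
    for node_rank in range(num_nodes):
        for local_rank in range(gpus_per_node):
            layout.append({
                "node_rank": node_rank,
                "LOCAL_RANK": local_rank,  # GPU index within the node
                "RANK": node_rank * gpus_per_node + local_rank,  # global index
                "WORLD_SIZE": world_size,
            })
    return layout


layout = rank_layout(num_nodes=2, gpus_per_node=8)
print(len(layout))  # 16
print(layout[8])    # first process on the second node
# {'node_rank': 1, 'LOCAL_RANK': 0, 'RANK': 8, 'WORLD_SIZE': 16}
```

`LOCAL_RANK` selects the GPU on a node, while `RANK` identifies the process globally, for example to decide which worker writes checkpoints.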
Example Kubeflow Trainer-style Python workflow:

    from kubeflow.trainer import TrainerClient, CustomTrainer


    def train_func():
        # Imports stay inside the function, which is what runs on each training node.
        import os
        import torch
        import torch.distributed as dist

        # Use NCCL on GPU nodes and fall back to Gloo for CPU-only test runs.
        use_cuda = torch.cuda.is_available()
        dist.init_process_group(backend="nccl" if use_cuda else "gloo")

        # Bind this process to its GPU; skip on CPU-only nodes, where set_device would fail.
        local_rank = int(os.getenv("LOCAL_RANK", "0"))
        if use_cuda:
            torch.cuda.set_device(local_rank)

        print({
            "world_size": dist.get_world_size(),
            "rank": dist.get_rank(),
            "local_rank": local_rank,
        })

        # Import and call your actual training loop here.
        # train_model(config_path="/workspace/configs/train.yaml")

        dist.destroy_process_group()


    client = TrainerClient()
    client.train(
        trainer=CustomTrainer(
            func=train_func,
            resources_per_node={
                "gpu": 8,
                "cpu": 32,
                "memory": "256Gi",
            },
            num_nodes=2,
        )
    )

Use multi-node training when:
| Signal | Interpretation |
|---|---|
| One node cannot fit the model | Use FSDP, ZeRO, tensor parallelism, or larger GPU memory |
| Single-node training is too slow | Scale workers across nodes |
| GPU utilization is high but wall-clock time is too long | Multi-node may help |
| GPU utilization is low | Fix data pipeline before adding nodes |
| NCCL tests are unstable | Do not scale yet |
Step 6: Add Gang Scheduling or Queueing for Large Jobs
Distributed training jobs often require all workers to start together. If only half the workers start, the job can hang while still holding expensive GPUs.
This is where gang scheduling or queue-based admission helps. Kubernetes v1.35 includes alpha gang scheduling support that schedules a group of pods on an all-or-nothing basis when the configured minimum pod count can be satisfied.
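The all-or-nothing idea can be sketched in a few lines of Python. This is a toy first-fit model under stated assumptions, not how any particular scheduler is implemented: `gang_place` is a hypothetical function, nodes are described only by their free GPU count, and every pod in the gang needs the same number of GPUs.

```python
def gang_place(free_gpus: dict, pods: int, gpus_per_pod: int):
    """Return one node name per pod, or None if the whole gang cannot fit."""
    remaining = dict(free_gpus)  # work on a copy so a failed attempt changes nothing
    placement = []
    for _ in range(pods):
        node = next((n for n, free in remaining.items() if free >= gpus_per_pod), None)
        if node is None:
            return None  # all-or-nothing: never keep a partial placement
        remaining[node] -= gpus_per_pod
        placement.append(node)
    return placement


cluster = {"gpu-node-1": 8, "gpu-node-2": 8, "gpu-node-3": 4}

print(gang_place(cluster, pods=2, gpus_per_pod=8))  # ['gpu-node-1', 'gpu-node-2']
print(gang_place(cluster, pods=3, gpus_per_pod=8))  # None
```

Without the all-or-nothing rule, the second request would start two of its three workers and hold 16 GPUs while waiting for a third node that may never free up.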
Use gang scheduling, Volcano, Kueue, or another batch-aware scheduler when:
| Problem | Why it matters |
|---|---|
| Partial worker placement | Wastes GPUs and blocks other jobs |
| Multiple teams submit large jobs | Requires fairness and queueing |
| Cluster has mixed GPU types | Needs policy-based placement |
| Spot capacity is used | Requires retry and checkpoint-aware scheduling |
| Long-running pretraining jobs | Needs predictable admission and preemption behavior |
For most teams, this rollout works well:
- Start with native Kubernetes Jobs.
- Move distributed jobs to Kubeflow Trainer.
- Add Kueue, Volcano, or gang scheduling when large multi-pod jobs create contention.
- Add priority classes and quotas for team-level fairness.
Step 7: Configure Checkpoint Storage and Restore
Checkpointing is not optional for production LLM training. Node failures, preemptions, networking issues, and job evictions are normal at scale.
A production checkpoint should include:
| Checkpoint item | Why it matters |
|---|---|
| Model weights | Required to resume training |
| Optimizer state | Prevents training instability after resume |
| Scheduler state | Keeps learning rate progression correct |
| Global step | Avoids duplicate or skipped steps |
| RNG state | Improves reproducibility |
| Tokenizer and model config | Prevents mismatch during restore |
| Parallelism strategy | Needed for FSDP, ZeRO, or sharded checkpoints |
| Dataset cursor or consumed samples | Prevents data replay errors |
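One way to enforce this list is to validate the checkpoint contents on both save and restore, and to write the file atomically so a crash never leaves a half-written checkpoint behind. The sketch below uses JSON and placeholder values as a stand-in for real tensors. The key names and the `save_checkpoint` / `load_checkpoint` helpers are hypothetical, and the rename-based atomicity it relies on holds on a local POSIX filesystem, which should be verified separately for network or object storage.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"model", "optimizer", "scheduler", "global_step",
                 "rng_state", "consumed_samples"}


def _check(state: dict) -> None:
    missing = REQUIRED_KEYS - state.keys()
    if missing:
        raise ValueError(f"checkpoint is missing: {sorted(missing)}")


def save_checkpoint(state: dict, path: str) -> None:
    """Write to a temp file, then rename, so readers never see a partial file."""
    _check(state)
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path: str) -> dict:
    """Load a checkpoint and refuse to resume from an incomplete one."""
    with open(path) as handle:
        state = json.load(handle)
    _check(state)
    return state


# Placeholder values stand in for real tensors and optimizer state.
state = {"model": {"w": [0.1, 0.2]}, "optimizer": {"step": 100},
         "scheduler": {"last_lr": 3e-4}, "global_step": 100,
         "rng_state": [42], "consumed_samples": 3200}

with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "step-100.json")
    save_checkpoint(state, ckpt)
    restored = load_checkpoint(ckpt)
    print(restored["global_step"], restored["consumed_samples"])  # 100 3200
```

Failing loudly on a missing key at save time is cheaper than discovering the gap during a restore after a node failure.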
Example PVC:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-checkpoints-pvc
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 5Ti
      storageClassName: fast-shared-storage

Recommended checkpoint policy:
| Policy | Recommendation |
|---|---|
| Time-based | Save every 30–60 minutes |
| Step-based | Save every N global steps |
| Milestone-based | Save at evaluation milestones |
| Retention | Keep latest, best, and last-known-good checkpoints |
| Restore drill | Test restore before production launch |
A checkpoint is only useful if it can be restored. Add a staging restore drill to every major training pipeline release.
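The time-based, step-based, and retention policies above can be combined in a couple of small functions. This is a sketch under assumed defaults: the function names are hypothetical, 1,000 steps and 1,800 seconds (30 minutes) are example values, and "best" is whatever step your evaluation marks as best.

```python
def should_checkpoint(step, now, last_step, last_time,
                      every_steps=1000, every_seconds=1800):
    """Save when either the step budget or the time budget has been used up."""
    return (step - last_step >= every_steps) or (now - last_time >= every_seconds)


def checkpoints_to_delete(steps, best_step, keep_latest=2):
    """Keep the newest `keep_latest` checkpoints plus the best one."""
    ordered = sorted(steps)
    newest = ordered[-keep_latest:] if keep_latest > 0 else []
    keep = set(newest) | {best_step}
    return [s for s in ordered if s not in keep]


print(should_checkpoint(step=1500, now=600, last_step=1000, last_time=0))   # False
print(should_checkpoint(step=2000, now=600, last_step=1000, last_time=0))   # True
print(should_checkpoint(step=1500, now=1800, last_step=1000, last_time=0))  # True
print(checkpoints_to_delete([1000, 2000, 3000, 4000], best_step=2000))      # [1000]
```

Combining both triggers bounds the lost work in wall-clock terms even when step time varies, for example after a slowdown in the data pipeline.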
Step 8: Validate Networking and NCCL Readiness
Multi-node training often fails at the network layer before it fails at the model layer.
Before scaling, validate:
| Check | Why it matters |
|---|---|
| Same GPU model across workers | Reduces performance variance |
| Same driver, CUDA, and NCCL-compatible stack | Prevents hard-to-debug runtime issues |
| Low-latency east-west traffic | Required for collective communication |
| MTU consistency | Prevents packet fragmentation and throughput drops |
| Zone placement | Cross-zone collectives can be slower |
| NCCL test | Confirms collective communication before expensive training |
Useful NCCL environment variables:

    env:
      - name: NCCL_DEBUG
        value: "WARN"
      - name: NCCL_IB_DISABLE
        value: "0"
      - name: NCCL_SOCKET_IFNAME
        value: "eth0"
      - name: TORCH_DISTRIBUTED_DEBUG
        value: "DETAIL"

Do not blindly copy these into every environment. Validate the correct network interface, transport, and driver behavior for your cluster.
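A cheap preflight check is to confirm that the interface named in `NCCL_SOCKET_IFNAME` actually exists inside the pod before training starts. The sketch below uses only the Python standard library. The helper names are hypothetical, and it handles only a single exact interface name, whereas NCCL itself also accepts prefixes and lists.

```python
import os
import socket


def socket_ifname_exists(ifname: str) -> bool:
    """Return True if a network interface with this exact name is present."""
    return ifname in {name for _, name in socket.if_nameindex()}


def check_nccl_socket_ifname(environ=os.environ) -> str:
    """Describe whether the configured interface can be found in this pod."""
    ifname = environ.get("NCCL_SOCKET_IFNAME")
    if not ifname:
        return "NCCL_SOCKET_IFNAME is unset; NCCL will pick an interface itself"
    if socket_ifname_exists(ifname):
        return f"interface {ifname!r} found"
    available = sorted(name for _, name in socket.if_nameindex())
    return f"interface {ifname!r} not found; available: {available}"


print(check_nccl_socket_ifname())
print(check_nccl_socket_ifname({"NCCL_SOCKET_IFNAME": "no-such-if0"}))
```

Running this as the first step of the training entrypoint turns a confusing NCCL timeout into an immediate, readable message.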
Step 9: Build Observability into the Pipeline
Observability should be part of the training platform from day one.
Track three layers of metrics.
GPU metrics
| Metric | Why it matters |
|---|---|
| GPU utilization | Shows whether GPUs are actually busy |
| GPU memory used | Helps detect OOM risk |
| GPU power draw | Useful for cost and throttling analysis |
| Temperature | Detects thermal throttling |
| ECC or hardware errors | Detects failing GPUs |
Training metrics
| Metric | Why it matters |
|---|---|
| Step time | Shows training speed |
| Tokens per second | Best throughput signal for LLMs |
| Data loader time | Detects input bottlenecks |
| Loss and eval metrics | Tracks model quality |
| Checkpoint duration | Detects slow storage |
| Resume success rate | Validates failure recovery |
Kubernetes metrics
| Metric | Why it matters |
|---|---|
| Pod restarts | Detects instability |
| Pending pods | Shows scheduling pressure |
| Evictions | Detects node pressure |
| Node allocatable GPUs | Confirms GPU availability |
| PVC latency | Detects storage bottlenecks |
| Queue wait time | Shows scheduling efficiency |
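Example Prometheus queries:

    avg by (pod) (DCGM_FI_DEV_GPU_UTIL)
    avg by (pod) (DCGM_FI_DEV_FB_USED)
    increase(kube_pod_container_status_restarts_total[15m])
    kube_pod_status_phase{phase="Pending"} == 1

kube_pod_status_phase reports a 0 or 1 value for each phase, so the == 1 filter keeps only pods that are currently Pending.

Recommended alerts: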
| Alert | Suggested trigger |
|---|---|
| Low GPU utilization | GPU utilization below 40% for 15 minutes |
| Checkpoint slowdown | Checkpoint duration above expected threshold |
| Pod restart loop | Restart count increases repeatedly |
| Training pod pending | Pending for more than 10 minutes |
| NCCL timeout | Error pattern found in logs |
| Node pressure | Memory, disk, or PID pressure on GPU nodes |
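The "below 40% for 15 minutes" trigger means every sample in the window must be low, not just the average. A minimal Python sketch of that evaluation follows. `low_utilization_alert` is a hypothetical function operating on already-collected samples, and in practice this logic would usually live in an alerting rule rather than in application code.

```python
def low_utilization_alert(samples, threshold=40.0, window_seconds=900):
    """samples: list of (timestamp_seconds, gpu_util_percent), oldest first.

    Fires only when the samples span the full window and every sample
    inside the window is below the threshold.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    window = [(t, u) for t, u in samples if t >= latest - window_seconds]
    covers_window = samples[0][0] <= latest - window_seconds
    return covers_window and all(u < threshold for _, u in window)


# One sample per minute.
idle = [(60 * i, 25.0) for i in range(21)]
spiky = [(60 * i, 25.0 if i != 15 else 90.0) for i in range(21)]
short = [(60 * i, 25.0) for i in range(5)]

print(low_utilization_alert(idle))   # True
print(low_utilization_alert(spiky))  # False: one busy sample inside the window
print(low_utilization_alert(short))  # False: less than 15 minutes of data
```

Requiring a sustained condition avoids paging on short dips such as evaluation passes or checkpoint writes.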
Step 10: Optimize Dataset Throughput
Many teams add GPUs before fixing the input pipeline. That usually increases cost without improving throughput.
Common dataset bottlenecks:
| Bottleneck | Fix |
|---|---|
| Slow object storage reads | Add caching or local staging |
| Too few data loader workers | Tune workers per GPU |
| Large unsharded files | Use sharded datasets |
| Cross-zone data access | Place data near GPU nodes |
| Checkpoint writes competing with dataset reads | Separate storage paths or backends |
Recommended practices:
- Use sharded datasets.
- Warm hot data to local NVMe when possible.
- Track data loader time separately from GPU compute time.
- Keep checkpoint writes from blocking training.
- Log samples/sec and tokens/sec for every run.
A simple rule: if GPU utilization is low, do not scale to more GPUs until the input pipeline is fixed.
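To apply that rule you need two numbers per step: throughput and the share of time spent waiting on data. The sketch below shows the arithmetic with hypothetical measurements. The function names are illustrative, and the timings are assumed to come from timers placed around the data loader and the full training step.

```python
def tokens_per_second(global_batch_size, seq_len, step_seconds):
    """Tokens processed per second across all GPUs for one optimizer step."""
    return global_batch_size * seq_len / step_seconds


def data_wait_fraction(data_seconds, step_seconds):
    """Share of each step spent waiting for the next batch."""
    return data_seconds / step_seconds


# Hypothetical measurements from one training step.
step_seconds = 2.0
data_seconds = 0.8

print(tokens_per_second(global_batch_size=256, seq_len=2048, step_seconds=step_seconds))
# 262144.0
print(data_wait_fraction(data_seconds, step_seconds))  # 0.4
```

In this example 40% of every step is spent waiting for input, so adding GPUs would mostly add idle time until the loader is fixed.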
Step 11: Add Cost Controls
Multi-GPU training can become expensive quickly. Cost controls should be enforced at the platform level, not left to individual teams.
Basic cost formula:

    training_cost = GPU_hour_price × GPU_count × wall_clock_hours

Example:

    $3.00 per GPU-hour × 8 GPUs × 20 hours = $480

Cost levers:
| Lever | How it helps |
|---|---|
| ResourceQuota | Prevents teams from consuming unlimited GPUs |
| PriorityClass | Protects critical jobs |
| Queueing | Prevents partial starts and unfair usage |
| Checkpointing | Makes preemptible or spot capacity safer |
| Autoscaling | Reduces idle GPU nodes |
| Dataset caching | Improves utilization |
| Mixed precision | Reduces memory and improves speed |
| Right-sized CPU/memory | Prevents GPU starvation from CPU bottlenecks |
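The basic cost formula is simple enough to script, which makes it easy to compare scenarios before launching a job. The sketch below reproduces the $480 example and then adds one assumption that is not part of the formula above: a `efficiency` factor for how well a job speeds up when GPUs are added, where 1.0 means perfect linear scaling. Both function names are hypothetical.

```python
def training_cost(gpu_hour_price, gpu_count, wall_clock_hours):
    """training_cost = GPU_hour_price x GPU_count x wall_clock_hours."""
    return gpu_hour_price * gpu_count * wall_clock_hours


def cost_after_scaling(gpu_hour_price, base_gpus, base_hours, new_gpus, efficiency):
    """Cost after scaling from base_gpus to new_gpus at a given efficiency.

    efficiency=1.0 is perfect linear speedup; lower values mean the extra
    GPUs shorten the run by less than their share.
    """
    speedup = (new_gpus / base_gpus) * efficiency
    return training_cost(gpu_hour_price, new_gpus, base_hours / speedup)


print(training_cost(3.00, 8, 20))                          # 480.0
print(round(cost_after_scaling(3.00, 8, 20, 16, 1.0), 2))  # 480.0
print(round(cost_after_scaling(3.00, 8, 20, 16, 0.8), 2))  # 600.0
```

With perfect scaling, doubling the GPUs halves the runtime and leaves the cost unchanged. At 80% efficiency the same job finishes sooner but costs 25% more, which is the trade-off to weigh against deadlines.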
Example namespace quota:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: gpu-training-quota
      namespace: team-llm
    spec:
      hard:
        requests.cpu: "512"
        requests.memory: "4Ti"
        requests.nvidia.com/gpu: "32"

For extended resources such as nvidia.com/gpu, a ResourceQuota accepts only the requests. prefix.

Example priority class:

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: production-training
    value: 100000
    globalDefault: false
    description: "Priority class for approved production LLM training jobs."

Use lower priority for experiments and higher priority for approved production runs.
Step 12: Use GPU Sharing Carefully
GPU sharing can improve utilization, but it is not always appropriate for LLM training.
NVIDIA GPU time-slicing allows workloads to share GPU time, but unlike MIG, time-slicing does not provide memory or fault isolation between replicas.
| Mode | Best for | Avoid for |
|---|---|---|
| Full GPU | LLM training, fine-tuning, performance-sensitive jobs | Small dev tasks |
| MIG | Isolated inference, smaller training, multi-tenant workloads | Models needing full GPU memory |
| Time-slicing | Notebooks, CI tests, lightweight dev | Production LLM training |
| MPS | Some shared CUDA workloads | Strict isolation requirements |
For serious multi-GPU LLM training, full GPU allocation is usually the safest default.
Step 13: Secure the Training Platform
Security controls matter because training jobs often access private datasets, model weights, credentials, and internal infrastructure.
Baseline controls:
| Control | Recommendation |
|---|---|
| Namespace isolation | Separate teams and environments |
| RBAC | Use least-privilege service accounts |
| Secrets | Store credentials in Kubernetes Secrets or external secret managers |
| NetworkPolicy | Restrict unnecessary egress |
| Image scanning | Scan training images before deployment |
| Signed images | Enforce trusted images through admission policies |
| Audit logs | Track who launched jobs and with which config |
| Data access | Restrict dataset buckets by team and workload |
Example service account:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: llm-trainer
      namespace: team-llm

Example pod usage:

    spec:
      serviceAccountName: llm-trainer

Do not run training jobs with broad cluster-admin permissions.
Production Readiness Checklist
Use this checklist before declaring the pipeline production-ready.
| Area | Check |
|---|---|
| GPU nodes | Dedicated GPU node pool exists |
| Labels and taints | GPU nodes are labeled and tainted |
| GPU scheduling | Pods request nvidia.com/gpu correctly |
| GPU validation | Smoke test passes with nvidia-smi |
| Container image | Image versions are pinned |
| Run metadata | Run ID, image digest, config hash, and dataset version are logged |
| Single-node training | Multi-GPU single-node run succeeds |
| Multi-node training | Distributed workers join successfully |
| NCCL readiness | Collective communication test passes |
| Checkpointing | Checkpoints are written and restored successfully |
| Observability | GPU, training, and Kubernetes metrics are visible |
| Alerts | Alerts exist for low utilization, restarts, pending pods, and checkpoint failures |
| Cost controls | Quotas, priority classes, and queueing are configured |
| Security | RBAC, Secrets, image scanning, and network policies are in place |
Troubleshooting Multi-GPU LLM Training on Kubernetes
| Symptom | Likely cause | Fix |
|---|---|---|
| Pod is stuck Pending | GPU quota exhausted or node affinity too strict | Check kubectl describe pod, quotas, and node labels |
| nvidia.com/gpu not visible | Device plugin or GPU Operator not running | Check GPU Operator pods and node capacity |
| Pod lands on CPU node | Missing nodeSelector, affinity, or taint/toleration | Add GPU node placement rules |
| nvidia-smi fails inside pod | Driver/runtime mismatch | Validate NVIDIA container runtime and driver stack |
| Only some distributed workers start | No gang scheduling or insufficient capacity | Use queueing or gang scheduling |
| NCCL timeout | Network interface, MTU, firewall, or cross-zone issue | Run NCCL test and validate network settings |
| GPU utilization is low | Slow dataset pipeline or CPU bottleneck | Tune dataloader, caching, CPU requests, and storage |
| OOM during optimizer step | Model, batch size, or optimizer state too large | Use gradient checkpointing, FSDP, ZeRO, or smaller batch |
| Checkpoint restore fails | Missing optimizer, scheduler, RNG, or sharding metadata | Standardize checkpoint contents and run restore drills |
| Training slows after scaling nodes | Network overhead exceeds compute gain | Profile communication and reduce cross-zone placement |
Orchestration Options for Multi-GPU LLM Training
| Option | Best when | Why it helps | Overhead |
|---|---|---|---|
| Native Kubernetes Job | Single-node or simple jobs | Easy to start | Low |
| PyTorch torchrun | PyTorch-native teams | Direct control over distributed launch | Low–Medium |
| Hugging Face Accelerate | Teams switching distributed backends | Simplifies launch configuration | Low–Medium |
| Kubeflow Trainer | Platform teams need repeatable distributed jobs | Provides Kubernetes-native training abstraction | Medium |
| Volcano | Batch scheduling and gang scheduling needs | Helps with queueing and all-or-nothing jobs | Medium–High |
| Kueue | Multi-team quota and queue management | Improves admission control and fairness | Medium |
| Horovod | Existing Horovod/HPC workflows | Supports multiple frameworks | Medium |
Recommended default path:
Native Job → torchrun → Kubeflow Trainer → Kueue/Volcano for queueing and gang scheduling
Launch Your Multi-GPU LLM Training Faster with AceCloud
A production-ready multi-GPU LLM training pipeline is not just about spinning up GPUs. It is about repeatability, scheduling discipline, checkpoint resilience, observability, cost control, and secure operations.
AceCloud helps teams operationalize GPU-first AI infrastructure for training, fine-tuning, and inference workloads. Its AI/ML infrastructure page highlights multi-GPU and multi-node distributed training, checkpointing for long-running jobs, preconfigured PyTorch and Hugging Face environments, Kubernetes autoscaling, and MLOps support.
Whether you are moving from experimentation to production or standardizing GPU infrastructure for multiple teams, AceCloud can help you deploy reliable multi-GPU training environments faster.
Frequently Asked Questions
How do you run multi-GPU LLM training on Kubernetes?
Enable GPUs so they are schedulable, then run distributed training using a launcher like torchrun or Accelerate. Add scheduling rules (taints, tolerations, affinity) so jobs land on the right GPU nodes, and implement checkpointing for restart safety.
What does a production multi-GPU training pipeline need?
It needs dedicated GPU node pools, GPU enablement (device plugin or GPU Operator), a standardized orchestrator (Jobs, Kubeflow Trainer, and optionally Volcano for queues), shared checkpoint storage, and end-to-end observability.
When should you use single-node versus multi-node training?
Use single-node multi-GPU training when the model fits on one server and you want simpler operations. Use multi-node training when you need more GPUs, faster wall-clock training, or larger memory capacity than one node can provide.
What role does NCCL play in multi-node training?
NCCL handles high-performance GPU communication for distributed training. Multi-node training depends heavily on stable networking, consistent drivers, compatible CUDA/NCCL versions, and correct network interface configuration.
Why is a GPU training pod stuck in Pending?
Common reasons include insufficient GPUs, namespace quota limits, missing tolerations, incorrect node affinity, unavailable GPU node pools, or partial placement issues for distributed jobs. Start with kubectl describe pod and check node allocatable GPU capacity.
Why is GPU utilization low during training?
Low GPU utilization usually means the GPUs are waiting on something else. Common causes include slow dataloaders, small batch sizes, CPU starvation, remote storage latency, unsharded datasets, or excessive checkpoint overhead.
When should you use Kubeflow Trainer instead of plain Kubernetes Jobs?
Use plain Kubernetes Jobs for simple single-node training. Use Kubeflow Trainer when you need repeatable distributed training, cleaner worker coordination, and a platform-native abstraction for PyTorch distributed jobs.
What is gang scheduling, and why does it matter for distributed training?
Gang scheduling starts a group of pods together only when enough resources are available. This matters for distributed training because partial worker startup can waste GPUs and cause jobs to hang. Kubernetes v1.35 includes alpha gang scheduling support, and tools such as Volcano or Kueue can also help with queueing and admission control.
How often should you checkpoint?
Use both time-based and step-based checkpointing. For long-running jobs, a checkpoint every 30–60 minutes is a practical starting point, but the right interval depends on checkpoint size, storage speed, preemption risk, and acceptable recompute cost.
Can GPUs be shared for LLM training?
You can share GPUs using mechanisms such as MIG or time-slicing, but full GPU allocation is usually better for production LLM training. Time-slicing can improve utilization for lightweight workloads, but it does not provide the same memory or fault isolation as MIG.