
Fix GPU Training Bottlenecks with PyTorch Profiler and Nsight Compute

Jason Karlin
Last Updated: Jan 2, 2026

Nsight Compute and PyTorch Profiler are the fastest way to turn a sluggish GPU training run into a measurable, fixable system. The goal is not higher utilization in a dashboard, but steadier step time and lower cost to convergence.

  • Start at the macro level: PyTorch Profiler exposes pipeline stalls across the dataloader, CPU work, autograd spans and host-to-device transfers.
  • Then zoom in: Nsight Compute explains the hot CUDA kernels with metrics that reveal memory bandwidth limits, warp stalls, occupancy constraints and Tensor Core usage.

With transformers scaling across clusters of H100 and Blackwell-class GPUs, small gaps become expensive quickly. This tiered workflow replaces trial-and-error tuning with evidence, letting you prioritize the fixes that move throughput and time-to-train without rewriting your entire stack or model.

PyTorch’s reach is already mainstream: it leads model training with 63% adoption in a Linux Foundation survey. That makes this workflow worth learning; it is built on tools and practices reusable across modern training stacks, not a one-off tuning trick.

What are GPU Training Bottlenecks?

A GPU bottleneck is often not a single thing. It is a mismatch between how fast your pipeline can feed work and how fast the GPU can consume it.

Common bottleneck categories you will see in real training runs:

  • Input and preprocessing bottlenecks: Dataloader workers cannot keep up because of expensive CPU transforms, slow storage or insufficient overlap between loading and training.
  • Transfer bottlenecks: Host-to-device copies are frequent, synchronous (blocking) or not using pinned/page-locked memory; the GPU waits for the next batch instead of overlapping copies with compute.
  • Kernel bottlenecks: Kernels are memory-bound, have low occupancy, are dominated by stalls, or the model launches many very small kernels from Python, where kernel launch overhead and interpreter overhead dominate step time.
  • Synchronization bottlenecks: Hidden sync points force the CPU to wait for the GPU (or vice versa), collapsing overlap.
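Hidden sync points are easy to create by accident. As a minimal sketch (the helper names are illustrative, not from a real API), compare per-step `.item()` calls, each of which forces a host-device sync, with on-device accumulation that syncs once:

```python
import torch

def loss_with_syncs(losses):
    # Anti-pattern: .item() forces a host-device sync on every iteration
    total = 0.0
    for l in losses:
        total += l.item()
    return total

def loss_without_syncs(losses):
    # Accumulate on-device (works on CPU too); sync once at the end
    total = torch.zeros(())
    for l in losses:
        total = total + l.detach()
    return total.item()
```

Both variants return the same value; the second keeps the GPU pipeline full by deferring the sync to a single point.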

IDC’s Q3 2025 tracker coverage reported $112.4B in server revenue (up 61% YoY) and noted servers with embedded GPUs contributed over half of server market revenue.

When Should You Use Nsight Compute vs PyTorch Profiler?

Below is a side-by-side comparison table that will help you choose the right profiler tools.

| Feature | PyTorch Profiler | NVIDIA Nsight Compute |
| --- | --- | --- |
| Primary Focus | PyTorch operations, Python-level bottlenecks, CPU/GPU task correlation | Detailed CUDA kernel performance metrics (hardware counters, SASS, PTX, source correlation) |
| Level of Detail | High-level (e.g., aten::pow, torch.nn.Linear) | Low-level (e.g., memory access patterns, cache hit rates, occupancy, instruction metrics) |
| Output Format | JSON trace files (Chrome/Perfetto trace format, viewable in TensorBoard, Chrome Trace Viewer or Perfetto), in-console tables | Interactive GUI, CLI reports (.ncu-rep files), guided analysis and rule sets |
| Usage | Integrated directly into PyTorch code using a Python API or context manager | Launched via command line or a standalone GUI that launches/attaches to the application |
| Overhead | Noticeable during profiling due to detailed data collection, but can be limited via schedules (wait/warmup/active) and selective activities | Relatively high per kernel when many metrics/sections are enabled, so typically used for short, targeted kernel analysis rather than long whole-application runs |

Key Takeaways:

  • Use PyTorch Profiler when you need a macro view of where time goes across the training step. It answers, “Which step or operator is slow?” by separating CPU time from CUDA time, grouping by operator and showing autograd phases. That visibility matters because many performance bottlenecks live in Python overhead, dataloading or synchronization rather than a single kernel.
  • Use Nsight Compute when you need a micro view of why a specific CUDA kernel is slow. It answers, “Why is this kernel slow?” by exposing warp stall reasons, memory throughput, cache efficiency, achieved occupancy and Tensor Core related metrics. These details matter because kernel tuning decisions should be driven by measured limits, not intuition.

Rule of thumb, described in plain terms:

  • If you need to trace views and end-to-end step visibility, you should start with PyTorch Profiler.
  • If you need kernel optimization evidence, you should move to Nsight Compute after you isolate the hotspot.

Note: PyTorch Profiler gives high-level visibility into training steps and autograd spans, while Nsight Compute is ideal for analyzing low-level kernel activity. Combining both tools gives you macro-to-micro insight into performance inefficiencies.

How to Profile GPU Training Pipeline Step-by-Step?

Below is a practical playbook that you can apply to most GPU training jobs.

Step 1: Establish a stable baseline (before any profiling)

Before you profile, measure a small, repeatable slice:

  • Pick a fixed batch size and a fixed number of steps (for example, 200).
  • Report median and p95 step time.
  • Ensure warmup is excluded (first N steps).

You need this baseline because both profilers add overhead and can change timing.
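A baseline harness can be as small as the sketch below (the names are illustrative; on GPU, end each step with torch.cuda.synchronize() so timings reflect completed device work, not just kernel launches):

```python
import statistics
import time

def measure_step_times(step_fn, num_steps=200, warmup=10):
    # Exclude warmup steps so compilation and caching do not skew the stats
    for _ in range(warmup):
        step_fn()
    times = []
    for _ in range(num_steps):
        t0 = time.perf_counter()
        step_fn()  # on GPU, call torch.cuda.synchronize() inside step_fn
        times.append(time.perf_counter() - t0)
    times_sorted = sorted(times)
    p95 = times_sorted[int(0.95 * (len(times_sorted) - 1))]
    return statistics.median(times), p95
```

Record the median/p95 pair before and after each change so you can attribute improvements to specific fixes.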

Step 2: Capture step-level traces with PyTorch Profiler

Use a scheduled capture, so you do not accidentally profile warmup artifacts.

import torch

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=2, warmup=2, active=6, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./tb"),
    record_shapes=True,
    profile_memory=True,
    with_stack=False,   # enable only when needed
) as prof:
    for step, batch in enumerate(loader):
        loss = train_step(batch)
        prof.step()

Why this structure matters:

  • The wait/warmup/active schedule reduces skew from profiler startup overhead and keeps the trace focused.
  • tensorboard_trace_handler writes traces you can inspect as trace views in TensorBoard.
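Alongside the TensorBoard trace, key_averages() prints an in-console summary of the hottest operators. A minimal CPU-only sketch with a toy layer (not the article's training loop; use sort_by="cuda_time_total" when CUDA activity is captured):

```python
import torch

model = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
) as prof:
    for _ in range(3):
        model(x).sum().backward()

# Aggregate per-operator stats and print the hottest ops
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```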

What to look for in the trace:

GPU gaps: Long stretches where CUDA lanes are idle while CPU work continues.

Dominant ops: Attention, matmul, layernorm, embedding ops often stand out.

Transfer patterns: Frequent copies, blocking transfers or copy bursts.

Step 3: Add NVTX ranges so “mystery kernels” become readable

Once you have a hotspot, label phases so kernels group under meaningful names.

import torch

torch.cuda.nvtx.range_push("forward")
out = model(x)
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("backward")
loss = loss_fn(out, y)
loss.backward()
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("optimizer")
opt.step()
opt.zero_grad(set_to_none=True)
torch.cuda.nvtx.range_pop()

PyTorch exposes NVTX range push/pop, and these ranges are designed to annotate nested spans on a thread.

These annotations are especially helpful when you later profile with Nsight Systems or Nsight Compute, because NVTX ranges show up as named regions (for example, “forward”, “backward”, “optimizer”) and make it easier to filter or replay just those spans.
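If you also want named spans inside PyTorch Profiler traces, or need annotations that work on CPU-only machines, torch.profiler.record_function provides a similar scoped marker. A minimal sketch with a toy model:

```python
import torch

model = torch.nn.Linear(16, 16)
x = torch.randn(4, 16)

# record_function spans appear as named regions in PyTorch Profiler traces,
# much like NVTX ranges do in the Nsight tools
with torch.profiler.record_function("forward"):
    out = model(x)
```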

Step 4: Zoom into the hot kernels with Nsight Compute (ncu)

Once PyTorch Profiler points to a slow op or phase, switch to Nsight Compute to explain kernel behavior.

A simple CLI flow:

ncu --target-processes all --set basic -o profile_basic python train.py

Notes that matter in practice:

  • ncu can write a report file for later inspection in the UI.
  • Nsight Compute may need to replay kernels multiple times to collect all requested data, which is why you should profile short, representative windows.
  • If you want to profile NVTX-defined ranges, Nsight Compute supports replay modes where ranges are defined by NVTX expressions (range replay / app-range).

This is also the clean answer to: “Is Nsight better than PyTorch Profiler?” It is deeper, not better. Use it after you have a target.

Step 5: Interpret Nsight Compute results using a small set of “first metrics”

You do not need to become a GPU microarchitecture expert to get value quickly. Start with questions:

  • Is the kernel memory-bound? Look at memory throughput and cache behavior.
  • Is it latency bound? Warp stall reasons can hint at waiting on memory or dependencies.
  • Is occupancy limiting you? Compare achieved versus theoretical occupancy and check whether resource limits (registers, shared memory) are constraining you.

Nsight Compute includes an Occupancy Calculator to help reason about occupancy for a given kernel configuration.
When you see high “Long Scoreboard” stalls, interpret it as warps waiting on long-latency operations (typically memory). Validate this by checking DRAM throughput, cache hit rates and L2 utilization.
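A quick sanity check for "memory-bound" is a back-of-envelope roofline comparison: divide the kernel's FLOPs by the bytes it moves and compare against the GPU's compute-to-bandwidth ratio. The sketch below is illustrative; replace the peak numbers with your GPU's datasheet values:

```python
def is_memory_bound(flops, bytes_moved, peak_tflops, peak_bw_gbs):
    # Arithmetic intensity: FLOPs per byte of DRAM traffic
    intensity = flops / bytes_moved
    # Ridge point of the roofline: peak compute / peak bandwidth (FLOP/byte)
    ridge = (peak_tflops * 1e12) / (peak_bw_gbs * 1e9)
    return intensity < ridge
```

For example, an elementwise add on 4096 floats does roughly 4096 FLOPs while moving about 48 KB, an intensity far below any modern GPU's ridge point, so it is memory-bound no matter how you tune it.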

Step 6: Apply fixes based on what you actually measured

This is where “CUDA kernel optimization” becomes concrete:

  • If you are input-bound, fix the input pipeline first (below).
  • If you are transfer-bound, fix pinned memory and async transfers.
  • If you are kernel-bound, focus on memory locality, fusion opportunities, and reducing bandwidth pressure, then re-profile.

Kernel-side fixes to mention (kept intentionally practical, not academic)

  • If kernels are memory-bound, prioritize reducing memory traffic (layout, fusion, fewer reads/writes).
  • If you see many tiny kernels, reduce fragmentation (fuse ops where possible, cut Python overhead).
  • If you expected Tensor Cores but don’t see the speedup, verify mixed precision paths and actual kernel mix (profilers reveal this quickly).
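To verify the mixed-precision path, a minimal autocast sketch helps (a toy Linear layer, not the article's model; on CPU this exercises the bf16 path, while on supported GPUs suitably shaped bf16/fp16 GEMMs dispatch to Tensor Cores):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 128).to(device)
x = torch.randn(32, 128, device=device)

# autocast selects reduced-precision kernels where it is considered safe
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)

# If the output dtype is still float32, autocast did not take the
# reduced-precision path and Tensor Cores were likely not used
print(out.dtype)
```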

Step 7: Fix data throughput when the GPU is not the problem

If PyTorch Profiler shows GPU gaps and the CPU is busy, you are often dataloader-bound.

Start with the proven knobs:

  • Set num_workers > 0 to overlap training and data loading (tune per workload).
  • Use pin_memory=True to enable faster host-to-GPU transfers from pinned memory.
  • Use non_blocking=True for transfers where appropriate, but do not assume every “pin then non_block” pattern is faster; PyTorch’s tutorial shows there are cases where specific sequences can be slower.

A simple pattern:

for batch in loader:
    x, y = batch
    x = x.to("cuda", non_blocking=True)
    y = y.to("cuda", non_blocking=True)
    train_step((x, y))
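A DataLoader configured with these knobs might look like the sketch below (the dataset and worker counts are illustrative; tune num_workers per workload):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))

loader = DataLoader(
    ds,
    batch_size=32,
    num_workers=2,            # overlap loading with training
    pin_memory=True,          # page-locked host memory for faster H2D copies
    persistent_workers=True,  # avoid re-forking workers every epoch
    prefetch_factor=2,        # batches prefetched per worker
)
```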

How to Debug Distributed Training Bottlenecks (NCCL, Overlap, Stragglers)?

If training spans multiple GPUs (DDP) or nodes, the bottleneck is often communication or imbalance, not a single kernel.

What to look for:

  • All-reduce/all-gather dominates step time: comm-bound scaling
  • Some ranks lag behind: stragglers (data skew, CPU contention, slow storage, or imbalance)
  • Poor overlap of comm with compute: synchronization or scheduling issues

How to profile effectively:

  • Keep using PyTorch Profiler for step-level structure, but ensure each worker writes separate traces. PyTorch’s tensorboard_trace_handler supports a per-worker name for distributed scenarios.
  • Use Nsight Compute selectively (one rank, short window) to avoid multiplying overhead across all ranks.

Practical tip: Nsight Compute CLI includes guidance/examples for multi-process/MPI profiling and how to profile a single rank via a wrapper script.
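For the per-worker traces, tensorboard_trace_handler accepts a worker_name argument; a minimal sketch (the RANK fallback to 0 is an assumption for runs outside torchrun):

```python
import os
import torch

# torchrun sets RANK per process in distributed jobs; default to 0 otherwise
rank = int(os.environ.get("RANK", 0))

# One trace file per rank makes stragglers easy to compare side by side
handler = torch.profiler.tensorboard_trace_handler(
    "./tb_dist", worker_name=f"rank{rank}"
)
```

Pass this handler as on_trace_ready in each rank's profiler so every worker writes its own trace under the shared directory.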

Turn GPU Bottlenecks into Faster Training with AceCloud

Nsight Compute and PyTorch Profiler together provide a practical, evidence-based way to eliminate GPU training bottlenecks without guessing. PyTorch Profiler trace views reveal whether stalls come from dataloaders, CPU work, transfers or synchronization points.

Nsight Compute then validates kernel-level limits using NVIDIA profiling tools, including metrics for bandwidth pressure, warp stalls, occupancy constraints, and Tensor Core usage.

This tiered workflow keeps profiling GPU workloads focused, because each change is tied to measurable step time improvement.

AceCloud makes it easier to apply these wins at scale on GPU-first infrastructure with on-demand and spot NVIDIA GPUs and a 99.99%* uptime SLA.

You can run both Nsight tools and PyTorch Profiler directly on AceCloud instances, capture traces to attached NVMe or object storage, and reuse the same profiling workflow across A100, H100 and H200 clusters as your hardware mix evolves.

Run the same baseline, re-profile after each change, then deploy the optimized training job on AceCloud to reduce time-to-train and control spend.

Frequently Asked Questions

What causes GPU training bottlenecks?

Common causes include input stalls, sync points, memory bandwidth limits, many small kernels and non-overlapped transfers. Each cause limits throughput differently, which is why traces are needed.

How do you find the bottleneck in a training run?

You should start at the step level with PyTorch Profiler to locate the hotspot, then zoom into the kernel with Nsight Compute.

Is Nsight Compute better than PyTorch Profiler?

Nsight Compute is not better, it is deeper. PyTorch Profiler tells you where time goes, while Nsight Compute explains why a kernel behaves that way.

How should you read Nsight Compute results?

You should start with the top kernels, then compare memory throughput signals to compute throughput signals and review the top warp stall reasons.

What should you check before tuning kernels?

You should validate whether you are input-bound in the dataloader and transfers before you tune kernels, because input stalls can dominate step time.

How do you profile distributed training without multiplying overhead?

Profile one rank, narrow the window using kernel/NVTX filters, and write per-worker traces when using PyTorch Profiler in distributed runs.

Jason Karlin
author
Industry veteran with over 10 years of experience architecting and managing GPU-powered cloud solutions. Specializes in enabling scalable AI/ML and HPC workloads for enterprise and research applications. Former lead solutions architect for top-tier cloud providers and startups in the AI infrastructure space.
