Compute-Bound Glossary
Memory-bound behavior due to large activation storage in ML.
Performance limited by algorithm structure rather than hardware.
Distributed training limited by collective communication bandwidth.
Theoretical limit on speedup due to serial portions of a workload.
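The serial-fraction limit above can be sketched numerically. A minimal sketch; the parallel fraction and worker counts are illustrative assumptions, not measurements:

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

# With 95% of the work parallelizable, speedup saturates well below the
# worker count: the 5% serial portion caps it at 1 / 0.05 = 20x, no matter
# how many workers are added.
for n in (8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

Note how quickly the curve flattens: going from 64 to 1024 workers buys only a few extra multiples once the serial portion dominates.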
Another term for operational intensity, commonly used in HPC literature.
Overlapping compute with memory or I/O operations.
Realistic performance ceiling achievable under practical constraints.
CPU limited by execution units or memory access.
The point where compute capability and memory bandwidth are optimally matched.
State where memory bandwidth is fully utilized.
Memory-bound workload failing to use available bandwidth, often latency-bound.
A workload limited by maximum data transfer rate rather than access latency.
Larger batches increase OI and reduce memory-bound behavior.
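The batch-size effect can be made concrete with a simple traffic count. A sketch, assuming fp32 operands and that each operand is read once and the output written once (caching ignored); the matrix sizes are illustrative:

```python
def matmul_oi(m: int, k: int, batch: int, bytes_per_elem: int = 4) -> float:
    """Operational intensity (FLOPs/byte) of an (m x k) @ (k x batch)
    matmul: 2*m*k*batch FLOPs over one read of each operand plus one
    write of the output."""
    flops = 2.0 * m * k * batch
    bytes_moved = bytes_per_elem * (m * k + k * batch + m * batch)
    return flops / bytes_moved

# A 4096 x 4096 weight matrix: at batch 1 the weights are read once per
# output vector and OI is ~0.5 FLOPs/byte, while larger batches amortize
# the weight traffic and push OI (and the workload) toward compute-bound.
for b in (1, 16, 256):
    print(b, round(matmul_oi(4096, 4096, b), 2))
```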
Incorrect branch prediction causing pipeline flushes and lost cycles.
Inverse of operational intensity, indicating memory traffic per unit of computation.
Reorganizing computation to improve data reuse.
Access to data not found in cache, requiring slower memory access.
Performance loss caused by fetching data from lower memory levels.
A measure of how many distinct memory locations are accessed between two uses of the same location; used to determine whether data is likely to remain in a given cache level and thus whether a workload will be cache-bound or DRAM-bound.
A workload whose performance is limited primarily by the capacity or bandwidth of CPU/GPU caches (L1/L2/L3), even though external memory bandwidth may not be fully utilized. Typical in workloads with large working sets that barely fit or repeatedly thrash specific cache levels.
Ratio of data movement to computation in a specific implementation.
Performance dominated by barriers, broadcasts, or reductions.
Distributed workload limited by inter-node data exchange.
Improving performance by increasing parallelism or instruction throughput.
Maximum achievable performance limited by compute capability.
The rate at which a processor executes operations, typically measured in FLOPS or instructions per second.
Compute-bound workload failing to use available compute resources.
Percentage of available compute resources actively used.
A workload whose performance is limited by CPU or accelerator compute capacity rather than data movement.
Mismatch between spend and delivered performance.
Performance intentionally capped to control cloud spend.
Compute-bound workload limited specifically by CPU execution capacity.
Keeping data close to compute to reduce access cost.
Serial instruction dependencies limiting parallel execution.
Performance limited by aggregate compute across nodes.
Performance limited by data movement across nodes.
A memory-bound scenario where performance is constrained by off-chip DRAM bandwidth or latency, even when cache behavior is tuned, often observed as high DRAM bandwidth utilization with modest core utilization.
Using runtime profiling to determine real bottlenecks.
Compute limitation due to overuse of specific execution ports.
Misidentifying a memory- or latency-bound workload as compute-bound.
Misidentifying a compute-bound workload as memory-bound.
CPU limited by instruction fetch or decode stages.
Number of active warps available to hide latency.
Compute-bound workload limited by GPU execution units.
Training phase dominated by compute rather than memory.
Performance limited by physical resource constraints.
GPU workload limited by high-bandwidth memory throughput.
ML workload limited by CPU-GPU data transfer.
Performance limited by disk or storage access.
Inference workloads limited by memory access rather than math.
ML workload limited by data loading rather than compute.
VM types with poor compute-to-memory ratios.
Reordering instructions to hide latency.
Rate at which instructions are executed by a processor.
A front-end–bound condition where the instruction cache or instruction TLB cannot supply decoded instructions fast enough, typically due to large code footprints, frequent branching, or poor locality in hot paths.
Ability to execute multiple independent instructions simultaneously.
Combining operations to reduce memory traffic.
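The traffic saved by fusing can be counted directly. A minimal sketch for the elementwise chain `z = a*x + b` with scalar `a` and vectors `x`, `b` of length `n`; element size and the two-kernel split are illustrative assumptions:

```python
def traffic_unfused(n: int, bytes_per_elem: int = 4) -> int:
    """Two kernels: y = a*x (read x, write y), then z = y + b
    (read y, read b, write z) -> 5n elements of traffic."""
    return bytes_per_elem * 5 * n

def traffic_fused(n: int, bytes_per_elem: int = 4) -> int:
    """One fused kernel: z = a*x + b in a single pass
    (read x, read b, write z) -> 3n elements of traffic."""
    return bytes_per_elem * 3 * n
```

Fusion here removes the intermediate `y` entirely, cutting memory traffic to 3/5 of the unfused version, which is exactly why fusing helps bandwidth-bound elementwise chains.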
Intersection where a workload transitions from memory-bound to compute-bound.
A performance bottleneck where multiple cores or processes compete for shared last-level cache capacity or bandwidth, causing elevated miss rates and making workloads appear memory-bound even when DRAM is not saturated.
Performing useful work while waiting on memory or I/O.
Latency measures delay per access, while bandwidth measures sustained data transfer capacity.
A workload limited by memory access delay rather than memory bandwidth.
A scaling bottleneck where some threads, processes, or nodes finish work much later than others, causing overall throughput to be limited by the slowest participants rather than raw compute or memory limits.
Ratio of memory operations to compute operations.
Increasing ILP by reducing loop control overhead.
Ratio of peak compute performance to peak memory bandwidth for a system.
Sequence and regularity of memory accesses.
The maximum rate at which data can be transferred between memory and compute units.
The time required to fetch data from memory after a request is issued.
Improving performance by reducing data movement or improving locality.
Maximum achievable performance limited by memory bandwidth.
Compute idle time caused by waiting for memory access.
Severe slowdown caused by constant cache or page replacement.
A workload whose performance is limited by memory access latency or bandwidth rather than compute power.
Ability to issue multiple memory requests concurrently.
Performance limited by network latency or bandwidth.
Cache that can serve other requests while handling a miss.
Performance impact of accessing memory attached to a remote CPU socket.
Interconnect bandwidth limiting multi-GPU performance.
GPU kernel limited by register or shared memory usage.
The ratio of computation performed per byte of data moved.
Change in OI due to algorithmic or code optimization.
Number of in-flight memory operations allowed by hardware.
Paying for compute when the workload is memory-bound.
Performance cost of virtual memory address translation.
CPU-GPU data transfer limiting performance.
The resource that limits overall system performance regardless of optimization elsewhere.
Measuring achieved performance relative to cloud cost.
Idle cycles caused by dependencies or waiting on data.
Degree to which execution pipelines are kept busy.
The practice of adjusting how far ahead software or hardware prefetches data (in iterations or bytes) so that prefetched lines arrive just in time for use, avoiding both late prefetches (latency-bound behavior) and overly early prefetches (cache pollution).
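The just-in-time condition above reduces to a simple ratio: prefetch far enough ahead that the line arrives after roughly one memory latency. A sketch; the latency and per-iteration timings are illustrative placeholders:

```python
import math

def prefetch_distance(mem_latency_ns: float, iter_time_ns: float) -> int:
    """Iterations to prefetch ahead so data arrives just in time:
    ceil(latency / time per loop iteration), at least 1."""
    return max(1, math.ceil(mem_latency_ns / iter_time_ns))

# e.g. hiding ~100 ns of DRAM latency behind a 5 ns loop body needs a
# prefetch distance of about 20 iterations; much less and the loop stalls
# (late prefetch), much more and lines risk eviction (cache pollution).
assert prefetch_distance(100.0, 5.0) == 20
```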
Loading data into cache before it is needed.
Skewed conclusions due to limited or noisy profiling data.
Irregular memory access causing poor cache utilization.
Extra memory traffic caused by insufficient registers.
Higher-latency memory access in NUMA systems.
A performance model that determines whether workloads are compute-bound or memory-bound based on OI.
The operational intensity value at which the memory roofline and compute roofline intersect; workloads below this point are memory-bound, while those above it are compute-bound on a given hardware platform.
Using Roofline analysis to decide between compute-side and memory-side optimizations.
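The Roofline decision rule can be sketched in a few lines. The peak-compute and bandwidth figures below are hypothetical placeholders, not any specific device's specs:

```python
def attainable_gflops(oi: float, peak_gflops: float, peak_gbps: float) -> float:
    """Roofline: attainable performance = min(compute roofline,
    OI * memory bandwidth)."""
    return min(peak_gflops, oi * peak_gbps)

PEAK_GFLOPS = 10_000.0   # hypothetical accelerator peak compute
PEAK_GBPS = 1_000.0      # hypothetical DRAM bandwidth
ridge_point = PEAK_GFLOPS / PEAK_GBPS  # 10 FLOPs/byte on this machine

# Below the ridge point the memory roofline binds (optimize data movement);
# above it the compute roofline binds (optimize math throughput).
assert attainable_gflops(2.0, PEAK_GFLOPS, PEAK_GBPS) == 2_000.0    # memory-bound
assert attainable_gflops(50.0, PEAK_GFLOPS, PEAK_GBPS) == 10_000.0  # compute-bound
```

Note that the ridge point here equals the machine balance: the same workload can be memory-bound on one machine and compute-bound on another with a different compute-to-bandwidth ratio.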
A GPU bottleneck where performance is constrained by shared-memory capacity or bandwidth (or bank conflicts), rather than global memory or tensor/FP32 compute throughput, common in highly optimized stencil and block-tiled kernels.
Compute-bound behavior caused by lack of parallelism.
Accessing nearby memory locations together.
Assuming peak hardware specs reflect real workload behavior.
The phenomenon where a small number of slow tasks or workers (“stragglers”) dominate end-to-end job completion time, making a system effectively bound by the slowest replica rather than average compute or memory performance.
Regular memory access with fixed spacing that may reduce cache efficiency.
A categorization of memory access patterns where streaming (sequential) access is designed to reach near-peak bandwidth, while strided or sparse patterns often yield lower effective bandwidth and turn a nominally bandwidth-bound workload into a latency-bound one.
Point where adding compute no longer improves performance.
Waiting time caused by coordination between parallel tasks.
Reusing data within short time intervals.
The proportion of time or FLOPs during which specialized tensor units (e.g., NVIDIA Tensor Cores) are actively executing useful math, used to determine whether a model is effectively using available matrix-multiply hardware or is limited by memory or scalar pipelines instead.
Compute-bound workload limited by tensor cores in ML accelerators.
Miss in address translation cache causing additional latency.
Memory-bound behavior caused by insufficient RAM.
Number of data elements processed per vector instruction.
Compute-bound behavior caused by limited SIMD or vector utilization.
Using SIMD instructions to increase compute throughput.
Ability to maintain performance as workload size grows.
Amount of data actively accessed by a workload.