Compute-Bound Glossary
Memory-bound behavior due to large activation storage in ML.
Performance limited by algorithm structure rather than hardware.
Distributed training limited by collective communication bandwidth.
Theoretical limit on speedup due to serial portions of a workload.
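The serial-fraction limit above can be sketched numerically. A minimal sketch; the parallel fraction and worker counts are illustrative assumptions, not measurements:

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

# With 95% of the work parallelizable, speedup saturates well below the
# worker count: the 5% serial portion caps it at 1 / 0.05 = 20x, no matter
# how many workers are added.
for n in (8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

Note how quickly the curve flattens: going from 64 to 1024 workers buys only a few extra multiples once the serial portion dominates.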
Another term for operational intensity, commonly used in HPC literature.
Overlapping compute with memory or I/O operations.
Realistic performance ceiling achievable under practical constraints.
CPU limited by execution units or memory access.
The point where compute capability and memory bandwidth are optimally matched.
State where memory bandwidth is fully utilized.
Memory-bound workload failing to use available bandwidth, often latency-bound.
A workload limited by maximum data transfer rate rather than access latency.
Larger batches increase OI and reduce memory-bound behavior.
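The batch-size effect can be made concrete with a simple traffic count. A sketch, assuming fp32 operands and that each operand is read once and the output written once (caching ignored); the matrix sizes are illustrative:

```python
def matmul_oi(m: int, k: int, batch: int, bytes_per_elem: int = 4) -> float:
    """Operational intensity (FLOPs/byte) of an (m x k) @ (k x batch)
    matmul: 2*m*k*batch FLOPs over one read of each operand plus one
    write of the output."""
    flops = 2.0 * m * k * batch
    bytes_moved = bytes_per_elem * (m * k + k * batch + m * batch)
    return flops / bytes_moved

# A 4096 x 4096 weight matrix: at batch 1 the weights are read once per
# output vector and OI is ~0.5 FLOPs/byte, while larger batches amortize
# the weight traffic and push OI (and the workload) toward compute-bound.
for b in (1, 16, 256):
    print(b, round(matmul_oi(4096, 4096, b), 2))
```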
Incorrect branch prediction causing pipeline flushes and lost cycles.
Inverse of operational intensity, indicating memory traffic per unit of computation.
Reorganizing computation to improve data reuse.
Access to data not found in cache, requiring slower memory access.
Performance loss caused by fetching data from lower memory levels.
A measure of how many distinct memory locations are accessed between two uses of the same location; used to determine whether data is likely to remain in a given cache level and thus whether a workload will be cache-bound or DRAM-bound.
A workload whose performance is limited primarily by the capacity or bandwidth of CPU/GPU caches (L1/L2/L3), even though external memory bandwidth may not be fully utilized. Typical in workloads with large working sets that barely fit or repeatedly thrash specific cache levels.
Ratio of data movement to computation in a specific implementation.
Performance dominated by barriers, broadcasts, or reductions.
Distributed workload limited by inter-node data exchange.
Improving performance by increasing parallelism or instruction throughput.
Maximum achievable performance limited by compute capability.
The rate at which a processor executes operations, typically measured in FLOPS or instructions per second.
Compute-bound workload failing to use available compute resources.
Percentage of available compute resources actively used.
A workload whose performance is limited by CPU or accelerator compute capacity rather than data movement.
Mismatch between spend and delivered performance.
Performance intentionally capped to control cloud spend.
Compute-bound workload limited specifically by CPU execution capacity.
Keeping data close to compute to reduce access cost.
Serial instruction dependencies limiting parallel execution.
Performance limited by aggregate compute across nodes.
Performance limited by data movement across nodes.
A memory-bound scenario where performance is constrained by off-chip DRAM bandwidth or latency, even when cache behavior is tuned, often observed as high DRAM bandwidth utilization with modest core utilization.
Using runtime profiling to determine real bottlenecks.
Compute limitation due to overuse of specific execution ports.
Misidentifying a memory- or latency-bound workload as compute-bound.
Misidentifying a compute-bound workload as memory-bound.
CPU limited by instruction fetch or decode stages.
Number of active warps available to hide latency.
Compute-bound workload limited by GPU execution units.
Training phase dominated by compute rather than memory.
Performance limited by physical resource constraints.
GPU workload limited by high-bandwidth memory throughput.
ML workload limited by CPU-GPU data transfer.
Performance limited by disk or storage access.
Inference workloads limited by memory access rather than math.
ML workload limited by data loading rather than compute.
VM types with poor compute-to-memory ratios.
Reordering instructions to hide latency.
Rate at which instructions are executed by a processor.
A front-end–bound condition where the instruction cache or instruction TLB cannot supply decoded instructions fast enough, typically due to large code footprints, frequent branching, or poor locality in hot paths.
Ability to execute multiple independent instructions simultaneously.
Combining operations to reduce memory traffic.
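The traffic saved by fusing can be counted directly. A minimal sketch for the elementwise chain `z = a*x + b` with scalar `a` and vectors `x`, `b` of length `n`; element size and the two-kernel split are illustrative assumptions:

```python
def traffic_unfused(n: int, bytes_per_elem: int = 4) -> int:
    """Two kernels: y = a*x (read x, write y), then z = y + b
    (read y, read b, write z) -> 5n elements of traffic."""
    return bytes_per_elem * 5 * n

def traffic_fused(n: int, bytes_per_elem: int = 4) -> int:
    """One fused kernel: z = a*x + b in a single pass
    (read x, read b, write z) -> 3n elements of traffic."""
    return bytes_per_elem * 3 * n
```

Fusion here removes the intermediate `y` entirely, cutting memory traffic to 3/5 of the unfused version, which is exactly why fusing helps bandwidth-bound elementwise chains.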
Intersection where a workload transitions from memory-bound to compute-bound.
A performance bottleneck where multiple cores or processes compete for shared last-level cache capacity or bandwidth, causing elevated miss rates and making workloads appear memory-bound even when DRAM is not saturated.
Performing useful work while waiting on memory or I/O.
Latency measures delay per access, while bandwidth measures sustained data transfer capacity.
A workload limited by memory access delay rather than memory bandwidth.
A scaling bottleneck where some threads, processes, or nodes finish work much later than others, causing overall throughput to be limited by the slowest participants rather than raw compute or memory limits.
Ratio of memory operations to compute operations.
Increasing ILP by reducing loop control overhead.
Ratio of peak compute performance to peak memory bandwidth for a system.
Sequence and regularity of memory accesses.
The maximum rate at which data can be transferred between memory and compute units.
The time required to fetch data from memory after a request is issued.
Improving performance by reducing data movement or improving locality.
Maximum achievable performance limited by memory bandwidth.
Compute idle time caused by waiting for memory access.
Severe slowdown caused by constant cache or page replacement.
A workload whose performance is limited by memory access latency or bandwidth rather than compute power.
Ability to issue multiple memory requests concurrently.
Performance limited by network latency or bandwidth.
Cache that can serve other requests while handling a miss.
Performance impact of accessing memory attached to a remote CPU socket.
Interconnect bandwidth limiting multi-GPU performance.
GPU kernel limited by register or shared memory usage.
The ratio of computation performed per byte of data moved.
Change in OI due to algorithmic or code optimization.
Number of in-flight memory operations allowed by hardware.
Paying for compute when the workload is memory-bound.
Performance cost of virtual memory address translation.
CPU-GPU data transfer limiting performance.
The resource that limits overall system performance regardless of optimization elsewhere.
Measuring achieved performance relative to cloud cost.
Idle cycles caused by dependencies or waiting on data.
Degree to which execution pipelines are kept busy.
The practice of adjusting how far ahead software or hardware prefetches data (in iterations or bytes) so that prefetched lines arrive just in time for use, avoiding both late prefetches (latency-bound behavior) and overly early prefetches (cache pollution).
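The just-in-time condition above reduces to a simple ratio: prefetch far enough ahead that the line arrives after roughly one memory latency. A sketch; the latency and per-iteration timings are illustrative placeholders:

```python
import math

def prefetch_distance(mem_latency_ns: float, iter_time_ns: float) -> int:
    """Iterations to prefetch ahead so data arrives just in time:
    ceil(latency / time per loop iteration), at least 1."""
    return max(1, math.ceil(mem_latency_ns / iter_time_ns))

# e.g. hiding ~100 ns of DRAM latency behind a 5 ns loop body needs a
# prefetch distance of about 20 iterations; much less and the loop stalls
# (late prefetch), much more and lines risk eviction (cache pollution).
assert prefetch_distance(100.0, 5.0) == 20
```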
Loading data into cache before it is needed.
Skewed conclusions due to limited or noisy profiling data.
Irregular memory access causing poor cache utilization.
Extra memory traffic caused by insufficient registers.
Higher-latency memory access in NUMA systems.
A performance model that determines whether workloads are compute-bound or memory-bound based on OI.
The operational intensity value at which the memory roofline and compute roofline intersect; workloads below this point are memory-bound, while those above it are compute-bound on a given hardware platform.
Using Roofline analysis to decide between compute-side and memory-side optimizations.
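The Roofline decision rule can be sketched in a few lines. The peak-compute and bandwidth figures below are hypothetical placeholders, not any specific device's specs:

```python
def attainable_gflops(oi: float, peak_gflops: float, peak_gbps: float) -> float:
    """Roofline: attainable performance = min(compute roofline,
    OI * memory bandwidth)."""
    return min(peak_gflops, oi * peak_gbps)

PEAK_GFLOPS = 10_000.0   # hypothetical accelerator peak compute
PEAK_GBPS = 1_000.0      # hypothetical DRAM bandwidth
ridge_point = PEAK_GFLOPS / PEAK_GBPS  # 10 FLOPs/byte on this machine

# Below the ridge point the memory roofline binds (optimize data movement);
# above it the compute roofline binds (optimize math throughput).
assert attainable_gflops(2.0, PEAK_GFLOPS, PEAK_GBPS) == 2_000.0    # memory-bound
assert attainable_gflops(50.0, PEAK_GFLOPS, PEAK_GBPS) == 10_000.0  # compute-bound
```

Note that the ridge point here equals the machine balance: the same workload can be memory-bound on one machine and compute-bound on another with a different compute-to-bandwidth ratio.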
A GPU bottleneck where performance is constrained by shared-memory capacity or bandwidth (or bank conflicts), rather than global memory or tensor/FP32 compute throughput, common in highly optimized stencil and block-tiled kernels.
Compute-bound behavior caused by lack of parallelism.
Accessing nearby memory locations together.
Assuming peak hardware specs reflect real workload behavior.
The phenomenon where a small number of slow tasks or workers (“stragglers”) dominate end-to-end job completion time, making a system effectively bound by the slowest replica rather than average compute or memory performance.
Regular memory access with fixed spacing that may reduce cache efficiency.
A categorization of memory access patterns where streaming (sequential) access is designed to reach near-peak bandwidth, while strided or sparse patterns often yield lower effective bandwidth and turn a nominally bandwidth-bound workload into a latency-bound one.
Point where adding compute no longer improves performance.
Waiting time caused by coordination between parallel tasks.
Reusing data within short time intervals.
The proportion of time or FLOPs during which specialized tensor units (e.g., NVIDIA Tensor Cores) are actively executing useful math, used to determine whether a model is effectively using available matrix-multiply hardware or is limited by memory or scalar pipelines instead.
Compute-bound workload limited by tensor cores in ML accelerators.
Miss in address translation cache causing additional latency.
Memory-bound behavior caused by insufficient RAM.
Number of data elements processed per vector instruction.
Compute-bound behavior caused by limited SIMD or vector utilization.
Using SIMD instructions to increase compute throughput.
Ability to maintain performance as workload size grows.
Amount of data actively accessed by a workload.