Roofline Model Glossary
Reusing intermediate activations to improve operational intensity (OI).
Performance limit imposed by algorithm structure.
Operational intensity inherent to an algorithm’s structure.
Serial fractions limiting achievable speedup.
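The serial-fraction limit described above is Amdahl's law. A minimal numeric sketch, with the serial fraction and worker counts chosen for illustration:

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Upper bound on speedup when a fraction of the work is serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# With 5% serial work, even unlimited parallelism caps speedup at 1/0.05 = 20x.
print(amdahl_speedup(0.05, 8))
print(amdahl_speedup(0.05, 1024))  # already close to the 20x ceiling
```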
Alternate term for operational intensity used in HPC literature.
Systematic variation of problem size, blocking, or batch size to move a kernel horizontally across the Roofline and study how performance scales with arithmetic intensity (AI).
Sustained memory bandwidth measured with microbenchmarks (e.g., STREAM), used as the practical memory roof instead of vendor peak specs.
Sustained floating-point throughput measured with tuned kernels (e.g., LINPACK/GEMM), used as the practical compute roof instead of theoretical peak.
Realistic performance achievable under actual hardware and software conditions.
Realistic ceiling derived from empirical measurements.
Memory-bound workload not saturating available bandwidth.
A point where increasing compute does not improve performance due to memory limits.
Roofline region where memory bandwidth dominates performance.
How batch size shifts operational intensity and performance.
Small program used to measure peak compute or bandwidth.
Inverse of operational intensity, indicating memory pressure per operation.
Data transfer speed between compute units and cache.
Frequency of data not found in cache.
Roofline extension modeling ceilings at different cache levels.
Roofline model independent of explicit cache tuning.
Formal Roofline extension introducing multiple performance ceilings.
Roofline applied to cloud VM or GPU instance benchmarking.
Roofline analysis applied at node/cluster scale, combining per-node compute/memory roofs with network roofs to reason about distributed performance limits.
Ratio of data movement to computation in an implementation.
The maximum performance achievable when limited by compute throughput.
Rate at which a processor executes operations (e.g., FLOP/s).
Percentage of available compute resources actively used.
A point where memory is sufficient but compute resources are exhausted.
A workload limited by compute throughput rather than memory access.
Roofline region where compute throughput dominates performance.
Evaluating performance ceilings relative to cloud cost.
Proximity of data to compute units in the memory hierarchy.
Using loaded data multiple times before eviction.
OI calculated from runtime measurements.
Roofline derived from runtime profiling data.
Actual floating-point throughput (FLOP/s) achieved by a workload in practice.
Roofline constructed using measured hardware values.
Extension of the Roofline model that uses FLOPs per joule (or ops/W) as the metric, revealing energy-efficiency ceilings across arithmetic-intensity ranges.
Incorrectly identifying a workload as compute-bound.
Measure of floating-point compute performance.
Arithmetic operations involving floating-point numbers.
Performance limit imposed by hardware constraints.
Low-level metrics used to measure operations and data movement.
Extension modeling performance across cache, DRAM, and HBM.
Extension modeling storage bandwidth as a performance limit.
Roofline variant that models only in-core limits (issue width, ports, vector units, front-end bandwidth) assuming data is already in the L1 cache or registers.
Roofline analysis specific to ML inference workloads.
Distribution of instruction types executed by a workload.
Roofline variant based on instruction throughput rather than FLOPs.
Rate at which instructions are executed by a processor.
Ability to execute multiple instructions simultaneously.
Combining multiple kernels to reduce memory traffic.
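Kernel fusion, as defined above, can be sketched in numpy. The array size and the elementwise chain (`relu(x * 2 + 1)`) are illustrative assumptions; note that numpy itself still materializes temporaries, so the traffic accounting below describes what a genuinely fused kernel would achieve:

```python
import numpy as np

N = 1_000_000
x = np.random.rand(N).astype(np.float32)

# Unfused: three separate "kernels", each reading and writing a full
# N-element array from/to memory.
def unfused(x):
    t1 = x * 2.0                 # read x, write t1
    t2 = t1 + 1.0                # read t1, write t2
    return np.maximum(t2, 0.0)   # read t2, write result

# Fused: one logical pass over the data; intermediates stay in
# registers/cache in a real fused kernel (numpy only approximates this).
def fused(x):
    return np.maximum(x * 2.0 + 1.0, 0.0)

assert np.allclose(unfused(x), fused(x))

bytes_per_elem = 4
unfused_traffic = 6 * N * bytes_per_elem  # 3 kernels x (1 read + 1 write)
fused_traffic = 2 * N * bytes_per_elem    # 1 read + 1 write
print(unfused_traffic / fused_traffic)    # 3.0: fusion cuts traffic 3x here
```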
Non-compute cost of launching GPU kernels.
Intersection of the memory and compute roofs.
Trade-off between memory access latency and sustained throughput.
Region of the Roofline plot where performance falls below the memory roof because memory access latency or poor MLP, not raw bandwidth, is the dominant limiter.
Roofline analysis per neural network layer.
Benchmark for measuring peak floating-point performance.
Ratio of memory operations to compute operations.
Technique to improve cache reuse by restructuring loops.
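The loop restructuring described above (tiling, also called blocking) in a minimal numpy sketch; the tile size and matrix shapes are arbitrary choices for illustration:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: the loops are restructured so that each
    (tile x tile) block of A, B, and C is reused many times while it is
    still cache-resident, instead of streaming whole rows/columns."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

A = np.random.rand(96, 96)
B = np.random.rand(96, 96)
assert np.allclose(tiled_matmul(A, B), A @ B)
```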
Ratio of peak compute throughput to peak memory bandwidth on a given system; defines the operational intensity at which workloads transition from memory-bound to compute-bound.
Variability affecting Roofline accuracy.
Sequence and regularity of memory accesses.
State where memory throughput is fully utilized.
Combining memory accesses into fewer GPU transactions.
Layered memory system including registers, caches, DRAM, and HBM.
The maximum performance achievable when limited by memory bandwidth.
Total data transferred between memory and compute.
A workload limited by memory bandwidth rather than compute capability.
Ability to overlap multiple memory accesses.
Use of lower-precision formats to improve performance and efficiency.
Roofline showing multiple ceilings for different memory layers.
Performance deviation due to shared cloud infrastructure.
Extension modeling communication bandwidth in distributed systems.
Performance limit imposed by NVLink interconnect bandwidth.
Proportion of active warps relative to hardware limits.
Performance limited by thread scheduling constraints.
Ratio of computational operations to bytes of data moved.
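A back-of-envelope calculation of this ratio for two common kernels, assuming float32 (4 bytes) elements and counting only compulsory memory traffic (perfect caching); the sizes are illustrative:

```python
def operational_intensity(flops: float, bytes_moved: float) -> float:
    """OI = floating-point operations / bytes of data moved."""
    return flops / bytes_moved

N = 1_000_000
# SAXPY: y = a*x + y -> 2 FLOPs per element; reads x and y, writes y.
saxpy_oi = operational_intensity(2 * N, 3 * N * 4)
print(f"SAXPY OI = {saxpy_oi:.3f} FLOP/byte")  # 1/6: firmly memory-bound

n = 1024
# Dense matmul (n x n): 2*n^3 FLOPs; at minimum 3*n^2 elements of traffic.
gemm_oi = operational_intensity(2 * n**3, 3 * n**2 * 4)
print(f"GEMM OI  = {gemm_oi:.1f} FLOP/byte")   # grows with n: compute-bound
```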
Minimum OI required for a workload to become compute-bound.
Applying Roofline to individual computational kernels or operators.
Movement of a workload across Roofline regimes after optimization.
Performance limit imposed by PCIe bandwidth.
Maximum data transfer rate supported by memory hardware.
Theoretical maximum compute throughput advertised by hardware specifications.
The upper bound on performance imposed by hardware limits.
Metric combining Roofline insights with pricing.
Modeling performance across pipeline stages.
Delay caused by data, control, or resource dependencies.
Degree to which compute pipelines remain active.
Performance ceiling where CPU/GPU is capped by TDP or power limits before hitting compute or memory roofs, common on dense or thermally constrained systems.
Loading data into cache before it is needed.
Constructing Roofline plots using profiling tools.
Irregular memory access causing cache inefficiency.
Performance impact from limited register availability.
The specific operational intensity at the intersection of the memory roof and compute roof, marking the boundary between bandwidth-limited and compute-limited regimes.
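The ridge point and the attainable-performance bound it separates, using illustrative (not measured) hardware numbers of 10 TFLOP/s peak compute and 1 TB/s memory bandwidth:

```python
# Assumed, illustrative hardware roofs.
PEAK_FLOPS = 10e12   # FLOP/s (compute roof)
PEAK_BW = 1e12       # bytes/s (memory roof slope)

def attainable(oi: float) -> float:
    """Roofline bound: min(compute roof, memory roof at this OI)."""
    return min(PEAK_FLOPS, PEAK_BW * oi)

ridge_point = PEAK_FLOPS / PEAK_BW  # FLOP/byte; 10.0 for these numbers
print(ridge_point)

# Below the ridge point the memory roof binds; above it, the compute roof.
assert attainable(1.0) == PEAK_BW * 1.0
assert attainable(100.0) == PEAK_FLOPS
```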
Using the model to decide whether to optimize compute or memory.
Applying the model to ML and AI performance analysis.
Application of the model to supercomputing workloads.
A visual performance model that shows the maximum achievable performance of a workload as a function of compute capability and memory bandwidth.
Process of deriving the actual roofs (compute, memory, interconnect) from empirical measurements and updating the model when hardware, drivers, or clocks change.
Improving performance by increasing OI or hardware utilization.
A log–log graph plotting attainable performance versus operational intensity.
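Such a plot can be generated in a few lines. The hardware numbers are assumed for illustration; the two arrays below are the model itself, and the matplotlib section is optional rendering:

```python
import numpy as np

# Assumed roofs: 10 TFLOP/s peak compute, 1 TB/s memory bandwidth.
peak, bw = 10e12, 1e12
oi = np.logspace(-2, 3, 200)       # operational intensity, FLOP/byte
perf = np.minimum(peak, bw * oi)   # attainable FLOP/s at each OI

try:
    import matplotlib.pyplot as plt
    plt.loglog(oi, perf)
    plt.xlabel("Operational intensity (FLOP/byte)")
    plt.ylabel("Attainable performance (FLOP/s)")
    plt.title("Roofline")
    plt.savefig("roofline.png")
except ImportError:
    pass  # plotting is optional; the arrays above already encode the roofs
```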
Using Roofline insights (arithmetic intensity and distance to the roofs) to drive automated search over tile sizes, vector width, and fusion choices to approach the relevant ceilings.
Using the model to prioritize optimization strategies.
Skewed conclusions from incomplete profiling.
Parallel execution of the same instruction across multiple data elements.
Accessing data stored near other recently accessed data.
Using vendor peak numbers instead of measured values.
Roofline based on theoretical hardware specifications.
Standard benchmark for measuring memory bandwidth.
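A rough sketch of how such a bandwidth microbenchmark works, using the STREAM "triad" kernel (`a = b + s*c`) in numpy. The real STREAM benchmark is written in C/Fortran; Python/numpy overheads and temporaries mean this version underestimates true hardware bandwidth:

```python
import time
import numpy as np

N = 10_000_000  # illustrative array length
b = np.random.rand(N)
c = np.random.rand(N)
s = 3.0

start = time.perf_counter()
a = b + s * c                      # STREAM triad: a = b + s*c
elapsed = time.perf_counter() - start

# Triad nominally moves 3 arrays of 8-byte doubles: read b, read c, write a.
gbytes = 3 * N * 8 / 1e9
print(f"Triad bandwidth ~ {gbytes / elapsed:.1f} GB/s (lower bound)")
```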
Regular memory access with fixed spacing.
Roofline behavior as resources increase for fixed problem size.
Measured performance achieved during real workload execution.
Performance loss due to barriers or collective operations.
Reuse of data within short time intervals.
Degree to which GPU tensor cores are used.
FLOPs executed on specialized tensor units in accelerators.
Roofline analysis specific to ML training workloads.
Roofline adapted for transactional or database workloads.
Number of elements processed per vector instruction.
Code structured to maximize SIMD usage.
Effectiveness of SIMD or vector instruction usage.
GPU threads waiting due to memory or execution dependencies.
Roofline behavior as problem size grows with resources.
Amount of data actively used by a workload.