Roofline Model Glossary
Reusing intermediate activations to improve operational intensity (OI).
Performance limit imposed by algorithm structure.
Operational intensity inherent to an algorithm’s structure.
Serial fractions limiting achievable speedup.
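The serial-fraction limit described above is Amdahl's law. A minimal numeric sketch, with the serial fraction and worker counts chosen for illustration:

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Upper bound on speedup when a fraction of the work is serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# With 5% serial work, even unlimited parallelism caps speedup at 1/0.05 = 20x.
print(amdahl_speedup(0.05, 8))
print(amdahl_speedup(0.05, 1024))  # already close to the 20x ceiling
```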
Alternate term for operational intensity used in HPC literature.
Systematic variation of problem size, blocking, or batch size to move a kernel horizontally across the Roofline and study how performance scales with arithmetic intensity (AI).
Sustained memory bandwidth measured with microbenchmarks (e.g., STREAM), used as the practical memory roof instead of vendor peak specs.
Sustained floating-point throughput measured with tuned kernels (e.g., LINPACK/GEMM), used as the practical compute roof instead of theoretical peak.
Realistic performance achievable under actual hardware and software conditions.
Realistic ceiling derived from empirical measurements.
Memory-bound workload not saturating available bandwidth.
A point where increasing compute does not improve performance due to memory limits.
Roofline region where memory bandwidth dominates performance.
How batch size shifts operational intensity and performance.
Small program used to measure peak compute or bandwidth.
Inverse of operational intensity, indicating memory pressure per operation.
Data transfer speed between compute units and cache.
Frequency of data not found in cache.
Roofline extension modeling ceilings at different cache levels.
Roofline model independent of explicit cache tuning.
Formal Roofline extension introducing multiple performance ceilings.
Roofline applied to cloud VM or GPU instance benchmarking.
Roofline analysis applied at node/cluster scale, combining per-node compute/memory roofs with network roofs to reason about distributed performance limits.
Ratio of data movement to computation in an implementation.
The maximum performance achievable when limited by compute throughput.
Rate at which a processor executes operations (e.g., FLOP/s).
Percentage of available compute resources actively used.
A point where memory is sufficient but compute resources are exhausted.
A workload limited by compute throughput rather than memory access.
Roofline region where compute throughput dominates performance.
Evaluating performance ceilings relative to cloud cost.
Proximity of data to compute units in the memory hierarchy.
Using loaded data multiple times before eviction.
OI calculated from runtime measurements.
Roofline derived from runtime profiling data.
Actual floating-point throughput (FLOP/s) achieved by a workload in practice.
Roofline constructed using measured hardware values.
Extension of the Roofline model that uses FLOPs per joule (or ops/W) as the metric, revealing energy-efficiency ceilings across arithmetic-intensity ranges.
Incorrectly identifying a workload as compute-bound.
Measure of floating-point compute performance.
Arithmetic operations involving floating-point numbers.
Performance limit imposed by hardware constraints.
Low-level metrics used to measure operations and data movement.
Extension modeling performance across cache, DRAM, and HBM.
Extension modeling storage bandwidth as a performance limit.
Roofline variant that models only in-core limits (issue width, ports, vector units, front-end bandwidth) assuming data is already in the L1 cache or registers.
Roofline analysis specific to ML inference workloads.
Distribution of instruction types executed by a workload.
Roofline variant based on instruction throughput rather than FLOPs.
Rate at which instructions are executed by a processor.
Ability to execute multiple instructions simultaneously.
Combining multiple kernels to reduce memory traffic.
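Kernel fusion, as defined above, can be sketched in numpy. The array size and the elementwise chain (`relu(x * 2 + 1)`) are illustrative assumptions; note that numpy itself still materializes temporaries, so the traffic accounting below describes what a genuinely fused kernel would achieve:

```python
import numpy as np

N = 1_000_000
x = np.random.rand(N).astype(np.float32)

# Unfused: three separate "kernels", each reading and writing a full
# N-element array from/to memory.
def unfused(x):
    t1 = x * 2.0                 # read x, write t1
    t2 = t1 + 1.0                # read t1, write t2
    return np.maximum(t2, 0.0)   # read t2, write result

# Fused: one logical pass over the data; intermediates stay in
# registers/cache in a real fused kernel (numpy only approximates this).
def fused(x):
    return np.maximum(x * 2.0 + 1.0, 0.0)

assert np.allclose(unfused(x), fused(x))

bytes_per_elem = 4
unfused_traffic = 6 * N * bytes_per_elem  # 3 kernels x (1 read + 1 write)
fused_traffic = 2 * N * bytes_per_elem    # 1 read + 1 write
print(unfused_traffic / fused_traffic)    # 3.0: fusion cuts traffic 3x here
```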
Non-compute cost of launching GPU kernels.
Intersection of the memory and compute roofs.
Trade-off between memory access latency and sustained throughput.
Region of the Roofline plot where performance falls below the memory roof because memory access latency or poor MLP, not raw bandwidth, is the dominant limiter.
Roofline analysis per neural network layer.
Benchmark for measuring peak floating-point performance.
Ratio of memory operations to compute operations.
Technique to improve cache reuse by restructuring loops.
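The loop restructuring described above (tiling, also called blocking) in a minimal numpy sketch; the tile size and matrix shapes are arbitrary choices for illustration:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: the loops are restructured so that each
    (tile x tile) block of A, B, and C is reused many times while it is
    still cache-resident, instead of streaming whole rows/columns."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

A = np.random.rand(96, 96)
B = np.random.rand(96, 96)
assert np.allclose(tiled_matmul(A, B), A @ B)
```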
Ratio of peak compute throughput to peak memory bandwidth on a given system; defines the operational intensity at which workloads transition from memory-bound to compute-bound.
Variability affecting Roofline accuracy.
Sequence and regularity of memory accesses.
State where memory throughput is fully utilized.
Combining memory accesses into fewer GPU transactions.
Layered memory system including registers, caches, DRAM, and HBM.
The maximum performance achievable when limited by memory bandwidth.
Total data transferred between memory and compute.
A workload limited by memory bandwidth rather than compute capability.
Ability to overlap multiple memory accesses.
Use of lower-precision formats to improve performance and efficiency.
Roofline showing multiple ceilings for different memory layers.
Performance deviation due to shared cloud infrastructure.
Extension modeling communication bandwidth in distributed systems.
Performance limit imposed by NVLink interconnect bandwidth.
Proportion of active warps relative to hardware limits.
Performance limited by thread scheduling constraints.
Ratio of computational operations to bytes of data moved.
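A back-of-envelope calculation of this ratio for two common kernels, assuming float32 (4 bytes) elements and counting only compulsory memory traffic (perfect caching); the sizes are illustrative:

```python
def operational_intensity(flops: float, bytes_moved: float) -> float:
    """OI = floating-point operations / bytes of data moved."""
    return flops / bytes_moved

N = 1_000_000
# SAXPY: y = a*x + y -> 2 FLOPs per element; reads x and y, writes y.
saxpy_oi = operational_intensity(2 * N, 3 * N * 4)
print(f"SAXPY OI = {saxpy_oi:.3f} FLOP/byte")  # 1/6: firmly memory-bound

n = 1024
# Dense matmul (n x n): 2*n^3 FLOPs; at minimum 3*n^2 elements of traffic.
gemm_oi = operational_intensity(2 * n**3, 3 * n**2 * 4)
print(f"GEMM OI  = {gemm_oi:.1f} FLOP/byte")   # grows with n: compute-bound
```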
Minimum OI required for a workload to become compute-bound.
Applying Roofline to individual computational kernels or operators.
Movement of a workload across Roofline regimes after optimization.
Performance limit imposed by PCIe bandwidth.
Maximum data transfer rate supported by memory hardware.
Theoretical maximum compute throughput advertised by hardware specifications.
The upper bound on performance imposed by hardware limits.
Metric combining Roofline insights with pricing.
Modeling performance across pipeline stages.
Delay caused by data, control, or resource dependencies.
Degree to which compute pipelines remain active.
Performance ceiling where CPU/GPU is capped by TDP or power limits before hitting compute or memory roofs, common on dense or thermally constrained systems.
Loading data into cache before it is needed.
Constructing Roofline plots using profiling tools.
Irregular memory access causing cache inefficiency.
Performance impact from limited register availability.
The specific operational intensity at the intersection of the memory roof and compute roof, marking the boundary between bandwidth-limited and compute-limited regimes.
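The ridge point and the attainable-performance bound it separates, using illustrative (not measured) hardware numbers of 10 TFLOP/s peak compute and 1 TB/s memory bandwidth:

```python
# Assumed, illustrative hardware roofs.
PEAK_FLOPS = 10e12   # FLOP/s (compute roof)
PEAK_BW = 1e12       # bytes/s (memory roof slope)

def attainable(oi: float) -> float:
    """Roofline bound: min(compute roof, memory roof at this OI)."""
    return min(PEAK_FLOPS, PEAK_BW * oi)

ridge_point = PEAK_FLOPS / PEAK_BW  # FLOP/byte; 10.0 for these numbers
print(ridge_point)

# Below the ridge point the memory roof binds; above it, the compute roof.
assert attainable(1.0) == PEAK_BW * 1.0
assert attainable(100.0) == PEAK_FLOPS
```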
Using the model to decide whether to optimize compute or memory.
Applying the model to ML and AI performance analysis.
Application of the model to supercomputing workloads.
A visual performance model that shows the maximum achievable performance of a workload as a function of compute capability and memory bandwidth.
Process of deriving the actual roofs (compute, memory, interconnect) from empirical measurements and updating the model when hardware, drivers, or clocks change.
Improving performance by increasing OI or hardware utilization.
A log–log graph plotting attainable performance versus operational intensity.
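Such a plot can be generated in a few lines. The hardware numbers are assumed for illustration; the two arrays below are the model itself, and the matplotlib section is optional rendering:

```python
import numpy as np

# Assumed roofs: 10 TFLOP/s peak compute, 1 TB/s memory bandwidth.
peak, bw = 10e12, 1e12
oi = np.logspace(-2, 3, 200)       # operational intensity, FLOP/byte
perf = np.minimum(peak, bw * oi)   # attainable FLOP/s at each OI

try:
    import matplotlib.pyplot as plt
    plt.loglog(oi, perf)
    plt.xlabel("Operational intensity (FLOP/byte)")
    plt.ylabel("Attainable performance (FLOP/s)")
    plt.title("Roofline")
    plt.savefig("roofline.png")
except ImportError:
    pass  # plotting is optional; the arrays above already encode the roofs
```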
Using Roofline insights (arithmetic intensity and distance to the roofs) to drive automated search over tile sizes, vector width, and fusion choices to approach the relevant ceilings.
Using the model to prioritize optimization strategies.
Skewed conclusions from incomplete profiling.
Parallel execution of the same instruction across multiple data elements.
Accessing data stored near other recently accessed data.
Using vendor peak numbers instead of measured values.
Roofline based on theoretical hardware specifications.
Standard benchmark for measuring memory bandwidth.
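A rough sketch of how such a bandwidth microbenchmark works, using the STREAM "triad" kernel (`a = b + s*c`) in numpy. The real STREAM benchmark is written in C/Fortran; Python/numpy overheads and temporaries mean this version underestimates true hardware bandwidth:

```python
import time
import numpy as np

N = 10_000_000  # illustrative array length
b = np.random.rand(N)
c = np.random.rand(N)
s = 3.0

start = time.perf_counter()
a = b + s * c                      # STREAM triad: a = b + s*c
elapsed = time.perf_counter() - start

# Triad nominally moves 3 arrays of 8-byte doubles: read b, read c, write a.
gbytes = 3 * N * 8 / 1e9
print(f"Triad bandwidth ~ {gbytes / elapsed:.1f} GB/s (lower bound)")
```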
Regular memory access with fixed spacing.
Roofline behavior as resources increase for fixed problem size.
Measured performance achieved during real workload execution.
Performance loss due to barriers or collective operations.
Reuse of data within short time intervals.
Degree to which GPU tensor cores are used.
FLOPs executed on specialized tensor units in accelerators.
Roofline analysis specific to ML training workloads.
Roofline adapted for transactional or database workloads.
Number of elements processed per vector instruction.
Code structured to maximize SIMD usage.
Effectiveness of SIMD or vector instruction usage.
GPU threads waiting due to memory or execution dependencies.
Roofline behavior as problem size grows with resources.
Amount of data actively used by a workload.