Latency Hiding Archives

A

Async Flush

Writing buffers without blocking execution.

Asynchronous Replication

Replicating data in the background after local writes.

Asynchronous RPC

Non-blocking remote calls for latency hiding.

Asynchronous Checkpointing

Persisting application or training state to durable storage in the background without blocking the main computation, so periodic checkpoints do not introduce visible latency spikes or long pauses.
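
A minimal sketch of the idea, assuming the state is a small JSON-serializable dict; the function name `checkpoint_async` and the temp-file-then-rename scheme are illustrative choices, not a specific framework's API. The snapshot is taken synchronously, and only the slow write runs in the background.

```python
import copy
import json
import os
import threading

def checkpoint_async(state, path):
    """Snapshot `state` now, then write it to `path` on a background thread."""
    # Copy synchronously so later mutations of `state` do not leak into the checkpoint.
    snapshot = copy.deepcopy(state)

    def _write():
        # Write to a temp file, then rename, so readers never see a partial file.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp, path)

    writer = threading.Thread(target=_write)
    writer.start()
    return writer  # the caller joins only when it needs the write to be finished
```

The main loop can keep mutating `state` right after the call returns; joining the returned thread is only needed before exit or before starting the next checkpoint to the same path.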

Asynchronous Copy (cp.async)

PTX instruction (introduced with NVIDIA Ampere GPUs) that copies data from global to shared memory asynchronously, so the transfer overlaps with compute.

Asynchronous Execution

Running operations without blocking the calling thread.

Asynchronous Messaging

Non-blocking communication between services.

Asynchronous Parameter Updates

Updating model parameters without blocking synchronization.

B

Batching

Grouping multiple operations to amortize latency costs.

Branch Misprediction Penalty

Performance loss when speculative execution is incorrect.

Branch Prediction

Guessing execution paths to avoid pipeline flushes.

C

Cache Line Fill Buffer

Temporary storage used while cache lines are fetched from memory.

Cache Warming

Preloading frequently accessed data into caches (CPU cache, DB cache, CDN, application cache) before real traffic arrives, so early requests do not pay cold-cache latency penalties.

Callback-Based Execution

Executing logic after async operations complete.

Concurrency

Multiple tasks making progress during overlapping time periods.

Connection Pooling

Reusing connections to avoid setup latency.
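
As a rough sketch, a pool can be a queue of pre-opened connections. The `ConnectionPool` class and the `connect` factory below are hypothetical names; a real pool would also handle health checks and broken connections.

```python
import queue

class ConnectionPool:
    """Fixed-size pool: setup cost is paid once per connection, not once per request."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())  # open all connections up front

    def acquire(self, timeout=None):
        # Blocks until a connection is free (or raises queue.Empty on timeout).
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        self._idle.put(conn)  # hand the connection back for reuse
```

With a pool of size 2, ten sequential acquire/release cycles still call `connect` only twice.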

Context Switching

Switching execution between threads or processes to utilize idle CPU time.

Cold Start Mitigation

Techniques to reduce startup latency from zero state.

Completion Queue

A queue from which applications poll completion events for previously submitted asynchronous operations (I/O, RPC, GPU work), enabling many in-flight operations to overlap and hide latency.
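
The pattern can be imitated in user space with a thread pool and a plain queue. This is only an analogy for hardware or kernel completion queues; `run_with_completion_queue` and the tag scheme are assumptions made for the sketch.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

def run_with_completion_queue(tasks, max_workers=4):
    """Submit every task, then poll a completion queue for results as they finish."""
    completions = queue.Queue()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for tag, fn in tasks.items():
            fut = pool.submit(fn)
            # Post a completion event (tag + future) when the operation finishes.
            fut.add_done_callback(lambda f, tag=tag: completions.put((tag, f)))
        for _ in tasks:
            tag, fut = completions.get()  # blocks until any operation completes
            results[tag] = fut.result()
    return results
```

Results are consumed in completion order rather than submission order, which is what lets slow operations overlap with fast ones.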

CUDA Streams / GPU Streams

Parallel execution queues enabling overlap of compute and data transfer.

D

Data Locality

Keeping data close to compute to reduce access latency.

Double Buffering

Using two buffers so one is processed while the other is loaded.
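
A small sketch of the overlap, assuming hypothetical `load(i)` and `process(buffer)` callables supplied by the caller. One background thread fills the next buffer while the current one is processed.

```python
from concurrent.futures import ThreadPoolExecutor

def process_double_buffered(load, process, num_chunks):
    """Process chunk i while chunk i+1 is being loaded on a background thread."""
    results = []
    if num_chunks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load, 0)               # fill the first buffer
        for i in range(num_chunks):
            current = pending.result()                 # wait for the filled buffer
            if i + 1 < num_chunks:
                pending = loader.submit(load, i + 1)   # start filling the other buffer
            results.append(process(current))           # overlaps with the load above
    return results
```

At most two buffers are alive at any time: the one being processed and the one being loaded.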

E

Event Loop / Reactor Pattern

An event-driven execution model where a single thread waits on an event loop and dispatches callbacks for I/O readiness, timers, and messages, allowing many concurrent operations to make progress without blocking a thread per request.
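
Python's `asyncio` is one event-loop implementation. In the sketch below, `handle` and `serve_all` are made-up names and `asyncio.sleep` stands in for real I/O; the returned thread ids show that every request was served by the same thread.

```python
import asyncio
import threading

async def handle(request_id, delay):
    await asyncio.sleep(delay)  # stands in for waiting on I/O; yields to the loop
    return request_id, threading.get_ident()

async def serve_all(n):
    # All n waits are in flight at once, multiplexed by one event loop.
    return await asyncio.gather(*(handle(i, 0.01) for i in range(n)))
```

Running `asyncio.run(serve_all(50))` completes in roughly one delay rather than fifty, because the waits overlap.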

Event-Driven Architecture

Triggering work based on events rather than synchronous calls.

F

Futures / Promises

Abstractions representing results of asynchronous operations.
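
For illustration, with Python's `concurrent.futures` (the functions `slow_square` and `overlap_with_future` are placeholders invented for this sketch):

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    return x * x  # placeholder for a slow remote or disk-bound call

def overlap_with_future(x, y):
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(slow_square, x)  # returns a Future immediately
        local = y + 1                      # useful work while the call runs
        return fut.result() + local        # block only when the value is needed
```

The future is a handle to a value that may not exist yet; latency is hidden to the extent that there is independent work to do before `result()` is called.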

G

GPU Latency Hiding

Using massive thread concurrency to hide memory and execution latency on GPUs.

Gradient Accumulation

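Summing gradients over several micro-batches before applying one optimizer step, which in distributed training reduces how often gradients must be synchronized.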
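
A toy sketch with scalar gradients, assuming plain SGD; `train_with_accumulation` and its parameters are illustrative and not taken from any training framework. The counter shows how accumulation cuts the number of synchronized updates.

```python
def train_with_accumulation(grads, accum_steps, lr=0.1, w=0.0):
    """Apply one update per `accum_steps` gradients; return (weight, update count)."""
    syncs = 0
    acc = 0.0
    for i, g in enumerate(grads, 1):
        acc += g  # local accumulation, no communication needed
        if i % accum_steps == 0:
            w -= lr * acc / accum_steps  # one (simulated) synchronized update
            syncs += 1
            acc = 0.0
    # Gradients left over when len(grads) is not a multiple of accum_steps are dropped here.
    return w, syncs
```

With eight gradients and `accum_steps=4`, the weight is updated twice instead of eight times.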

H

Hardware Multithreading

CPU capability to run multiple threads per core to hide stalls.

Hardware Prefetcher

CPU logic that predicts and preloads future memory accesses.

Head-of-Line Blocking

A stalled item at the head of a queue or ordered stream delays everything behind it, even independent work, which defeats latency hiding.

Hedged Requests

Sending a backup copy of a request to another replica when the first is slow, and using whichever response arrives first, to reduce tail latency.
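
One way to sketch this with `asyncio`, assuming `primary` and `backup` are hypothetical zero-argument coroutine functions that query two replicas, and `hedge_after` is a delay chosen by the caller (for example, near the observed p95 latency):

```python
import asyncio

async def hedged(primary, backup, hedge_after):
    """Start `primary`; if it is still running after `hedge_after` seconds, also start `backup`."""
    first = asyncio.ensure_future(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_after)
    if done:
        return first.result()  # fast path: no backup request was needed
    second = asyncio.ensure_future(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the loser to limit wasted work
    return next(iter(done)).result()
```

The delay before hedging matters: hedging immediately doubles load, while hedging only after a slow-percentile delay adds a small fraction of extra requests. If the first finisher raised an exception, this sketch re-raises it instead of waiting for the other request.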

Hidden Contention

Latency hiding that introduces resource contention elsewhere.

I

In-Flight Requests

Multiple outstanding requests used to mask network delays.

Input Pipeline Overlap

Loading ML training data while accelerators compute.

Instruction Scheduling

Reordering instructions to hide long-latency operations.

Instruction Window

Buffer of in-flight instructions that enables latency hiding.

Instruction-Level Parallelism (ILP)

Executing independent CPU instructions in parallel to hide execution delays.

IO Scheduler Reordering

Reordering disk I/O to hide seek latency.

J

K

Kernel Fusion (GPU)

Combining multiple GPU kernels into a single fused kernel so intermediate results stay in registers or shared memory, reducing global memory traffic and launch overhead that would otherwise expose latency.

Kernel Launch Overhead

Fixed latency cost of launching GPU kernels.

L

Latency

The time delay between initiating an operation and receiving its result.

Latency Hiding

Techniques that keep systems productive by performing useful work while waiting for slow operations such as memory access, I/O, or network calls.

Latency Hiding vs Latency Reduction

Hiding overlaps delays, while reduction shortens delays directly.

Latency Masking Failure

Hiding latency so well that bottlenecks remain unnoticed.

Latency Tolerance (GPU)

A GPU's ability to mask latency by keeping large numbers of threads ready to run.

Latency Transparency

Making hidden latency visible for observability and debugging.

Latency-Amortization

Spreading fixed latency costs across larger workloads.
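
The arithmetic is simple enough to state directly. The function below is a sketch with made-up parameter names: a fixed overhead paid once per batch plus a per-item cost.

```python
def per_item_cost(fixed_overhead_ms, per_item_ms, batch_size):
    """Effective cost per item when a fixed overhead is shared by `batch_size` items."""
    return fixed_overhead_ms / batch_size + per_item_ms
```

For a 10 ms fixed overhead and 1 ms of work per item, a batch of 1 costs 11 ms per item, while a batch of 10 costs 2 ms per item.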

Lazy Evaluation

Deferring computation until results are required.
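
Python generators give a compact illustration. `lazy_squares` is a made-up example, and the `log` list exists only to show when work actually happens.

```python
def lazy_squares(values, log):
    """Yield squares one at a time; nothing is computed until a value is requested."""
    for v in values:
        log.append(v)  # records the moment the work is actually done
        yield v * v
```

Creating the generator does no work; each `next()` call computes exactly one element, so values that are never requested are never computed.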

Load-to-Use Latency

Number of cycles between issuing a load and its result being available to a dependent instruction.

Loop Unrolling

Replicating the loop body to reduce loop-control overhead and expose more independent instructions, increasing ILP.

M

Memory Coalescing

Combining memory accesses from the threads of a GPU warp into fewer, wider transactions.

Memory Disambiguation

Allowing loads to bypass stores when it is safe to do so.

Memory-Level Parallelism (MLP)

Issuing multiple memory requests concurrently to hide memory latency.

Micro-Batching

Processing small groups of requests or records together (smaller than full batch processing) to amortize fixed overheads (I/O, RPC, kernel launches) while keeping end-to-end latency within interactive bounds.
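
A common shape for this is "up to N items or T seconds, whichever comes first". The helper below is a sketch; `next_micro_batch`, `max_items`, and `max_wait_s` are names chosen for illustration.

```python
import queue
import time

def next_micro_batch(q, max_items, max_wait_s):
    """Collect up to `max_items` from `q`, waiting at most `max_wait_s` in total."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_items:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time budget spent: ship what we have
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # no more items arrived within the budget
    return batch
```

The size cap bounds the work per batch and the time cap bounds the latency added to the first item in the batch.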

Miss Status Holding Register (MSHR)

Hardware structure tracking outstanding cache misses to enable memory-level parallelism.

N

Network Pipelining

Sending multiple requests before responses arrive.

Non-Blocking Cache

Cache that can service other requests while handling a miss.

Non-Blocking I/O

I/O operations that allow execution to continue instead of waiting for completion.

NVMe Queue Depth

Number of in-flight I/O requests used to hide storage latency.

O

Occupancy (GPU)

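Ratio of active warps on a streaming multiprocessor to the maximum it supports; higher occupancy gives the scheduler more warps to switch to when one stalls.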

Out-of-Order Execution

CPU dynamically reorders instructions to execute independent work while waiting on slow operations.

Over-Concurrency

Excessive parallelism that degrades performance instead of improving it.

Overlap of Computation and I/O

Executing compute tasks while data is being fetched or written.

P

Parallelism

Simultaneous execution of tasks using multiple compute units.

Persistent GPU Kernel

A long-lived GPU kernel that stays resident on the device and pulls work from a queue, reducing kernel launch overhead and allowing continuous overlap of data transfers and compute.

Pipeline Bubble

Idle stages in pipelines that reduce latency-hiding efficiency.

Pipeline Filling

Keeping execution pipelines continuously busy to avoid idle stages.

Pipeline Parallelism (DL)

A deep learning parallelism strategy that splits a model into stages across devices and streams micro-batches through them, overlapping communication and compute to hide inter-device latency.

Pipeline Stall

Idle pipeline cycles caused by unresolved dependencies or waiting for data.

Pipelining

Dividing work into stages that execute concurrently.

Prefetch Queue (ML)

Queue depth used to hide data input latency in ML pipelines.

Proactor Pattern

An asynchronous I/O pattern where operations are initiated once and completion handlers are invoked by the OS or runtime when the work finishes, allowing applications to hide I/O latency without manually polling descriptors.

Provisioning Overlap

Overlapping infrastructure setup with application initialization.

Prefetching

Loading data into cache before it is explicitly requested.

Producer–Consumer Pattern

Decoupling data production and consumption to hide delays.
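
A minimal sketch with a bounded queue between two threads; `run_pipeline` and `transform` are hypothetical names. The buffer absorbs short stalls on either side, so neither has to run in lockstep with the other.

```python
import queue
import threading

def run_pipeline(items, transform, capacity=4):
    buf = queue.Queue(maxsize=capacity)  # bounded buffer decouples the two sides
    out = []
    done = object()                      # sentinel marking the end of the stream

    def producer():
        for item in items:
            buf.put(item)                # blocks only when the buffer is full
        buf.put(done)

    def consumer():
        while True:
            item = buf.get()             # blocks only when the buffer is empty
            if item is done:
                break
            out.append(transform(item))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

The bounded capacity also provides backpressure: a fast producer cannot grow the buffer without limit.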

Q

Query Result Caching

Serving frequent queries from cache instead of recomputing.

Queue Buildup

Excessive buffering that increases tail latency.

R

RDMA (Remote Direct Memory Access)

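Letting a network adapter read or write another machine's memory directly, bypassing the remote CPU and the operating system kernel to cut network latency.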

Read Replicas

Offloading reads to replicas to hide primary database latency.

Read-Ahead

Proactively loading data from storage to hide disk latency.

Register Renaming

Eliminating false data dependencies to increase parallel execution.

Request Coalescing

Merging identical or overlapping requests into one so the underlying operation runs once and its result is shared.
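
One common form, sometimes called single-flight, shares one in-flight fetch among all concurrent callers asking for the same key. The `Coalescer` class below is a sketch under that assumption, with `fetch` as a hypothetical coroutine function.

```python
import asyncio

class Coalescer:
    """Concurrent get() calls for the same key share a single underlying fetch."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._inflight = {}  # key -> task for the fetch currently in progress

    async def get(self, key):
        task = self._inflight.get(key)
        if task is None:
            task = asyncio.ensure_future(self._fetch(key))
            self._inflight[key] = task
            # Forget the task once it finishes so later calls fetch fresh data.
            task.add_done_callback(lambda _: self._inflight.pop(key, None))
        return await task
```

Unlike a cache, nothing is retained after the fetch completes; only callers that overlap in time are merged.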

Runtime JIT Warm-Up

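Exercising hot code paths before real traffic arrives so the JIT compiler has already optimized them, or overlapping background compilation with execution.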

S

Scatter–Gather

Parallelizing data fetches to hide latency.

Scoreboarding

CPU technique that tracks instruction dependencies to issue independent instructions early.

Shared Memory (GPU)

Fast on-chip memory used to hide global memory latency.

Shared Memory Bank Conflict

Contention that reduces effective latency hiding on GPUs.

Simultaneous Multithreading (SMT)

Executing multiple hardware threads on a single core to hide instruction and memory latency.

Software Prefetching

Explicit instructions to fetch data ahead of use.

Spatial Locality

Accessing nearby memory locations together.

Speculative Execution

Executing instructions before outcomes are known to reduce idle cycles.

Speculative Retry

Issuing retries before timeouts expire.

Speculative Task Execution (Distributed)

Running duplicate copies of slow or high-risk tasks (e.g., map/reduce jobs) on additional workers so that the earliest successful result is used, hiding tail latency from stragglers.

Store Buffering

Allowing stores to proceed without blocking execution.

T

Tail Latency Hiding

Techniques focused on reducing p95/p99 response times.

TCP Window Scaling

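TCP option that extends the receive window beyond the 16-bit limit of 65,535 bytes, allowing more unacknowledged data in flight to hide network latency on high bandwidth-delay paths.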

Temporal Locality

Reusing data within a short time window.

Thread Oversubscription

Running more threads than hardware can execute simultaneously.

Thread-Level Parallelism (TLP)

Using multiple threads to keep compute units busy while others stall.

Throughput vs Latency Tradeoff

Improving overall work completion by overlapping tasks even if individual operations remain slow.

Throughput-Oriented Design

Designing systems to stay busy despite slow operations.

Timeout Budgeting / Per-Hop Timeouts

Practice of allocating a fixed latency budget across each hop in a request path and setting per-service timeouts accordingly, so that retries and fallbacks can occur without exceeding end-to-end SLOs.
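
A sketch of one possible allocation rule: split the budget in proportion to per-hop weights after setting aside a reserve for retries or fallbacks. The function name, the weights, and the reserve are all assumptions made for illustration.

```python
def split_budget(total_ms, hop_weights, reserve_ms=0.0):
    """Divide an end-to-end budget across hops in proportion to `hop_weights`."""
    usable = total_ms - reserve_ms
    if usable <= 0:
        raise ValueError("reserve exceeds the total budget")
    weight_sum = sum(hop_weights.values())
    return {hop: usable * w / weight_sum for hop, w in hop_weights.items()}
```

With a 1000 ms budget, a 200 ms reserve, and weights of 1, 2, and 1 for three hops, the per-hop timeouts come out to 200, 400, and 200 ms.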

TLB Miss Latency Hiding

Overlapping address translation delays with useful execution.

Triple Buffering

Extending double buffering with a third buffer so the producer and consumer rarely have to wait on each other.

U

V

Victim Cache

Small fully associative cache that holds lines recently evicted from a larger cache, turning many conflict misses into fast hits.

W

Warm Connections

Keeping connections open to reduce handshake delays.

Warm Pooling

Keeping pre-initialized resources ready for use.

Warm Start

Starting services with pre-initialized state to reduce delay.

Warp Scheduling

Switching between GPU warps when one stalls.

Wavefront Switching

GPU context switching between execution groups to hide stalls.

Work Stealing Scheduler

A scheduling strategy where idle worker threads “steal” tasks from other workers’ queues, improving CPU utilization and hiding latency caused by imbalanced or stalled workloads.
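
The policy can be shown with a single-threaded simulation. This is a sketch of the scheduling rule only, not a concurrent implementation; the round-robin stepping and the "steal from the most loaded worker" victim choice are simplifying assumptions.

```python
from collections import deque

def run_work_stealing(queues):
    """Step workers round-robin; each runs its newest task or steals a victim's oldest."""
    deques = [deque(q) for q in queues]
    executed = [[] for _ in queues]
    while any(deques):
        for w, own in enumerate(deques):
            if own:
                executed[w].append(own.pop())         # owner takes from the tail (LIFO)
                continue
            victim = max(deques, key=len)             # pick the most loaded worker
            if victim:
                executed[w].append(victim.popleft())  # thief takes from the head (FIFO)
    return executed
```

Owners and thieves work at opposite ends of the deque, which in real implementations keeps them from contending on the same tasks most of the time.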

Write Combining

Merging several small writes to nearby addresses into one larger transaction to reduce perceived store latency.

Write-Behind

Deferring writes to avoid blocking application flow.
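
A minimal sketch, assuming a dict stands in for the slow backing store; the `WriteBehindCache` class is hypothetical. In practice `flush` would run on a timer or background thread, and the window between `put` and `flush` is where data can be lost on a crash.

```python
class WriteBehindCache:
    def __init__(self, store):
        self._store = store  # slow backing store (a plain dict in this sketch)
        self._dirty = {}     # writes accepted but not yet persisted

    def put(self, key, value):
        self._dirty[key] = value  # returns immediately; the store is not touched

    def get(self, key):
        if key in self._dirty:
            return self._dirty[key]  # read-your-writes before the flush
        return self._store.get(key)

    def flush(self):
        self._store.update(self._dirty)  # one deferred bulk write
        self._dirty.clear()
```

Repeated writes to the same key before a flush collapse into a single store write.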

X

Y

Z
