Latency Hiding Glossary
Writing buffers without blocking execution.
Replicating data in the background after local writes.
Non-blocking remote calls for latency hiding.
Persisting application or training state to durable storage in the background without blocking the main computation, so periodic checkpoints do not introduce visible latency spikes or long pauses.
Explicit GPU instruction for overlapping memory transfer and compute.
Running operations without blocking the calling thread.
Non-blocking communication between services.
Updating model parameters without blocking synchronization.
Grouping multiple operations to amortize latency costs.
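Performance lost when a speculative guess turns out to be wrong and the speculatively executed work must be discarded.
Guessing which execution path will be taken so the pipeline keeps fetching instructions instead of stalling until the outcome is known.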
Temporary storage used while cache lines are fetched from memory.
Preloading frequently accessed data into caches (CPU cache, DB cache, CDN, application cache) before real traffic arrives, so early requests do not pay cold-cache latency penalties.
Executing logic after async operations complete.
Multiple tasks making progress during overlapping time periods.
Reusing connections to avoid setup latency.
Switching execution between threads or processes to utilize idle CPU time.
Techniques to reduce startup latency when starting from a cold (zero) state.
A queue from which applications poll completion events for previously submitted asynchronous operations (I/O, RPC, GPU work), enabling many in-flight operations to overlap and hide latency.
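The completion-queue pattern described above can be sketched with the standard library. This is a minimal illustration, not a real I/O backend: `submit` and `run_all` are hypothetical names, and `time.sleep` stands in for the asynchronous operation.

```python
import queue
import threading
import time

def submit(op_id, delay, completions):
    # Start the operation in the background; it posts a completion event when done.
    def work():
        time.sleep(delay)  # placeholder for real I/O, RPC, or device work
        completions.put((op_id, f"result-{op_id}"))
    threading.Thread(target=work, daemon=True).start()

def run_all(ops):
    # Submit everything first so all operations are in flight at once,
    # then poll the completion queue for events in the order they finish.
    completions = queue.Queue()
    for op_id, delay in ops:
        submit(op_id, delay, completions)
    return [completions.get() for _ in ops]
```

Because every operation is submitted before any completion is consumed, the total wait is close to the slowest operation rather than the sum of all of them.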
Parallel execution queues enabling overlap of compute and data transfer.
Keeping data close to compute to reduce access latency.
Using two buffers so one is processed while the other is loaded.
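The two-buffer scheme described above can be sketched as follows. This is an illustrative sketch only: `load_chunk` and `process` are hypothetical stand-ins for a slow load and a compute step, and a background thread plays the role of the loader.

```python
import threading

def load_chunk(index, size=4):
    # Hypothetical loader standing in for a slow read (disk, network, DMA).
    return [index * size + i for i in range(size)]

def process(chunk):
    # Hypothetical compute step.
    return sum(chunk)

def run_double_buffered(num_chunks):
    buffers = [None, None]      # one buffer is processed while the other is filled
    results = []
    buffers[0] = load_chunk(0)  # prime the first buffer before the loop
    for i in range(num_chunks):
        current, nxt = i % 2, (i + 1) % 2
        loader = None
        if i + 1 < num_chunks:
            # Fill the other buffer in the background.
            def fill(slot=nxt, index=i + 1):
                buffers[slot] = load_chunk(index)
            loader = threading.Thread(target=fill)
            loader.start()
        results.append(process(buffers[current]))  # overlaps with the load
        if loader is not None:
            loader.join()  # the next buffer must be full before the swap
    return results
```

The `join` before each swap is the only synchronization point; processing and loading never touch the same buffer at the same time.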
An event-driven execution model where a single thread waits on an event loop and dispatches callbacks for I/O readiness, timers, and messages, allowing many concurrent operations to make progress without blocking a thread per request.
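As a rough illustration of the single-threaded event-loop model described above, the sketch below uses Python's `asyncio`. The names `handle_request`, `serve`, and `timed_serve` are hypothetical, and `asyncio.sleep` stands in for non-blocking I/O.

```python
import asyncio
import time

async def handle_request(request_id, io_delay):
    # While this coroutine waits, the event loop runs other requests.
    await asyncio.sleep(io_delay)  # placeholder for non-blocking I/O
    return request_id

async def serve(num_requests, io_delay):
    # All requests share one thread; their waits overlap on the event loop.
    tasks = [handle_request(i, io_delay) for i in range(num_requests)]
    return await asyncio.gather(*tasks)

def timed_serve(num_requests, io_delay):
    start = time.perf_counter()
    results = asyncio.run(serve(num_requests, io_delay))
    return results, time.perf_counter() - start
```

With five requests that each wait 0.1 s, the elapsed time should be close to 0.1 s rather than 0.5 s, because the waits overlap instead of running back to back.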
Triggering work based on events rather than synchronous calls.
Abstractions representing results of asynchronous operations.
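A small sketch of such an abstraction, using `concurrent.futures` from the Python standard library. `slow_square` and `sum_of_squares` are hypothetical names, and the sleep stands in for a slow operation.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_square(x):
    time.sleep(0.05)  # placeholder for a slow operation
    return x * x

def sum_of_squares(values):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # submit() returns immediately with a Future for each pending result.
        futures = [pool.submit(slow_square, v) for v in values]
        # The caller could do unrelated work here while the tasks run.
        # result() blocks only if that particular result is not ready yet.
        return sum(f.result() for f in futures)
```

The caller holds a handle to each result from the moment of submission and only waits at the point where the value is actually needed.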
Using massive thread concurrency to hide memory and execution latency on GPUs.
Reducing synchronization frequency in distributed training.
CPU capability to run multiple threads per core to hide stalls.
CPU logic that predicts and preloads future memory accesses.
Serialized dependencies that prevent effective latency hiding.
Sending backup requests to reduce tail latency.
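One way to sketch the backup-request idea above, assuming two interchangeable replicas. `fetch`, `hedged_fetch`, and the `hedge_after` threshold are hypothetical, and the delays simulate replica response times.

```python
import asyncio

async def fetch(replica, delay):
    await asyncio.sleep(delay)  # simulated replica response time
    return f"response from {replica}"

async def hedged_fetch(primary_delay, backup_delay, hedge_after):
    primary = asyncio.create_task(fetch("primary", primary_delay))
    # Give the primary a head start; most requests should finish here.
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()
    # The primary is slow: send a backup request and take whichever finishes first.
    backup = asyncio.create_task(fetch("backup", backup_delay))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the loser to limit wasted work
    return done.pop().result()
```

Delaying the backup until the primary exceeds `hedge_after` keeps the extra load small, since only the slow tail of requests triggers a second call.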
Latency hiding that introduces resource contention elsewhere.
Multiple outstanding requests used to mask network delays.
Loading ML training data while accelerators compute.
Reordering instructions to hide long-latency operations.
Buffer of in-flight instructions that enables latency hiding.
Executing independent CPU instructions in parallel to hide execution delays.
Reordering disk I/O to hide seek latency.
Combining multiple GPU kernels into a single fused kernel so intermediate results stay in registers or shared memory, reducing global memory traffic and launch overhead that would otherwise expose latency.
Fixed latency cost of launching GPU kernels.
The time delay between initiating an operation and receiving its result.
Techniques that keep systems productive by performing useful work while waiting for slow operations such as memory access, I/O, or network calls.
Latency hiding overlaps delays with useful work, while latency reduction shortens the delays themselves.
Hiding latency so well that bottlenecks remain unnoticed.
A GPU's ability to mask latency by keeping large numbers of threads in flight.
Making hidden latency visible for observability and debugging.
Spreading fixed latency costs across larger workloads.
Deferring computation until results are required.
Delay between loading data from memory and using it in computation.
Increasing instruction-level parallelism (ILP) by reducing loop-control overhead and dependencies.
Combining GPU memory accesses into fewer transactions.
Allowing loads to bypass stores when it is safe to do so.
Issuing multiple memory requests concurrently to hide memory latency.
Processing small groups of requests or records together (smaller than full batch processing) to amortize fixed overheads (I/O, RPC, kernel launches) while keeping end-to-end latency within interactive bounds.
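The size-bounded part of this grouping can be sketched as a generator. `micro_batches` and `max_batch_size` are hypothetical names; a production version would typically also flush on a time limit so a partially filled batch does not wait indefinitely.

```python
def micro_batches(items, max_batch_size):
    # Group a stream of items into small batches so that a fixed per-call
    # overhead (I/O, RPC, kernel launch) is paid once per batch, not per item.
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, partially filled batch
```

For example, seven items with `max_batch_size=3` produce three batches, so the fixed overhead is paid three times instead of seven.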
Hardware structure tracking outstanding cache misses to enable memory-level parallelism.
Sending multiple requests before responses arrive.
Cache that can service other requests while handling a miss.
I/O operations that allow execution to continue instead of waiting for completion.
Number of in-flight I/O requests used to hide storage latency.
Number of active warps available to hide latency.
A CPU technique that dynamically reorders instructions so independent work executes while slow operations are pending.
Excessive parallelism that degrades performance instead of improving it.
Executing compute tasks while data is being fetched or written.
Simultaneous execution of tasks using multiple compute units.
A long-lived GPU kernel that stays resident on the device and pulls work from a queue, reducing kernel launch overhead and allowing continuous overlap of data transfers and compute.
Idle stages in pipelines that reduce latency-hiding efficiency.
Keeping execution pipelines continuously busy to avoid idle stages.
A deep learning parallelism strategy that splits a model into stages across devices and streams micro-batches through them, overlapping communication and compute to hide inter-device latency.
Idle pipeline cycles caused by unresolved dependencies or waiting for data.
Dividing work into stages that execute concurrently.
Queue depth used to hide data input latency in ML pipelines.
An asynchronous I/O pattern where operations are initiated once and completion handlers are invoked by the OS or runtime when the work finishes, allowing applications to hide I/O latency without manually polling descriptors.
Overlapping infrastructure setup with application initialization.
Loading data into cache before it is explicitly requested.
Decoupling data production and consumption to hide delays.
Serving frequent queries from cache instead of recomputing.
Excessive buffering that increases tail latency.
Bypassing CPUs to reduce network and memory latency.
Offloading reads to replicas to hide primary database latency.
Proactively loading data from storage to hide disk latency.
Eliminating false data dependencies to increase parallel execution.
Merging similar requests to reduce overhead.
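One common form of this merging is to let concurrent requests for the same key share a single backend call. The sketch below assumes an `asyncio` application; the `Coalescer` class and its `fetch` callback are hypothetical.

```python
import asyncio

class Coalescer:
    """Share one in-flight backend call among concurrent requests for the same key."""

    def __init__(self, fetch):
        self._fetch = fetch    # async callable: key -> value (assumed)
        self._in_flight = {}   # key -> task for the pending backend call

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First request for this key: start the backend call.
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            # Forget the entry once the call finishes so later requests refetch.
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # Later requests for the same key simply await the existing task.
        return await task
```

Only requests that arrive while a call is still pending are merged, so this is not a cache: results are not retained after the call completes.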
Overlapping compilation with execution.
Parallelizing data fetches to hide latency.
CPU technique that tracks instruction dependencies to issue independent instructions early.
Fast on-chip memory used to hide global memory latency.
Contention that reduces effective latency hiding on GPUs.
Executing multiple hardware threads on a single core to hide instruction and memory latency.
Explicit instructions to fetch data ahead of use.
Accessing nearby memory locations together.
Executing instructions before outcomes are known to reduce idle cycles.
Issuing retries before timeouts expire.
Running duplicate copies of slow or high-risk tasks (e.g., map/reduce jobs) on additional workers so that the earliest successful result is used, hiding tail latency from stragglers.
Allowing stores to proceed without blocking execution.
Techniques focused on reducing p95/p99 response times.
Allowing more in-flight data to hide network latency.
Reusing data within a short time window.
Running more threads than hardware can execute simultaneously.
Using multiple threads to keep compute units busy while others stall.
Improving overall work completion by overlapping tasks even if individual operations remain slow.
Designing systems to stay busy despite slow operations.
The practice of allocating a fixed latency budget across each hop in a request path and setting per-service timeouts accordingly, so that retries and fallbacks can occur without exceeding end-to-end SLOs.
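The budget split described above can be illustrated with simple arithmetic. This is a sketch under assumptions: `allocate_timeouts`, the hop names, the weights, and the retry reserve are all hypothetical, and a proportional split is only one possible policy.

```python
def allocate_timeouts(slo_ms, hop_weights, retry_reserve_ms=0):
    # Set aside part of the end-to-end budget for retries and fallbacks,
    # then split the rest across hops in proportion to their weights.
    usable_ms = slo_ms - retry_reserve_ms
    if usable_ms <= 0:
        raise ValueError("retry reserve must be smaller than the SLO")
    total_weight = sum(hop_weights.values())
    return {
        hop: usable_ms * weight / total_weight
        for hop, weight in hop_weights.items()
    }
```

With a 300 ms SLO, a 60 ms reserve, and weights of 1, 2, and 3 for three hops, the per-hop timeouts come to 40, 80, and 120 ms, which together with the reserve add up to the SLO.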
Overlapping address translation delays with useful execution.
Extending buffering to further reduce idle time.
Small buffer that hides cache conflict misses.
Keeping connections open to reduce handshake delays.
Keeping pre-initialized resources ready for use.
Starting services with pre-initialized state to reduce delay.
Switching between GPU warps when one stalls.
GPU context switching between execution groups to hide stalls.
A scheduling strategy where idle worker threads “steal” tasks from other workers’ queues, improving CPU utilization and hiding latency caused by imbalanced or stalled workloads.
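The stealing policy can be shown with a deterministic, single-threaded simulation. This is a sketch only: `run_work_stealing` is a hypothetical name, workers take turns instead of running in parallel, and the thief picks the longest queue, which is just one possible victim-selection rule.

```python
from collections import deque

def run_work_stealing(queues):
    # queues: one deque of task ids per worker.
    # Returns, for each worker, the list of tasks it ended up executing.
    executed = [[] for _ in queues]
    while any(queues):
        for worker, own in enumerate(queues):
            if own:
                task = own.pop()  # owners take from the tail of their own queue
            else:
                victim = max(queues, key=len)  # idle worker picks the busiest queue
                if not victim:
                    continue  # nothing left to steal
                task = victim.popleft()  # thieves take from the head
            executed[worker].append(task)
    return executed
```

Taking from opposite ends of the deque keeps owner and thief away from each other's tasks, which is why real implementations use the same split to reduce contention.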
Grouping writes to reduce perceived store latency.
Deferring writes to avoid blocking application flow.