Performance Bottleneck Glossary
Limiting incoming traffic to avoid overload.
A standardized index (0–1) that converts response time distributions into a single user-satisfaction score based on “satisfied”, “tolerating” and “frustrated” thresholds.
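As an illustration of how such a score can be derived from raw response times, the sketch below uses the conventional Apdex formula: satisfied samples count fully, tolerating samples count half, and frustrated samples count zero. The function name `apdex`, the 0.5 s threshold and the sample values are assumptions for the example; the 4x multiplier for the tolerating zone follows the usual Apdex convention.

```python
def apdex(response_times_s, threshold_s):
    """Return (satisfied + tolerating / 2) / total for a list of response times."""
    if not response_times_s:
        raise ValueError("need at least one sample")
    # Satisfied: at or below the target threshold T.
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    # Tolerating: above T but no more than 4T.
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    # Everything slower than 4T is frustrated and contributes nothing.
    return (satisfied + tolerating / 2) / len(response_times_s)


# Hypothetical samples: 2 satisfied, 2 tolerating, 1 frustrated.
samples = [0.2, 0.4, 0.6, 1.5, 3.0]
score = apdex(samples, threshold_s=0.5)
```

With these sample values the score works out to 0.6, since (2 + 2/2) / 5 = 0.6.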
Provider-imposed request throttling.
Inefficient application logic limiting performance.
Delay between load increase and resource availability.
Signaling upstream systems to slow down.
Network links operating at maximum throughput.
Expected normal system behavior used for comparison during troubleshooting.
Performance capped due to memory or latency constraints.
Threads waiting on slow I/O operations.
Temporary performance credits being depleted.
The proportion of requests served from cache rather than from the backing store; a low hit ratio often correlates with higher latency and backend saturation.
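The ratio itself is simple arithmetic over hit and miss counters. A minimal sketch, assuming the cache exposes two counters (the names `hits` and `misses` are placeholders, not any particular cache's API):

```python
def hit_ratio(hits, misses):
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


# Hypothetical counters: 900 lookups hit the cache, 100 went to the backing store.
ratio = hit_ratio(hits=900, misses=100)
```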
Required data not found in cache, forcing slower access.
Failure spreading across dependent services.
Excessive small requests between services.
Performance impact of frequent state persistence.
Automated cutoff preventing overload propagation.
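One common way to implement such a cutoff is a small state machine that opens after a run of failures and allows a trial request once a cool-down has passed. The sketch below is a simplified illustration, not a production implementation; the class name, the thresholds and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Open: reject until the cool-down has elapsed, then allow a trial call.
        return self.clock() - self.opened_at >= self.reset_timeout_s

    def record_success(self):
        # A successful call closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Trip (or re-trip after a failed trial call) and restart the cool-down.
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each downstream call and report the outcome with `record_success()` or `record_failure()`.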
Provider-enforced caps restricting scaling.
Cache that has not yet been populated.
Delay in spinning up new cloud resources.
Delay when an application instance starts from zero state.
The number of in-flight requests, sessions or operations being processed at a given time; tightly linked to throughput and latency via Little’s Law.
Excessive connection creation and teardown.
No available connections for new requests.
All database or service connections in use.
Kubernetes API server limiting cluster operations.
Cloud or Kubernetes control plane limiting operations.
Measurement bias that occurs when load generators stop sending new requests while waiting on slow ones, underreporting true latency.
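The effect can be shown with a small simulation of a load generator that waits for each response before sending the next request. Measuring from the actual send time hides the time requests spent waiting behind a slow one; measuring from the intended send time exposes it. The function name, service times and 100 ms interval below are assumptions chosen for illustration.

```python
def measured_vs_corrected(service_times_s, interval_s):
    """Simulate a blocking load generator with a fixed intended send interval.

    Returns (measured, corrected) latency lists:
    - measured: finish time minus actual send time (what a naive tool records)
    - corrected: finish time minus intended send time
    """
    measured, corrected = [], []
    clock = 0.0
    for i, service in enumerate(service_times_s):
        intended = i * interval_s
        # The generator cannot send until the previous request has finished.
        start = max(clock, intended)
        finish = start + service
        measured.append(finish - start)
        corrected.append(finish - intended)
        clock = finish
    return measured, corrected


# Hypothetical run: one 1-second stall among 10 ms responses, one request per 100 ms.
measured, corrected = measured_vs_corrected([0.01, 1.0, 0.01, 0.01], interval_s=0.1)
```

In this run the third request is recorded as taking about 10 ms, but measured from its intended send time it took about 910 ms.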
Performance constrained intentionally to control spend.
When CPU utilization limits system performance.
Time when a virtual CPU waits because the hypervisor is busy.
Intentional reduction in CPU performance due to quotas or thermal limits.
Kubernetes limiting CPU beyond allocated quota.
The longest dependency chain that determines minimum execution time.
GPU waiting on slow data ingestion.
Database limiting application throughput or latency.
Delay in completing storage operations.
Centralized locking limiting scalability.
Tracking request flow across services to identify slow spans.
Congestion between internal services.
The percentage of requests that fail (typically 4xx/5xx or application-level failures), used to detect performance-related faults and SLO violations.
Retry strategy that progressively increases wait times and randomizes delays to reduce retry storms and synchronized thundering herds.
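A minimal sketch of this strategy, using the "full jitter" variant in which each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name, base delay, cap and attempt count are illustrative assumptions, and other jitter variants exist.

```python
import random


def backoff_delays(base_s=0.1, cap_s=10.0, attempts=5, rng=random):
    """Return one randomized delay (in seconds) per retry attempt."""
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles each attempt (0.1, 0.2, 0.4, ...) up to the cap.
        ceiling = min(cap_s, base_s * 2 ** attempt)
        # Randomizing within [0, ceiling] spreads clients out in time.
        delays.append(rng.uniform(0, ceiling))
    return delays


# A seeded generator makes the sequence reproducible for experiments.
delays = backoff_delays(rng=random.Random(0))
```

A retry loop would sleep for each delay in turn (for example with `time.sleep`) before re-issuing the request, and give up once the list is exhausted.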
A scalability limit caused when many upstream services or clients depend on a single downstream service or resource.
Latency amplified by multiple downstream calls.
Visualization of execution hotspots.
Inefficient query scanning entire tables.
Application stall caused by memory cleanup.
High object allocation rate increasing GC frequency.
Latency, traffic, errors, and saturation metrics.
GPU compute or memory limiting ML workloads.
Idle GPU cycles due to inefficient pipelines.
Available spare capacity before a system hits a bottleneck.
Inefficient memory layout reducing usable heap.
Objects moving prematurely to the old generation, causing long pauses.
Excessive task switching that wastes CPU cycles.
Disk receiving disproportionate I/O traffic.
A single key or small set of keys that receive disproportionate traffic, creating localized hotspots in caches, databases or partitions.
Uneven data access concentrating load on a subset of data.
Performance limited by disk read/write operations.
Delay caused by slow container image downloads.
Excessive hardware or network interrupts consuming CPU.
Storage hitting maximum operations per second.
UI stutter caused by backend or rendering delays.
CPU time dominated by kernel operations instead of application work.
The time taken to complete a single operation or request.
The maximum allowable end-to-end latency for a request, subdivided across services on a critical path for performance budgeting.
Metrics describing how long a given percentage of requests take to complete, used to understand median vs tail latency behavior.
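To make the median-versus-tail distinction concrete, the sketch below computes percentiles with the nearest-rank method. This is one of several percentile definitions; monitoring tools often interpolate or use histograms instead. The function name and sample latencies are assumptions for the example.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 18, 17, 19, 250]
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
```

Here the median is 15 ms and p90 is 19 ms, while p99 is 250 ms: a single outlier is invisible in the median but dominates the tail.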
Relationship between latency, throughput, and concurrency.
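For a system in steady state, this relationship is usually written L = λ × W: average concurrency equals average throughput multiplied by average latency. A small sketch of the arithmetic follows; the function names and figures are illustrative assumptions.

```python
def concurrency(throughput_rps, latency_s):
    """Average in-flight requests: L = lambda * W."""
    return throughput_rps * latency_s


def max_throughput(concurrency_limit, latency_s):
    """Throughput ceiling implied by a fixed concurrency limit: lambda = L / W."""
    return concurrency_limit / latency_s


# 200 requests/s at 50 ms average latency keeps about 10 requests in flight.
in_flight = concurrency(throughput_rps=200, latency_s=0.05)

# A pool of 20 workers at 100 ms per request can sustain about 200 requests/s.
ceiling_rps = max_throughput(concurrency_limit=20, latency_s=0.1)
```

The second function shows why rising latency lowers the throughput ceiling when concurrency is capped, for example by a thread or connection pool.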
Intentionally dropping requests to protect stability.
Multiple transactions competing for database locks.
Threads blocked waiting for mutexes or spinlocks.
Delays caused by human-driven scaling.
Insufficient RAM limiting workload execution.
Gradual memory consumption due to unreleased objects.
Constant swapping between RAM and disk due to pressure.
File system or object store metadata becoming the limiting factor.
Very short, intense spikes in traffic or I/O that briefly exceed capacity and cause packet loss, queue buildup or jitter.
Delay during real-time prediction workloads.
Cross-zone communication overhead.
Excessive queries due to inefficient data access patterns.
Port exhaustion preventing new connections.
Performance constrained by bandwidth, latency, or packet loss.
Resource exhaustion at the node level.
One container consuming disproportionate resources.
Congestion between users and backend systems.
Termination of a process or container by the OS or runtime due to memory exhaustion, often triggered by leaks or undersized limits.
Excess resources causing inefficiency without gains.
Excessive locking reducing parallelism.
Dropped packets causing retransmissions and delays.
Frequent page faults causing CPU stalls.
Limited data transfer between CPU and GPU.
What users experience, not just what metrics show.
The component or constraint that limits overall system performance, regardless of optimization elsewhere.
Analyzing system behavior to locate bottlenecks.
CPU or memory caps restricting container performance.
Pods waiting due to insufficient cluster resources.
A condition where low-priority work holds a shared resource needed by high-priority work, degrading performance and responsiveness.
Organizational delays impacting system performance.
Slow infrastructure creation blocking scale-out.
Number of requests waiting to be processed.
Time a request or job spends waiting in a queue (thread pool, message queue, DB connection pool) before being executed.
Delay between primary and replica databases.
A state where a resource is fully utilized and cannot serve additional load.
End-to-end time between a request being sent and a response being received.
Retries increasing system pressure.
Excessive retries worsening load during failures.
Aggressive retries worsening outages.
Too many runnable threads waiting for CPU time.
Reduced performance before JIT or runtime optimizations stabilize.
Cost of encoding or decoding data formats.
A slow service limiting overall system throughput.
Delay in locating service endpoints.
Added latency from sidecars or proxies.
A precise, measurable metric (e.g., p95 latency, success rate) used to quantify system performance or reliability.
The agreed target for an SLI (e.g., p95 latency < 200 ms, 99.9% of the time) that drives performance and capacity decisions.
Performance impact from shared cloud hardware.
Performance capped because workload cannot parallelize.
Performance capped to meet contractual guarantees.
Query consuming excessive time or resources.
TCP ramp-up delay impacting short-lived connections.
Performance loss from many small read/write operations.
Full runtime pause halting all application threads.
Severe degradation caused by heavy swap usage.
Serial downstream calls multiplying latency.
Blocking operations delaying downstream execution.
High-percentile latency (p95, p99) that often defines user experience.
One lost packet delaying all following packets.
Re-sending packets due to loss or congestion.
Threads moving across cores, causing cache inefficiency.
No threads available to handle new requests.
Situation where there are not enough runnable threads to handle incoming work, causing requests to wait indefinitely or time out.
The amount of work a system can process per unit of time.
Bottleneck caused by bandwidth limits rather than IOPS.
Many processes waking simultaneously and overwhelming systems.
Time until the first byte of a response is received.
The slowest span in a distributed trace.
Overlapping transactions blocking progress.
Insufficient resources limiting performance.
State where frequently accessed data is already populated in cache, leading to lower latency and higher throughput vs a cold cache.
Concurrent writes blocking each other.
Writes blocked due to internal storage backpressure.