Inference Workloads Glossary
Comparing multiple model versions using real production traffic.
Using specialized hardware such as GPUs or AI accelerators for inference.
Mechanism that accepts, delays, or rejects inference requests based on capacity, policy, or latency protection goals.
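The accept, delay, or reject decision described above can be sketched as a small capacity check. This is a minimal illustration, not a production policy; the function name `admit` and the parameters `in_flight`, `queued`, `max_in_flight`, and `max_queue` are hypothetical names chosen for the example.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    DELAY = "delay"
    REJECT = "reject"


def admit(in_flight: int, queued: int, max_in_flight: int, max_queue: int) -> Decision:
    """Decide what to do with one incoming request (hypothetical policy)."""
    if in_flight < max_in_flight:
        return Decision.ACCEPT  # free capacity: run the request immediately
    if queued < max_queue:
        return Decision.DELAY   # no free capacity, but the queue still has room
    return Decision.REJECT      # shed load to protect latency of admitted work


if __name__ == "__main__":
    print(admit(in_flight=3, queued=0, max_in_flight=8, max_queue=16))
```

A real system would likely also consider per-tenant policy and latency targets, which this sketch leaves out.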
Hardware and software environment used to run production inference workloads.
Mechanism for slowing or limiting incoming inference traffic when queues, latency, or resource usage exceed safe thresholds.
Running predictions on large datasets at scheduled intervals.
Deployment method that runs two parallel environments and switches traffic between them to minimize downtime.
Gradually releasing a new model version to a small subset of users.
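One common way to pick the "small subset of users" is deterministic hashing, so the same user always sees the same model version. The sketch below assumes string user IDs and a percentage-based rollout; `routes_to_canary` and `canary_percent` are hypothetical names, and the source does not prescribe this method.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Return True when this user falls into the canary bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < canary_percent


if __name__ == "__main__":
    version = "new" if routes_to_canary("user-42", canary_percent=5) else "stable"
    print(version)
```

Raising `canary_percent` step by step widens the rollout without moving users who were already on the new version back to the old one.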
Running inference workloads on cloud-based infrastructure.
Delay experienced when an inference service starts after inactivity.
Maximum number of in-flight inference requests or streams a model instance is allowed to process at once.
Deploying inference systems inside containers for portability and scalability.
Maximum number of tokens a model can process in a single request.
Serving technique that continuously merges new requests into active decoding batches to improve GPU utilization and throughput.
Running model predictions on CPUs.
Changes in input data distribution affecting predictions.
Autoregressive stage of generative inference where the model produces output tokens one step at a time using previously cached attention state.
Running inference workloads across multiple machines to improve scalability.
Running machine learning models on edge devices instead of centralized servers.
Hardware device located near the data source used for local inference.
Running model predictions on edge devices near the data source.
Synthetic or controlled startup request used to trigger model loading, graph compilation, or cache initialization before live traffic arrives.
Secondary model or serving path used when the primary model is unavailable, overloaded, or unsuitable for a request.
Transforming raw data into structured inputs suitable for machine learning models.
Centralized system used to manage features for training and inference.
Structured representation of input data used by models for predictions.
Running model inference using GPUs to accelerate predictions.
Techniques used to reduce GPU memory consumption during inference.
Measurement of how effectively GPU resources are used during inference.
Optimizing computational graphs for efficient inference execution.
Using optimized hardware to increase inference performance.
Optimized inference systems designed for low-latency, high-throughput predictions.
Automatically adjusting compute resources based on prediction demand.
Accumulated pending inference work waiting in queues because incoming demand exceeds current serving capacity.
Cache that holds recently computed predictions so they can be returned without rerunning the model.
Techniques used to reduce infrastructure cost of model serving.
API endpoint that exposes a deployed model for prediction requests.
Runtime software responsible for executing optimized machine learning models.
Routing layer that directs prediction requests to appropriate model services.
Time required for a model to generate predictions after receiving input.
Containerized service responsible for handling prediction requests.
Tracking system performance metrics for deployed models.
End-to-end workflow that receives input data, executes the model, and returns predictions.
Time a request waits before the model begins processing it.
Input data sent to a deployed model to generate predictions.
Output returned by the model after processing an inference request.
System that routes requests to the correct model instance or version.
Performance guarantee defining acceptable latency and reliability.
Number of predictions a system can process per second.
Production workload where a trained machine learning model generates predictions from new input data.
Preparing raw data before feeding it into a machine learning model.
Average or percentile delay between consecutive output tokens during generative inference.
Memory cache storing attention keys and values to accelerate transformer inference.
Removal of cached attention states when memory pressure forces a serving system to discard older or lower-priority contexts.
Running inference using large language models for natural language tasks.
Distributing prediction requests across multiple model instances.
Serving approach where a base model is reused and lightweight adapter weights are loaded or applied per request or tenant.
Amount of memory required to run a model during inference.
Moving model parameters between GPU and CPU memory when required.
Packaged model file containing weights, architecture, and configuration used for deployment.
Converting models into optimized runtime formats for hardware acceleration.
Techniques used to shrink model size without significantly affecting accuracy.
Container image containing model code, dependencies, and runtime.
Process of making a trained model available for inference in a production environment.
Degradation in model performance caused by changing data patterns.
Process of running a trained model on new data to produce predictions or outputs.
Time required to load a model into memory before serving predictions.
Monitoring predictions, latency, and errors across inference systems.
Techniques used to improve inference performance and efficiency.
Removing redundant parameters from a model to reduce size and increase speed.
Reducing numerical precision of model weights to improve speed and memory usage.
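As an illustration of reducing weight precision, the sketch below applies symmetric 8-bit quantization with a single shared scale. This is one simple scheme among many and is assumed here for the example; the helper names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    codes, scale = quantize_int8([-1.0, 0.0, 0.25, 1.0])
    print(codes, dequantize(codes, scale))
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the scale per weight.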
System used to track, store, and manage versions of machine learning models.
Reverting to a previous model version after detecting issues.
Execution environment used to run deployed machine learning models.
Infrastructure responsible for hosting machine learning models and delivering predictions.
Splitting a model across multiple machines or GPUs for distributed inference.
Percentage of compute capacity actively used by inference workloads.
Managing multiple versions of deployed models to enable updates and rollback.
Preloading model weights into memory to reduce latency.
Running inference using models that process multiple data types such as text, images, and audio.
Hosting multiple models within the same inference infrastructure.
Serving inference workloads for multiple applications on shared infrastructure.
Processing predictions on stored datasets without real-time interaction.
Low-latency inference system designed to serve individual requests.
Cross-platform runtime used to execute machine learning models efficiently.
Dividing model layers across GPUs to enable distributed inference.
Output generated by a model after processing input data.
Reusing previous predictions to reduce compute requirements.
Recording requests and outputs for debugging and monitoring.
Initial stage of transformer inference where the model processes the full input context and populates KV cache before token-by-token decoding begins.
Reuse of previously computed prompt or prompt-prefix state so repeated or shared prefixes do not need full recomputation.
Percentage of requests whose prompt or prefix state can be reused from cache instead of recomputed.
Designing prompts to guide model outputs effectively.
Preparing user prompts before language model execution.
Running predictions immediately when a request arrives.
Combining multiple requests into one batch to improve processing efficiency.
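A minimal sketch of grouping pending requests into fixed-size batches follows. It assumes the requests are already collected in a list; `make_batches` and `max_batch_size` are hypothetical names, and real servers typically also apply a wait-time limit that is omitted here.

```python
def make_batches(requests: list, max_batch_size: int) -> list[list]:
    """Split pending requests into consecutive batches of bounded size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        requests[start:start + max_batch_size]
        for start in range(0, len(requests), max_batch_size)
    ]


if __name__ == "__main__":
    print(make_batches(["r1", "r2", "r3", "r4", "r5"], max_batch_size=2))
```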
Number of simultaneous inference requests processed by a system.
Queue that stores incoming inference requests before execution.
Sending model outputs incrementally as they are generated.
Performance tuning applied during model execution to improve inference speed.
Ability of inference systems to handle increasing request volumes.
Running inference workloads without managing underlying infrastructure.
An individual model-serving process or instance capable of handling inference requests as part of a larger serving fleet.
Running a new model alongside production without impacting users.
Inference optimization in which a smaller draft model proposes tokens that are then verified by a larger target model to increase generation speed.
High-percentile inference latency, such as p95 or p99, representing the slowest requests that most affect user experience.
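To make p95 and p99 concrete, the sketch below computes a percentile with the nearest-rank method. Monitoring systems may use other interpolation rules, so treat this as one reasonable convention; the function name `percentile` is hypothetical.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    latencies_ms = [float(v) for v in range(1, 101)]
    print(percentile(latencies_ms, 95), percentile(latencies_ms, 99))
```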
Parameter controlling randomness in generative model outputs.
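The effect of this parameter can be seen by dividing the logits by the temperature before applying softmax. The sketch below assumes plain Python lists of logits; `softmax_with_temperature` is a hypothetical helper name.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]
    for t in (0.5, 1.0, 2.0):
        print(t, softmax_with_temperature(logits, t))
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures flatten the distribution and make sampling more random.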
Improving tensor operations for faster model execution.
Splitting tensor computations across multiple GPUs during inference.
Software environment used to execute tensor computations.
GPU optimization framework used to accelerate deep learning inference workloads.
Time from receiving a generative inference request to producing the first output token.
Total time from request arrival until the full generated output is completed.
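Time to first token, total generation time, and the gap between consecutive tokens can all be derived from per-token timestamps. The sketch below assumes the server records the request arrival time and the emission time of each output token, in seconds; `generation_timings` and its dictionary keys are hypothetical names.

```python
def generation_timings(request_arrival: float, token_times: list[float]) -> dict[str, float]:
    """Derive latency metrics from the arrival time and per-token emit times."""
    if not token_times:
        raise ValueError("at least one output token is required")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "time_to_first_token": token_times[0] - request_arrival,
        "total_generation_time": token_times[-1] - request_arrival,
        "mean_inter_token_latency": sum(gaps) / len(gaps) if gaps else 0.0,
    }


if __name__ == "__main__":
    print(generation_timings(0.0, [0.5, 0.75, 1.0, 1.25]))
```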
Configured limit on total tokens processed or generated for a request, tenant, or system interval to control cost and latency.
Process of generating output tokens during language model inference.
Time required to generate each token during language model inference.
Converting raw text into tokens used by language models.
Metric measuring language model inference speed during text generation.
Sampling strategy that restricts token selection to the K most probable tokens.
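A minimal sketch of this strategy keeps the K highest-probability tokens, renormalizes them, and samples from the result. It assumes the next-token distribution is given as a dictionary of token to probability; `top_k_filter` and `sample` are hypothetical helper names.

```python
import random


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    distribution = {"the": 0.5, "a": 0.3, "zebra": 0.2}
    print(sample(top_k_filter(distribution, k=2), random.Random(0)))
```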
Sampling strategy that selects tokens from the smallest set of most likely tokens whose cumulative probability reaches a threshold p.
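Reading this entry as top-p (nucleus) sampling, which is an assumption because the term label is not present, the candidate set can be built by adding tokens in descending probability until their cumulative mass reaches the threshold. `top_p_filter` is a hypothetical helper name.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    ordered = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept: list[tuple[str, float]] = []
    cumulative = 0.0
    for token, prob in ordered:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the threshold is reached, so stop adding tokens
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


if __name__ == "__main__":
    print(top_p_filter({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, p=0.7))
```

Unlike a fixed K, the number of candidates adapts to the distribution: a confident model keeps few tokens, an uncertain one keeps many.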
Pre-initialized model instance ready to handle requests.