GPU Glossary
The process of using a trained AI model to make predictions on new data, typically in real time or in batch mode.
AMP is a feature in frameworks like PyTorch and TensorFlow that automatically mixes FP16 and FP32 operations to speed up training while preserving model accuracy, with minimal code changes.
Tracks when and how GPU resources are accessed. It is important for auditing, usage reporting, and compliance.
The automatic adjustment of GPU resources based on workload demand to optimize cost and performance.
A physical GPU server in the cloud providing direct hardware access without virtualization overhead, maximizing performance and minimizing latency.
Estimating and managing how many GPUs you’ll need over time for consistent performance and cost control.
Saves model checkpoints in parts across devices or storage to avoid memory limits and enable faster recovery.
A Cloud GPU is a graphics processing unit that you can use remotely over the internet, without owning the hardware. It speeds up heavy tasks like AI training, video rendering, and data processing by providing powerful computing from a cloud provider.
Packaging applications and their dependencies into containers that can run on cloud GPU instances for consistent, portable deployment.
A lightweight container runtime interface used in Kubernetes, commonly paired with NVIDIA plugins to run GPU-enabled containers.
CUDA is a platform by NVIDIA that allows developers to use GPUs for general-purpose computing, not just graphics. It’s commonly used to speed up AI workloads, data processing, and scientific research.
cuDNN is a GPU-accelerated library from NVIDIA that speeds up deep learning operations like convolution, pooling, and activation. It’s used behind the scenes by frameworks like TensorFlow and PyTorch.
Data parallelism means running the same AI model on multiple GPUs, each handling a different batch of data. After each step, the GPUs sync their results to keep the model updated.
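As an illustrative sketch (plain Python, no real GPUs; the model, data, and function names are all hypothetical), the sync step can be modeled as workers computing gradients on their own batches and then averaging them, so every replica applies the identical update:

```python
# Toy sketch of data parallelism: each "worker" computes a gradient on its
# own data shard, then the workers average (all-reduce) their gradients so
# every model replica stays in sync. The model is a single weight w fit to
# y = 2x; on real hardware the averaging would be done by a library like NCCL.

def gradient(w, batch):
    # d/dw of mean squared error for y_pred = w * x
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(grads):
    # stands in for an all-reduce across GPUs
    return sum(grads) / len(grads)

w = 0.0
shards = [[(1.0, 2.0), (2.0, 4.0)],   # batch on "GPU 0"
          [(3.0, 6.0), (4.0, 8.0)]]   # batch on "GPU 1"

for _ in range(200):
    local_grads = [gradient(w, s) for s in shards]  # parallel step
    g = all_reduce_mean(local_grads)                # sync step
    w -= 0.01 * g                                   # identical update everywhere

print(round(w, 3))  # converges toward 2.0
```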
DCGM is a tool from NVIDIA that monitors the health, performance, and reliability of GPUs in data centers or cloud setups. It helps teams track usage, detect issues, and keep GPUs running smoothly.
Software libraries like TensorFlow, PyTorch, or MXNet that provide tools to build and train neural networks, optimized for GPU acceleration.
A device plugin is a Kubernetes component that makes GPUs (or other hardware) available to containers. It helps Kubernetes detect and manage GPUs just like it does with CPUs or memory.
NVIDIA’s high-performance servers built for deep learning and AI workloads, combining multiple powerful GPUs in one machine.
A strategy where the same model is copied across multiple GPUs, each handling a slice of the data, and syncing gradients after every step.
Processing data close to the source (e.g., sensors or devices) instead of in a centralized data center. Reduces latency and bandwidth usage.
Elastic GPU allocation lets cloud workloads automatically add or remove GPU resources as needed, without restarting the job. It helps balance performance with cost and flexibility.
Fault isolation ensures that if one GPU task fails, it doesn’t affect others running on the same hardware. Technologies like MIG and passthrough help keep workloads separated and safe.
A distributed AI training approach where models are trained across multiple decentralized devices or cloud GPUs without sharing raw data, enhancing privacy.
FP16 (half-precision) and BF16 (brain floating point) are lower-precision number formats used in deep learning. They help speed up training and reduce memory usage without significantly affecting accuracy.
A class of AI that generates text, images, audio, or code from learned patterns. Includes models like ChatGPT, DALL·E, or Stable Diffusion.
A specialized processor optimized for parallel processing, originally for graphics rendering, now widely used for AI, scientific computing, and data analytics.
GPU autoscaling automatically adjusts the number of GPUs based on workload demand. It helps balance performance and cost by scaling up during peak times and down when things are idle.
A scenario where the GPU waits on CPU, disk, or network I/O, limiting performance despite available compute.
GPU checkpointing saves the progress of a training job, including model state and GPU memory, so it can resume later without starting over. It’s useful for long or interruptible tasks.
The time taken to initialize GPU instances or warm up models before they serve requests.
A GPU DaemonSet is a background service in Kubernetes that runs on all GPU-enabled nodes. It’s used to monitor GPU usage, health, and performance across the cluster.
Technology enabling direct memory access between GPUs and network devices, bypassing CPU to reduce latency and increase throughput in distributed GPU workloads.
GPU IaaS is a cloud service model where you can rent raw GPU-powered infrastructure on demand. It gives full control over the environment, making it ideal for custom AI, ML, or HPC workloads.
A GPU Instance is a virtual machine in the cloud that comes with one or more GPUs. It’s used for running tasks that need high-speed computing, like training AI models or rendering graphics.
A GPU kernel is a small program written to run on a GPU. It tells the GPU what operations to perform on data, often in parallel, making it essential for tasks like AI, simulations, and graphics processing.
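A CPU-side sketch of how a kernel is launched and indexed (pure Python; real kernels would be written in CUDA C++ and run on thousands of hardware threads — the function names here are illustrative):

```python
# Each "thread" computes one element of y = a*x + y (SAXPY). On a GPU,
# threads are organized into blocks and run in parallel; here a sequential
# loop emulates a CUDA-style <<<grid_dim, block_dim>>> launch.

def saxpy_kernel(thread_id, a, x, y):
    # body executed by every thread
    if thread_id < len(x):      # guard against out-of-range thread ids
        y[thread_id] = a * x[thread_id] + y[thread_id]

def launch(kernel, grid_dim, block_dim, *args):
    # emulate the launch: one loop here, parallel hardware threads on a GPU
    for block in range(grid_dim):
        for thread in range(block_dim):
            kernel(block * block_dim + thread, *args)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 10.0, 10.0, 10.0, 10.0]
launch(saxpy_kernel, 2, 4, 2.0, x, y)   # 8 threads cover 5 elements
print(y)  # [12.0, 14.0, 16.0, 18.0, 20.0]
```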
The rate at which data can be read from or written to GPU memory, critical for performance in data-intensive workloads.
GPU memory fragmentation happens when memory is used in small, scattered chunks, making it hard to allocate space for larger tasks. This can lead to out-of-memory errors even if total memory isn’t fully used.
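A toy model of why this happens (plain Python; the sizes and helper name are made up for illustration): total free memory can be large while no single contiguous gap fits a big allocation.

```python
# Sketch of GPU memory fragmentation: 10 "GB" are free in total, but the
# largest contiguous gap is only 4, so a 5 GB allocation fails anyway.

def largest_free_block(size, allocations):
    """Largest contiguous free gap in a memory of `size` units,
    given (offset, length) allocations."""
    free, cursor = [], 0
    for off, length in sorted(allocations):
        if off > cursor:
            free.append(off - cursor)
        cursor = max(cursor, off + length)
    if cursor < size:
        free.append(size - cursor)
    return max(free, default=0)

# 16 GB device with three small tensors scattered across the address space
allocs = [(0, 2), (5, 2), (10, 2)]            # 6 GB in use
total_free = 16 - sum(l for _, l in allocs)   # 10 GB free overall
print(total_free, largest_free_block(16, allocs))  # 10 4
```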
Managing and automating GPU resource allocation, scaling, and fault recovery across clusters.
GPU passthrough allows a virtual machine to use a physical GPU directly, without sharing it with others. It offers better performance and isolation, especially for high-security or latency-sensitive tasks.
Set limits on how many GPUs a user, team, or project can consume to avoid resource hogging or unexpected costs.
Malicious software embedded within GPU firmware or drivers that can manipulate GPU operations stealthily, evading traditional detection.
A GPU scheduler in Kubernetes assigns GPU resources to containers or pods based on availability and workload needs. It helps ensure the right jobs run on the right GPU nodes efficiently.
The fluctuating market for on-demand GPU pricing, often based on demand/supply cycles across providers.
GPU TCO is the total cost of using GPU infrastructure over time. It includes not just usage fees, but also setup, scaling, maintenance, downtime, and support costs.
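A back-of-the-envelope sketch of the calculation (every dollar figure below is hypothetical, chosen only to show that non-usage costs can be a substantial slice of the total):

```python
# Toy monthly GPU TCO: usage fees plus the costs that are easy to forget.
hours_per_month = 730
costs = {
    "compute": 8 * 2.50 * hours_per_month,  # 8 GPUs at a made-up $2.50/hr
    "storage_and_network": 1200,
    "engineering_ops": 4000,                # setup, scaling, maintenance
    "support_contract": 800,
}
tco = sum(costs.values())
print(f"monthly TCO: ${tco:,.0f}")  # monthly TCO: $20,600
```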
Measures how effectively GPU compute is being used across time. High idle time means wasted spend.
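The metric itself is simple, busy time divided by wall-clock time, as in this sketch (interval values are illustrative):

```python
# Sketch: GPU utilization over a time window, from recorded busy intervals.
def utilization(busy_intervals, window):
    busy = sum(end - start for start, end in busy_intervals)
    return busy / window

# one-hour window; the GPU was busy for three bursts totalling 21 minutes
busy = [(0, 10), (25, 30), (42, 48)]  # minutes
u = utilization(busy, window=60)
print(f"{u:.0%} utilized, {1 - u:.0%} idle spend")  # 35% utilized, 65% idle spend
```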
GPU virtualization lets a single physical GPU be shared across multiple virtual machines or containers. It helps maximize GPU usage by allowing different users or tasks to run on the same hardware.
Enables GPUs to read data directly from storage (like NVMe or parallel file systems), bypassing CPU to reduce latency.
Pre-built containers bundling ML frameworks and GPU drivers to ensure compatibility and portability across cloud environments.
A method that lets you simulate large batch training on limited GPU memory by summing gradients over several smaller batches before updating weights.
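A minimal sketch of the idea (plain Python, single-weight model, all names hypothetical): gradients from several micro-batches are summed, and the weights are updated only once per accumulation cycle.

```python
# Toy gradient accumulation: micro-batches of one sample each, with a
# single optimizer step per 4 accumulated gradients, emulating batch size 4
# on memory that could only hold one sample at a time.

def grad(w, x, y):
    return 2 * (w * x - y) * x   # d/dw of squared error for y_pred = w*x

data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # y = 3x
w, accum_steps, lr = 0.0, 4, 0.01

for _ in range(300):
    acc = 0.0
    for x, y in data[:accum_steps]:   # micro-batches
        acc += grad(w, x, y)          # accumulate instead of updating
    w -= lr * (acc / accum_steps)     # one weight update per cycle

print(round(w, 2))  # approaches 3.0
```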
A dedicated hardware device for secure cryptographic key management, sometimes used alongside GPUs for sensitive cloud workloads.
The use of powerful computing resources, including cloud GPUs, to solve complex scientific and engineering problems through parallel processing.
An inference pipeline is a system that runs a trained AI model to make predictions on new data. It handles tasks like input processing, running the model on a GPU, and sending back results, usually in real time.
Inference is the process of using a trained AI model to make predictions on new, unseen data. It’s what happens when the model is put into production to generate real results.
A high-throughput, low-latency networking protocol widely used in HPC and cloud GPU clusters to connect servers and GPUs efficiently.
Kubernetes extensions and plugins that enable efficient scheduling, management, and scaling of GPU resources in containerized environments.
A fine-tuning method that adds lightweight trainable layers to pre-trained models, making them adaptable with minimal GPU usage.
MIG is a technology that splits one powerful GPU into smaller, isolated units. Each unit runs its own task without affecting the others, making it ideal for running multiple jobs at once.
Technique combining lower-precision (FP16/BF16) and FP32 operations to speed up training and reduce memory use.
Model parallelism is a method where a large AI model is split across multiple GPUs, so each GPU handles a different part of the model. It’s used when a model is too big to fit on a single GPU.
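A toy sketch of the split (plain Python; the "layers" are trivial functions and the device assignment is hypothetical): the first layers live on one device, the rest on another, and the activation crosses the device boundary mid-forward-pass.

```python
# Sketch of model parallelism: one model's layers are partitioned across
# two "devices"; activations flow from one partition to the next.

device0_layers = [lambda v: v * 2, lambda v: v + 1]   # lives on "GPU 0"
device1_layers = [lambda v: v * v]                    # lives on "GPU 1"

def forward(x):
    for layer in device0_layers:
        x = layer(x)      # computed on GPU 0
    # activation transferred GPU 0 -> GPU 1 here (e.g. over NVLink)
    for layer in device1_layers:
        x = layer(x)      # computed on GPU 1
    return x

print(forward(3))  # (3*2 + 1)**2 = 49
```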
A group of GPUs networked together to work in parallel, enabling large-scale distributed computing for AI training or HPC.
Keeps workloads from different users or teams securely separated when sharing the same GPU hardware.
NCCL is a library by NVIDIA that helps GPUs talk to each other quickly when training models across multiple GPUs. It handles tasks like syncing data and gradients efficiently during distributed training.
NGC is NVIDIA’s cloud platform that offers ready-to-use GPU containers, AI models, and software tools. It helps developers and researchers run GPU workloads faster without setup hassles.
NVIDIA’s tool to profile and analyze GPU applications, showing performance bottlenecks, memory usage, and kernel behavior.
Describes how memory is accessed in multi-CPU systems; GPU placement relative to CPUs affects latency and performance.
A workhorse GPU built for large-scale AI training, inference, and HPC. Known for its flexibility with MIG, massive memory, and Tensor Core performance.
Entry-level GPU optimized for lightweight AI inference and edge computing tasks. Offers power efficiency and compact form factor.
Balanced for both AI inference and training workloads. Offers multi-instance GPU (MIG) support and is optimized for mainstream data centers.
Purpose-built AI supercomputers integrating multiple GPUs, high-speed interconnects, and optimized software stacks for enterprise AI workloads.
NVIDIA’s flagship GPU for AI and HPC, offering up to 4x faster training and 30x faster inference than the A100. Built on the Hopper architecture.
Next-gen successor to H100, optimized for massive inference throughput and memory bandwidth. Tailored for Gen AI and LLMs.
A universal GPU for video processing, AI inference, and graphics workloads. Successor to the T4, designed for efficient, low-latency performance.
A high-performance data center GPU designed for graphics, AI, and simulation. Successor to RTX A6000, with better inference and ray tracing capabilities.
A pro-level graphics card built for content creation, rendering, and light AI workloads. Positioned for designers and engineers.
Part of the Ada Lovelace generation, this GPU provides enhanced ray tracing, faster inference, and better power efficiency.
A high-end workstation GPU offering top-tier graphics and compute for professional content creation and simulation workloads.
A powerful GPU for AI, data science, and creative professionals. Offers ample VRAM and RT Cores for ray tracing.
NVLink is a high-speed connection that lets GPUs share data with each other faster than traditional PCIe connections. It improves performance in multi-GPU setups.
A GPU metric comparing the number of active threads (or warps) to the hardware maximum; it helps assess how efficiently the GPU is being used.
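The calculation itself is a simple ratio, as in this sketch (the warp counts below are illustrative, not tied to a specific GPU model):

```python
# Sketch of the occupancy calculation: active warps per streaming
# multiprocessor (SM) divided by the hardware maximum.

def occupancy(active_warps_per_sm, max_warps_per_sm):
    return active_warps_per_sm / max_warps_per_sm

# a kernel limited (e.g. by register use) to 32 resident warps
# on an SM that supports 64
print(f"{occupancy(32, 64):.0%}")  # 50%
```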
GPU resources billed per hour with no upfront commitment. Great for flexibility and experimentation.
A flexible pricing model where users pay only for the GPU resources they consume, typically billed by usage time.
Peer GPU communication allows GPUs to share data directly with each other, without going through the CPU. This speeds up multi-GPU tasks like training large AI models.
Attachments or mounts that keep data across sessions or pod restarts in GPU workloads, often used for caching or saving models.
A preemptible GPU is a low-cost GPU instance that can be taken back by the cloud provider at any time. It’s ideal for flexible, non-critical tasks like batch training or testing.
Removes less important weights from a neural network to reduce its size and improve inference speed without major accuracy drop.
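One common variant is magnitude pruning, sketched here in plain Python (the weights and the `prune` helper are hypothetical): the smallest-magnitude fraction of weights is zeroed out.

```python
# Toy magnitude pruning: zero the smallest `fraction` of weights by
# absolute value, producing a sparser network.

def prune(weights, fraction):
    k = int(len(weights) * fraction)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.1, -0.5, 0.05, 0.9, -0.2, 0.3]
print(prune(weights, 0.5))  # [0.0, -0.5, 0.0, 0.9, 0.0, 0.3]
```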
A technique that converts model weights and activations from FP32 to lower precision (like INT8) to reduce size and speed up inference, often with minor accuracy loss.
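A sketch of the simplest scheme, symmetric quantization (plain Python; the weight values are made up): values are mapped into the INT8 range [-127, 127] with a scale derived from the largest magnitude, then mapped back at inference time.

```python
# Toy symmetric INT8 quantization: FP32 values -> small integers -> back.
# Each quantized value fits in one byte instead of four.

def quantize(values):
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -0.51, 0.98, -1.27]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q)  # [2, -51, 98, -127]
print([round(r, 3) for r in restored])  # close to the originals
```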
A rendering technique that simulates light behavior for realistic images, accelerated by GPUs in cloud rendering services.
A GPU instance reserved for long-term use (monthly/yearly) at a lower hourly rate.
A pricing model where users commit to long-term GPU usage in exchange for discounted rates, ideal for steady, predictable workloads.
A method to restrict access to cloud GPU resources based on user roles, ensuring secure and compliant operations.
A security exploit that extracts sensitive data by analyzing indirect information such as timing or power consumption from GPUs in shared cloud environments.
Spot GPU instances are unused GPU resources offered at a lower price. They’re great for flexible, non-urgent tasks but can be stopped by the provider at any time.
A Tensor Core is a special type of processing unit inside NVIDIA GPUs designed to handle large matrix operations faster. It’s optimized for deep learning tasks like training and inference.
Tensor Cores are special processing units inside NVIDIA GPUs that speed up AI tasks by handling matrix math faster than regular cores. They’re designed to boost training and inference for deep learning.
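The fused operation a Tensor Core performs in hardware is a matrix multiply-accumulate on a small tile, D = A x B + C. This plain-Python sketch shows the math on a 2x2 tile (real Tensor Cores operate on larger tiles in mixed precision):

```python
# Scalar sketch of a Tensor Core's matrix multiply-accumulate: D = A x B + C.

def tile_mma(A, B, C):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j]
             for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
print(tile_mma(A, B, C))  # [[20, 23], [44, 51]]
```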
TensorRT is an SDK from NVIDIA that helps make AI models run faster and more efficiently, especially during inference. It works by optimizing models so they use less memory and respond quicker.
The phase where AI models learn from data, requiring intensive GPU computation for matrix operations and optimization algorithms.
An NVIDIA server platform that runs AI inference models on GPUs with multi-framework support, batching, and performance optimization.
A virtual slice of a physical GPU assigned to a VM or container. It allows multiple users to share one GPU efficiently.
A virtual machine is a software-based computer that runs on physical hardware but acts like a real, separate system. It lets you run multiple operating systems and workloads on the same server, and in the cloud it can be configured with GPU resources for isolated, flexible workloads.
A group of 32 threads that execute in parallel on NVIDIA GPUs. Understanding warps helps write efficient CUDA code.
A method used in multi-GPU training to split model states across devices, reducing memory duplication and scaling large models.
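A minimal sketch of the partitioning idea (plain Python; the parameter names and round-robin assignment are illustrative): per-parameter state is divided across devices instead of replicated on every one, so each device holds only a fraction of it.

```python
# Toy state sharding: optimizer state for 6 parameters split across 3
# "devices". Under plain data parallelism every device would hold all 6.

params = ["w0", "w1", "w2", "w3", "w4", "w5"]
num_devices = 3

def shard(params, num_devices):
    # round-robin assignment of per-parameter state to devices
    shards = [[] for _ in range(num_devices)]
    for i, p in enumerate(params):
        shards[i % num_devices].append(p)
    return shards

shards = shard(params, num_devices)
print(shards)  # each device stores 1/3 of the state
```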