GPU Glossary
The process of using a trained AI model to make predictions on new data, typically in real time or in batch mode.
AMP is a feature in frameworks like PyTorch and TensorFlow that automatically mixes FP16 and FP32 operations to speed up training while preserving model accuracy, with minimal code changes.
Tracks when and how GPU resources are accessed. It is important for auditing, usage reporting, and compliance.
The automatic adjustment of GPU resources based on workload demand to optimize cost and performance.
A physical GPU server in the cloud providing direct hardware access without virtualization overhead, maximizing performance and minimizing latency.
Estimating and managing how many GPUs you’ll need over time for consistent performance and cost control.
Saves model checkpoints in parts across devices or storage to avoid memory limits and enable faster recovery.
A Cloud GPU is a graphics processing unit that you can use remotely over the internet, without owning the hardware. It speeds up heavy tasks like AI training, video rendering, and data processing by providing powerful computing from a cloud provider.
Packaging applications and their dependencies into containers that can run on cloud GPU instances for consistent, portable deployment.
A lightweight container runtime interface used in Kubernetes, commonly paired with NVIDIA plugins to run GPU-enabled containers.
CUDA is a platform by NVIDIA that allows developers to use GPUs for general-purpose computing, not just graphics. It’s commonly used to speed up AI workloads, data processing, and scientific research.
cuDNN is a GPU-accelerated library from NVIDIA that speeds up deep learning operations like convolution, pooling, and activation. It’s used behind the scenes by frameworks like TensorFlow and PyTorch.
Data parallelism means running the same AI model on multiple GPUs, each handling a different batch of data. After each step, the GPUs sync their results to keep the model updated.
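As an illustrative sketch (plain Python, no real GPUs; the model, data, and function names are all hypothetical), the sync step can be modeled as workers computing gradients on their own batches and then averaging them, so every replica applies the identical update:

```python
# Toy sketch of data parallelism: each "worker" computes a gradient on its
# own data shard, then the workers average (all-reduce) their gradients so
# every model replica stays in sync. The model is a single weight w fit to
# y = 2x; on real hardware the averaging would be done by a library like NCCL.

def gradient(w, batch):
    # d/dw of mean squared error for y_pred = w * x
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(grads):
    # stands in for an all-reduce across GPUs
    return sum(grads) / len(grads)

w = 0.0
shards = [[(1.0, 2.0), (2.0, 4.0)],   # batch on "GPU 0"
          [(3.0, 6.0), (4.0, 8.0)]]   # batch on "GPU 1"

for _ in range(200):
    local_grads = [gradient(w, s) for s in shards]  # parallel step
    g = all_reduce_mean(local_grads)                # sync step
    w -= 0.01 * g                                   # identical update everywhere

print(round(w, 3))  # converges toward 2.0
```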
DCGM is a tool from NVIDIA that monitors the health, performance, and reliability of GPUs in data centers or cloud setups. It helps teams track usage, detect issues, and keep GPUs running smoothly.
Software libraries like TensorFlow, PyTorch, or MXNet that provide tools to build and train neural networks, optimized for GPU acceleration.
A device plugin is a Kubernetes component that makes GPUs (or other hardware) available to containers. It helps Kubernetes detect and manage GPUs just like it does with CPUs or memory.
NVIDIA’s high-performance servers built for deep learning and AI workloads, combining multiple powerful GPUs in one machine.
A strategy where the same model is copied across multiple GPUs, each handling a slice of the data, and syncing gradients after every step.
Processing data close to the source (e.g., sensors or devices) instead of in a centralized data center. Reduces latency and bandwidth usage.
Elastic GPU allocation lets cloud workloads automatically add or remove GPU resources as needed, without restarting the job. It helps balance performance with cost and flexibility.
Fault isolation ensures that if one GPU task fails, it doesn’t affect others running on the same hardware. Technologies like MIG and passthrough help keep workloads separated and safe.
A distributed AI training approach where models are trained across multiple decentralized devices or cloud GPUs without sharing raw data, enhancing privacy.
FP16 (half-precision) and BF16 (brain floating point) are lower-precision number formats used in deep learning. They help speed up training and reduce memory usage without significantly affecting accuracy.
A class of AI that generates text, images, audio, or code from learned patterns. Includes models like ChatGPT, DALL·E, or Stable Diffusion.
A specialized processor optimized for parallel processing, originally for graphics rendering, now widely used for AI, scientific computing, and data analytics.
GPU autoscaling automatically adjusts the number of GPUs based on workload demand. It helps balance performance and cost by scaling up during peak times and down when things are idle.
A scenario where the GPU waits on CPU, disk, or network I/O, limiting performance despite available compute.
GPU checkpointing saves the progress of a training job, including model state and GPU memory, so it can resume later without starting over. It’s useful for long or interruptible tasks.
The time taken to initialize GPU instances or warm up models before they serve requests.
A GPU DaemonSet is a background service in Kubernetes that runs on all GPU-enabled nodes. It’s used to monitor GPU usage, health, and performance across the cluster.
Technology enabling direct memory access between GPUs and network devices, bypassing CPU to reduce latency and increase throughput in distributed GPU workloads.
GPU IaaS is a cloud service model where you can rent raw GPU-powered infrastructure on demand. It gives full control over the environment, making it ideal for custom AI, ML, or HPC workloads.
A GPU Instance is a virtual machine in the cloud that comes with one or more GPUs. It’s used for running tasks that need high-speed computing, like training AI models or rendering graphics.
A GPU kernel is a small program written to run on a GPU. It tells the GPU what operations to perform on data, often in parallel, making it essential for tasks like AI, simulations, and graphics processing.
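A CPU-side sketch of how a kernel is launched and indexed (pure Python; real kernels would be written in CUDA C++ and run on thousands of hardware threads — the function names here are illustrative):

```python
# Each "thread" computes one element of y = a*x + y (SAXPY). On a GPU,
# threads are organized into blocks and run in parallel; here a sequential
# loop emulates a CUDA-style <<<grid_dim, block_dim>>> launch.

def saxpy_kernel(thread_id, a, x, y):
    # body executed by every thread
    if thread_id < len(x):      # guard against out-of-range thread ids
        y[thread_id] = a * x[thread_id] + y[thread_id]

def launch(kernel, grid_dim, block_dim, *args):
    # emulate the launch: one loop here, parallel hardware threads on a GPU
    for block in range(grid_dim):
        for thread in range(block_dim):
            kernel(block * block_dim + thread, *args)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 10.0, 10.0, 10.0, 10.0]
launch(saxpy_kernel, 2, 4, 2.0, x, y)   # 8 threads cover 5 elements
print(y)  # [12.0, 14.0, 16.0, 18.0, 20.0]
```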
The rate at which data can be read from or written to GPU memory, critical for performance in data-intensive workloads.
GPU memory fragmentation happens when memory is used in small, scattered chunks, making it hard to allocate space for larger tasks. This can lead to out-of-memory errors even if total memory isn’t fully used.
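A toy model of why this happens (plain Python; the sizes and helper name are made up for illustration): total free memory can be large while no single contiguous gap fits a big allocation.

```python
# Sketch of GPU memory fragmentation: 10 "GB" are free in total, but the
# largest contiguous gap is only 4, so a 5 GB allocation fails anyway.

def largest_free_block(size, allocations):
    """Largest contiguous free gap in a memory of `size` units,
    given (offset, length) allocations."""
    free, cursor = [], 0
    for off, length in sorted(allocations):
        if off > cursor:
            free.append(off - cursor)
        cursor = max(cursor, off + length)
    if cursor < size:
        free.append(size - cursor)
    return max(free, default=0)

# 16 GB device with three small tensors scattered across the address space
allocs = [(0, 2), (5, 2), (10, 2)]            # 6 GB in use
total_free = 16 - sum(l for _, l in allocs)   # 10 GB free overall
print(total_free, largest_free_block(16, allocs))  # 10 4
```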
Managing and automating GPU resource allocation, scaling, and fault recovery across clusters.
GPU passthrough allows a virtual machine to use a physical GPU directly, without sharing it with others. It offers better performance and isolation, especially for high-security or latency-sensitive tasks.
Set limits on how many GPUs a user, team, or project can consume to avoid resource hogging or unexpected costs.
Malicious software embedded within GPU firmware or drivers that can manipulate GPU operations stealthily, evading traditional detection.
A GPU scheduler in Kubernetes assigns GPU resources to containers or pods based on availability and workload needs. It helps ensure the right jobs run on the right GPU nodes efficiently.
The fluctuating market for on-demand GPU pricing, often based on demand/supply cycles across providers.
GPU TCO is the total cost of using GPU infrastructure over time. It includes not just usage fees, but also setup, scaling, maintenance, downtime, and support costs.
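A back-of-the-envelope sketch of the calculation (every dollar figure below is hypothetical, chosen only to show that non-usage costs can be a substantial slice of the total):

```python
# Toy monthly GPU TCO: usage fees plus the costs that are easy to forget.
hours_per_month = 730
costs = {
    "compute": 8 * 2.50 * hours_per_month,  # 8 GPUs at a made-up $2.50/hr
    "storage_and_network": 1200,
    "engineering_ops": 4000,                # setup, scaling, maintenance
    "support_contract": 800,
}
tco = sum(costs.values())
print(f"monthly TCO: ${tco:,.0f}")  # monthly TCO: $20,600
```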
Measures how effectively GPU compute is being used across time. High idle time means wasted spend.
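The metric itself is simple, busy time divided by wall-clock time, as in this sketch (interval values are illustrative):

```python
# Sketch: GPU utilization over a time window, from recorded busy intervals.
def utilization(busy_intervals, window):
    busy = sum(end - start for start, end in busy_intervals)
    return busy / window

# one-hour window; the GPU was busy for three bursts totalling 21 minutes
busy = [(0, 10), (25, 30), (42, 48)]  # minutes
u = utilization(busy, window=60)
print(f"{u:.0%} utilized, {1 - u:.0%} idle spend")  # 35% utilized, 65% idle spend
```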
GPU virtualization lets a single physical GPU be shared across multiple virtual machines or containers. It helps maximize GPU usage by allowing different users or tasks to run on the same hardware.
Enables GPUs to read data directly from storage (like NVMe or parallel file systems), bypassing CPU to reduce latency.
Pre-built containers bundling ML frameworks and GPU drivers to ensure compatibility and portability across cloud environments.
A method that lets you simulate large batch training on limited GPU memory by summing gradients over several smaller batches before updating weights.
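A minimal sketch of the idea (plain Python, single-weight model, all names hypothetical): gradients from several micro-batches are summed, and the weights are updated only once per accumulation cycle.

```python
# Toy gradient accumulation: micro-batches of one sample each, with a
# single optimizer step per 4 accumulated gradients, emulating batch size 4
# on memory that could only hold one sample at a time.

def grad(w, x, y):
    return 2 * (w * x - y) * x   # d/dw of squared error for y_pred = w*x

data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]  # y = 3x
w, accum_steps, lr = 0.0, 4, 0.01

for _ in range(300):
    acc = 0.0
    for x, y in data[:accum_steps]:   # micro-batches
        acc += grad(w, x, y)          # accumulate instead of updating
    w -= lr * (acc / accum_steps)     # one weight update per cycle

print(round(w, 2))  # approaches 3.0
```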
A dedicated hardware device for secure cryptographic key management, sometimes used alongside GPUs for sensitive cloud workloads.
The use of powerful computing resources, including cloud GPUs, to solve complex scientific and engineering problems through parallel processing.
An inference pipeline is a system that runs a trained AI model to make predictions on new data. It handles tasks like input processing, running the model on a GPU, and sending back results, usually in real time.
Inference is the process of using a trained AI model to make predictions on new, unseen data. It’s what happens when the model is put into production to generate real results.
A high-throughput, low-latency networking protocol widely used in HPC and cloud GPU clusters to connect servers and GPUs efficiently.
Kubernetes extensions and plugins that enable efficient scheduling, management, and scaling of GPU resources in containerized environments.
A fine-tuning method that adds lightweight trainable layers to pre-trained models, making them adaptable with minimal GPU usage.
MIG is a technology that splits one powerful GPU into smaller, isolated units. Each unit runs its own task without affecting the others, making it ideal for running multiple jobs at once.
Technique combining lower-precision (FP16/BF16) and FP32 operations to speed up training and reduce memory use.
Model parallelism is a method where a large AI model is split across multiple GPUs, so each GPU handles a different part of the model. It’s used when a model is too big to fit on a single GPU.
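A toy sketch of the split (plain Python; the "layers" are trivial functions and the device assignment is hypothetical): the first layers live on one device, the rest on another, and the activation crosses the device boundary mid-forward-pass.

```python
# Sketch of model parallelism: one model's layers are partitioned across
# two "devices"; activations flow from one partition to the next.

device0_layers = [lambda v: v * 2, lambda v: v + 1]   # lives on "GPU 0"
device1_layers = [lambda v: v * v]                    # lives on "GPU 1"

def forward(x):
    for layer in device0_layers:
        x = layer(x)      # computed on GPU 0
    # activation transferred GPU 0 -> GPU 1 here (e.g. over NVLink)
    for layer in device1_layers:
        x = layer(x)      # computed on GPU 1
    return x

print(forward(3))  # (3*2 + 1)**2 = 49
```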
A group of GPUs networked together to work in parallel, enabling large-scale distributed computing for AI training or HPC.
Keeps workloads from different users or teams securely separated when sharing the same GPU hardware.
NCCL is a library by NVIDIA that helps GPUs talk to each other quickly when training models across multiple GPUs. It handles tasks like syncing data and gradients efficiently during distributed training.
NGC is NVIDIA’s cloud platform that offers ready-to-use GPU containers, AI models, and software tools. It helps developers and researchers run GPU workloads faster without setup hassles.
NVIDIA’s tool to profile and analyze GPU applications, showing performance bottlenecks, memory usage, and kernel behavior.
Describes how memory is accessed in multi-CPU systems; GPU placement relative to CPUs affects latency and performance.
A workhorse GPU built for large-scale AI training, inference, and HPC. Known for its flexibility with MIG, massive memory, and Tensor Core performance.
Entry-level GPU optimized for lightweight AI inference and edge computing tasks. Offers power efficiency and compact form factor.
Balanced for both AI inference and training workloads. Offers multi-instance GPU (MIG) support and is optimized for mainstream data centers.
Purpose-built AI supercomputers integrating multiple GPUs, high-speed interconnects, and optimized software stacks for enterprise AI workloads.
NVIDIA’s flagship GPU for AI and HPC, offering up to 4x faster training and 30x faster inference than the A100. Built on the Hopper architecture.
Next-gen successor to H100, optimized for massive inference throughput and memory bandwidth. Tailored for Gen AI and LLMs.
A universal GPU for video processing, AI inference, and graphics workloads. Successor to the T4, designed for efficient, low-latency performance.
A high-performance data center GPU designed for graphics, AI, and simulation. Successor to RTX A6000, with better inference and ray tracing capabilities.
A pro-level graphics card built for content creation, rendering, and light AI workloads. Positioned for designers and engineers.
Part of the Ada Lovelace generation, this GPU provides enhanced ray tracing, faster inference, and better power efficiency.
A high-end workstation GPU offering top-tier graphics and compute for professional content creation and simulation workloads.
A powerful GPU for AI, data science, and creative professionals. Offers ample VRAM and RT Cores for ray tracing.
NVLink is a high-speed connection that lets GPUs share data with each other faster than traditional PCIe connections. It improves performance in multi-GPU setups.
A GPU metric comparing the number of active threads (or warps) to the hardware maximum; it helps assess how efficiently the GPU is being used.
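The calculation itself is a simple ratio, as in this sketch (the warp counts below are illustrative, not tied to a specific GPU model):

```python
# Sketch of the occupancy calculation: active warps per streaming
# multiprocessor (SM) divided by the hardware maximum.

def occupancy(active_warps_per_sm, max_warps_per_sm):
    return active_warps_per_sm / max_warps_per_sm

# a kernel limited (e.g. by register use) to 32 resident warps
# on an SM that supports 64
print(f"{occupancy(32, 64):.0%}")  # 50%
```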
GPU resources billed per hour with no upfront commitment. Great for flexibility and experimentation.
A flexible pricing model where users pay only for the GPU resources they consume, typically billed by usage time.
Peer GPU communication allows GPUs to share data directly with each other, without going through the CPU. This speeds up multi-GPU tasks like training large AI models.
Attachments or mounts that keep data across sessions or pod restarts in GPU workloads, often used for caching or saving models.
A preemptible GPU is a low-cost GPU instance that can be taken back by the cloud provider at any time. It’s ideal for flexible, non-critical tasks like batch training or testing.
Removes less important weights from a neural network to reduce its size and improve inference speed without major accuracy drop.
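One common variant is magnitude pruning, sketched here in plain Python (the weights and the `prune` helper are hypothetical): the smallest-magnitude fraction of weights is zeroed out.

```python
# Toy magnitude pruning: zero the smallest `fraction` of weights by
# absolute value, producing a sparser network.

def prune(weights, fraction):
    k = int(len(weights) * fraction)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.1, -0.5, 0.05, 0.9, -0.2, 0.3]
print(prune(weights, 0.5))  # [0.0, -0.5, 0.0, 0.9, 0.0, 0.3]
```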
A technique that converts model weights and activations from FP32 to lower precision (like INT8) to reduce size and speed up inference, often with minor accuracy loss.
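A sketch of the simplest scheme, symmetric quantization (plain Python; the weight values are made up): values are mapped into the INT8 range [-127, 127] with a scale derived from the largest magnitude, then mapped back at inference time.

```python
# Toy symmetric INT8 quantization: FP32 values -> small integers -> back.
# Each quantized value fits in one byte instead of four.

def quantize(values):
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -0.51, 0.98, -1.27]
q, scale = quantize(weights)
restored = dequantize(q, scale)
print(q)  # [2, -51, 98, -127]
print([round(r, 3) for r in restored])  # close to the originals
```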
A rendering technique that simulates light behavior for realistic images, accelerated by GPUs in cloud rendering services.
A GPU instance reserved for long-term use (monthly/yearly) at a lower hourly rate.
A pricing model where users commit to long-term GPU usage in exchange for discounted rates, ideal for steady, predictable workloads.
A method to restrict access to cloud GPU resources based on user roles, ensuring secure and compliant operations.
A security exploit that extracts sensitive data by analyzing indirect information such as timing or power consumption from GPUs in shared cloud environments.
Spot GPU instances are unused GPU resources offered at a lower price. They’re great for flexible, non-urgent tasks but can be stopped by the provider at any time.
A Tensor Core is a special type of processing unit inside NVIDIA GPUs designed to handle large matrix operations faster. It’s optimized for deep learning tasks like training and inference.
Tensor Cores are special processing units inside NVIDIA GPUs that speed up AI tasks by handling matrix math faster than regular cores. They’re designed to boost training and inference for deep learning.
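The fused operation a Tensor Core performs in hardware is a matrix multiply-accumulate on a small tile, D = A x B + C. This plain-Python sketch shows the math on a 2x2 tile (real Tensor Cores operate on larger tiles in mixed precision):

```python
# Scalar sketch of a Tensor Core's matrix multiply-accumulate: D = A x B + C.

def tile_mma(A, B, C):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j]
             for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 1], [1, 1]]
print(tile_mma(A, B, C))  # [[20, 23], [44, 51]]
```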
TensorRT is an SDK from NVIDIA that helps make AI models run faster and more efficiently, especially during inference. It works by optimizing models so they use less memory and respond quicker.
The phase where AI models learn from data, requiring intensive GPU computation for matrix operations and optimization algorithms.
An NVIDIA server platform that runs AI inference models on GPUs with multi-framework support, batching, and performance optimization.
A virtual slice of a physical GPU assigned to a VM or container. It allows multiple users to share one GPU efficiently.
A virtual machine is a software-based computer that runs on physical hardware but acts like a real, separate system. It lets you run multiple operating systems and workloads on the same server, and in the cloud it can be configured with GPU resources for isolated, flexible workloads.
A group of 32 threads that execute in parallel on NVIDIA GPUs. Understanding warps helps write efficient CUDA code.
A method used in multi-GPU training to split model states across devices, reducing memory duplication and scaling large models.
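A minimal sketch of the partitioning idea (plain Python; the parameter names and round-robin assignment are illustrative): per-parameter state is divided across devices instead of replicated on every one, so each device holds only a fraction of it.

```python
# Toy state sharding: optimizer state for 6 parameters split across 3
# "devices". Under plain data parallelism every device would hold all 6.

params = ["w0", "w1", "w2", "w3", "w4", "w5"]
num_devices = 3

def shard(params, num_devices):
    # round-robin assignment of per-parameter state to devices
    shards = [[] for _ in range(num_devices)]
    for i, p in enumerate(params):
        shards[i % num_devices].append(p)
    return shards

shards = shard(params, num_devices)
print(shards)  # each device stores 1/3 of the state
```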