GPU Time-slicing vs. Passthrough

Jason Karlin
Last Updated: Sep 11, 2025

Graphics Processing Units (GPUs) have become indispensable for enterprise-grade systems handling heavy computational workloads. However, enterprises often end up over-provisioning GPU hardware, leaving processing power unused. This is especially true when running applications that do not fully utilize the available memory or compute capability of the GPU cluster.

To address this issue, enterprises can implement various sharing methods, such as Multi-Process Service (MPS), Multi-Instance GPU (MIG), and GPU time-slicing, to share idle GPU resources among multiple processes. Such approaches increase GPU utilization, minimize infrastructure costs, and enhance productivity.

In this article, we explore time-slicing and the benefits it brings to concurrent GPU workloads.

What is GPU Time-Slicing? 

Time-slicing is a GPU-sharing technique designed to manage oversubscription efficiently. It builds on CUDA's time-slicing scheduler, enabling oversubscribed workloads to interleave on the GPU and take turns using its resources.

System administrators can enable GPU time-slicing through device plugins that advertise virtual "replicas" of each installed GPU. A time-slicing scheduler is initiated at the same time, so that multiple users/processes access GPU resources sequentially in a fair-sharing manner, with the scheduler switching between processes at fixed time intervals. Each replica thus takes turns independently executing its mapped workloads on the same physical GPU.
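On Kubernetes, for example, time-slicing is typically enabled through the NVIDIA device plugin. The following ConfigMap is a minimal sketch based on NVIDIA's documented format; the ConfigMap name and namespace are illustrative and would need to match your cluster's GPU Operator deployment:

```yaml
# Illustrative time-slicing config for the NVIDIA k8s-device-plugin.
# With replicas: 4, each physical GPU is advertised to Kubernetes as
# four nvidia.com/gpu resources that share the device via time-slicing.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config   # hypothetical name
  namespace: gpu-operator     # assumes the GPU Operator's namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

Note that these replicas provide no memory or fault isolation; they simply multiply how many workloads the scheduler will place on one GPU.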


As users/processes make requests, the time-sliced GPU provides shared access. However, time-slicing offers no resource guarantees: requesting a GPU time-slice repeatedly does not ensure receipt of an equivalent share of the GPU each time.

How does GPU Time-Slicing differ from GPU Passthrough?

Using GPUs in a virtualized environment generally incurs performance degradation. GPU Passthrough is a virtualization technique that avoids this by giving a virtual machine exclusive access to a physical GPU, i.e. the physical GPU is mapped one-to-one to the VM. This removes the hypervisor's intermediary software stack from the GPU path, letting the guest drive the GPU with its native driver and achieve near-native performance without added latency, even for high-end workloads.
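As a hedged sketch of what passthrough can look like in practice, a KubeVirt VirtualMachine can claim exclusive use of a host GPU roughly as follows. The deviceName below is an assumption; it must match the resource name the host advertises for the passthrough device, and the host must have the GPU bound to VFIO and permitted in the KubeVirt configuration:

```yaml
# Illustrative KubeVirt VirtualMachine fragment for GPU passthrough.
# The VM receives the whole physical GPU one-to-one; no other VM or
# container can use the device while this VM holds it.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: gpu-passthrough-vm    # hypothetical name
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          gpus:
          - name: gpu1
            deviceName: nvidia.com/GP102GL_TESLA_P40   # example device name
        resources:
          requests:
            memory: 8Gi
```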

GPU time-slicing, on the other hand, shares a single GPU among multiple VMs/workloads to maximize resource utilization and minimize the cost of running complex compute-intensive projects. Individual workloads/containers can each request a slice of GPU time, and dividing the GPU among containers as required allows enterprises to put idle GPU resources to work, thereby reducing the number of nodes in a compute cluster.
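Once replicas are advertised (as in the earlier ConfigMap sketch), a container requests a slice exactly as it would request a whole GPU. A minimal, illustrative pod spec follows; the pod name and image tag are assumptions:

```yaml
# Illustrative pod requesting one time-sliced GPU replica.
# From the container's perspective this looks like a full GPU, but
# other pods holding replicas of the same device share it in turns.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-workload               # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: app
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # example image
    command: ["nvidia-smi", "-L"]   # prints the GPU this slice maps to
    resources:
      limits:
        nvidia.com/gpu: 1           # one replica, not one exclusive GPU
```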

By default, GPUs fall back to time-slicing, i.e. serializing compute requests coming from separate workloads running concurrently. Simply put, when two or more processes/applications run on the same GPU, the CUDA/ROCm scheduler alternates between them, yielding each application the entirety of the GPU's resources for one time slice before giving way to the next in round-robin fashion.

[Figure] GPU Time-slicing vs. Baseline vs. MPS: average turnaround time and average utilization on various MLPerf models (Source)

Note that these compute requests are serialized, never co-located (i.e. never executed on the GPU simultaneously). Consequently, aggregate throughput can only approach, not exceed, that of serial execution.

What are the Benefits of GPU Time-Slicing?

Time-slicing is the default method of application concurrency on GPUs. Whereas other GPU concurrency methods, such as MIG and MPS, allow only a limited number of GPU partitions or clients, time-slicing lets an unlimited number of processes run on the same GPU hardware.


Secondly, unlike MIG, where the GPU is partitioned into multiple fully independent, isolated instances of fixed compute and memory size, time-slicing is inherently flexible: the entirety of the GPU's compute and memory resources is handed from one process to the next. There is no manufacturer-imposed, unalterable partition size, nor the accompanying requirement to squeeze processes into the memory available on a given MIG partition.

Third, time-slicing is available on all GPUs regardless of generation or architecture. This lets enterprises use less expensive, earlier-generation GPUs, bringing cost-effectiveness to projects where GPU acceleration is desirable but expenses are a critical concern. MIG, on the other hand, is available only on newer architectures, starting with Ampere (e.g., A30, A100, H100).

What are the Disadvantages of GPU Time-Slicing? 

Given the time overhead involved in context-switching, GPU time-slicing is not suitable for latency-sensitive workloads. Additionally, contention arising from memory transfers can adversely affect turnaround time and predictability. In short, time-slicing significantly improves GPU utilization compared with GPU passthrough, but reduces per-application performance because of the latency incurred when switching between processes.

Secondly, an interleaved process may not use the entirety of GPU resources available during its time slice (CUDA cores, register memory, shared memory, etc.). Those unused GPU resources remain idle for the duration of that slice.

Third, time-slicing is awkward with Kubernetes containers out of the box, because the Kubernetes API treats a GPU as a discrete, integer resource that cannot be oversubscribed. This leads to bottlenecked processes and under-utilized GPUs when running parallel containerized workloads on the same hardware. To remediate this drawback, however, both NVIDIA and AMD provide device-plugin solutions that manage workload concurrency via time-slicing.
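With NVIDIA's GPU Operator, for instance, the time-slicing ConfigMap sketched earlier is activated by pointing the cluster policy's device-plugin section at it. A hedged fragment, assuming the names used in the earlier sketch:

```yaml
# Illustrative ClusterPolicy fragment (NVIDIA GPU Operator).
# devicePlugin.config.name references the time-slicing ConfigMap;
# "default" selects which key inside that ConfigMap applies by default.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  devicePlugin:
    config:
      name: time-slicing-config   # ConfigMap from the earlier sketch
      default: any
```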

Lastly, a key facet of time-slicing is that concurrent processes may only be launched when their combined resource requirement, memory above all, fits within what the GPU provides. Even though the processes are never executed simultaneously, time-slicing must adhere to this constraint; processes that begin execution after the GPU's memory limit has been reached tend to crash with out-of-memory errors.

Conclusion

As GPU usage intensifies across industries, virtualization, parallelism, and concurrency-handling methodologies will take even more of a center stage. Optimizing simultaneous process handling, maximizing GPU utilization, and accelerating workload execution are all facets of the same goal. In the race to extract the most out of GPU resources and achieve business objectives cost-efficiently, AceCloud is your trusted partner. Connect with our team now and let us awe you with our enthusiasm for GPUs!

Jason Karlin
Author
Industry veteran with over 10 years of experience architecting and managing GPU-powered cloud solutions. Specializes in enabling scalable AI/ML and HPC workloads for enterprise and research applications. Former lead solutions architect for top-tier cloud providers and startups in the AI infrastructure space.
