Prefill-Decode Disaggregation on NVIDIA B200: Does It Lower Cost per Token?

Jason Karlin

Last Updated: Jun 23, 2026


Prefill and decode are the two phases of LLM inference. Prefill processes all of your input tokens in a single parallel pass and is largely compute-bound. Decode generates output tokens one at a time and is largely memory-bandwidth-bound.

These opposite resource profiles follow directly from how each phase handles tokens. Running prefill and decode on the same GPU pool can increase cost per useful token when phase interference, queueing, poor batching, KV-cache pressure or low GPU utilization appear under real traffic. An NVIDIA B200-class system brings high Tensor Core throughput, high HBM bandwidth and large HBM capacity, making prefill/decode disaggregation technically attractive for large, latency-sensitive LLM serving. However, it is not automatically cheaper; KV-cache transfer, duplicated model replicas and pool-utilization balance must be benchmarked.

Here we explore how splitting these phases across dedicated B200 pools impacts your cost per token.

Prefill and Decode: The Two Phases of LLM Inference

In the context of LLMs, inference is the process that takes place when a user submits a prompt and the model generates a response based on the input context. This process is divided into two main phases: prefill and decode.

In the prefill stage (input processing), the model converts the complete input prompt into tokens and processes them in parallel in one forward pass. As seen below, all tokens are attended to simultaneously and the resulting Key-Value pairs (KV cache) are computed and stored.

Because the whole prompt goes through a single parallel forward pass, prefill makes heavy use of GPU compute. Its large, concurrent matrix operations keep the GPU’s processing units busy.

Figure 1: Prefill vs. Decode [Image Source]

In the decode phase (output generation), the output tokens are generated one at a time. The token generation is done autoregressively, hence each token depends on the previously generated tokens. In normal serving, the decode stage reuses the KV cache from prefill so past prompt tokens do not need to be recomputed. If KV cache is evicted, lost or not transferred correctly, recomputation or request failure can occur depending on the serving system.

Per token, decode is significantly slower than prefill because tokens are generated sequentially. At small batch sizes, this leaves much of the GPU’s compute under-utilized. The decode stage continues until the stop criterion is met (i.e., generation of a special end token or reaching a maximum number of tokens).

While prefill is more compute-intensive, the decode stage is more memory-bandwidth intensive. Processing all input tokens in parallel saturates the GPU’s compute units, whereas decode must re-read the model weights from high-bandwidth memory (HBM) at every decode step while producing only one new token per sequence.

Factor	Prefill	Decode
Parallelism	High (all prompt tokens at once)	Low per request (one token at a time; batching across requests adds parallelism)
Latency metric	Time to First Token (TTFT)	Time per Output Token (TPOT)
Bottleneck	Compute	Memory bandwidth
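
The contrast in the table can be made concrete with a rough roofline-style estimate. The sketch below is a back-of-the-envelope calculation, not a benchmark: it assumes a hypothetical 70B-parameter dense model with FP8 weights, about 2 FLOPs per parameter per token, and the peak per-GPU figures quoted later in this article. It ignores attention FLOPs, KV-cache reads and kernel overheads.

```python
def prefill_time_s(params: float, prompt_tokens: int, peak_flops: float) -> float:
    """Compute-bound lower bound: roughly 2 FLOPs per parameter per token."""
    return 2.0 * params * prompt_tokens / peak_flops


def decode_step_time_s(params: float, bytes_per_param: float,
                       hbm_bytes_per_s: float) -> float:
    """Bandwidth-bound lower bound: every weight is read once per decode step."""
    return params * bytes_per_param / hbm_bytes_per_s


# Illustrative assumptions, not measurements.
PARAMS = 70e9             # hypothetical dense model size
PEAK_FP8_FLOPS = 4.5e15   # dense FP8 peak per GPU, as quoted in the article
HBM_BANDWIDTH = 8.0e12    # bytes per second, as quoted in the article

prefill_s = prefill_time_s(PARAMS, prompt_tokens=4096, peak_flops=PEAK_FP8_FLOPS)
decode_s = decode_step_time_s(PARAMS, bytes_per_param=1.0,
                              hbm_bytes_per_s=HBM_BANDWIDTH)

print(f"prefill, 4096 tokens: {prefill_s * 1e3:.1f} ms")
print(f"decode, one step:     {decode_s * 1e3:.2f} ms")
```

Under these assumptions, the whole 4,096-token prefill takes roughly 127 ms (about 0.03 ms per prompt token), while a single decode step takes about 8.75 ms no matter how few tokens it produces. That gap is why decode throughput depends so heavily on batching many requests into each step.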

How Unified GPU Serving Increases Cost Per Token

As seen in the earlier section, the prefill and decode phases have conflicting resource needs on the same GPU pool. In a unified architecture, the same GPU pool handles prefill and decode for every request.

Unified LLM serving can create phase interference, head-of-line blocking, batching compromises, HBM contention and uneven utilization, all of which can increase latency and cost per useful token. Underutilized GPUs still cost the same per hour, so the inefficiency shows up as a higher cost per token.

Here are the main reasons why unified LLM serving can be inefficient:

Compute vs Memory Bandwidth Mismatch

The prefill stage focuses on high-throughput computing whereas the decode stage is latency sensitive since tokens stream on the screen one at a time. A TPOT/ITL increase can be visible to users, but the acceptable threshold depends on product SLO, model type, output length and streaming UI design.

Prefill uses the GPU for large-scale parallel computation, while decode needs HBM capacity for the growing KV cache and HBM bandwidth to stream model weights at every step. On a shared GPU, neither stage gets the configuration it needs, so the hardware ends up under-utilized in both stages.

Head-Of-Line blocking (HOL blocking)

Although decode needs less compute, long prefill requests can monopolize the GPU pool and force decode requests to queue behind them. This is called Head-Of-Line (HOL) blocking.

The result is unpredictable latency spikes that not only degrade the user experience but also push up the cost per token.
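
The effect can be illustrated with a toy first-in, first-out queue. This is a deliberately simplified sketch: the job names and durations are hypothetical, and real serving engines batch decode steps and can interleave work, so it only shows the ordering problem, not actual scheduler behavior.

```python
def completion_times(jobs):
    """Run jobs strictly one after another (FIFO); return finish times in ms."""
    clock = 0.0
    finished = {}
    for name, duration_ms in jobs:
        clock += duration_ms
        finished[name] = clock
    return finished


# Hypothetical durations: one long prefill arrives just before two decode steps.
shared_pool = completion_times([
    ("long_prefill", 400.0),
    ("decode_step_a", 10.0),
    ("decode_step_b", 10.0),
])

# The same decode steps on a pool that never runs prefill work.
decode_pool = completion_times([
    ("decode_step_a", 10.0),
    ("decode_step_b", 10.0),
])

print(shared_pool["decode_step_a"], decode_pool["decode_step_a"])
```

In the shared queue, the first decode step finishes at 410 ms instead of 10 ms, even though it only needs 10 ms of GPU time. Users see that delay as a stall in the token stream.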

Figure 2: Basic framework of Prefill-Decode separation [Image Source]

Prefill-decode disaggregation can reduce prefill/decode interference and HOL blocking by running the phases on separate workers or GPU pools, but it introduces KV-cache transfer, routing and duplicated-capacity overhead.

KV Cache Capacity Usage

The prefill and decode stages also share the same HBM. Prefill writes large KV caches, whereas long decode sessions keep them resident. Because both stages compete for the same HBM, failures such as evictions, Out Of Memory (OOM) errors and forced batch-size reductions can follow.

If the KV cache needed by an active request is evicted or unavailable, the system may need recomputation, reload, transfer from another tier, or request failure handling depending on the serving engine. Recomputation also consumes extra compute. In a unified GPU serving setup, these costs are passed directly into the cost per token.
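
To see why HBM pressure builds quickly, it helps to estimate KV-cache size per request. The sketch below uses the standard per-token accounting (one key and one value vector per layer and per KV head). The model shape (80 layers, 8 KV heads, head dimension 128, FP16 values) and the 100 GB HBM budget are illustrative assumptions, not figures for any specific deployment.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical grouped-query-attention model shape (assumption).
SHAPE = dict(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes(1, **SHAPE)
per_sequence = kv_cache_bytes(32_768, **SHAPE)

# Hypothetical HBM left for KV cache after weights and activations.
hbm_budget_bytes = 100 * 10**9
max_sequences = hbm_budget_bytes // per_sequence

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_sequence / 2**30:.0f} GiB per 32k-token sequence")
print(f"{max_sequences} such sequences fit in the budget")
```

With these assumptions, each token costs 320 KiB, a 32k-token context costs 10 GiB, and only nine such contexts fit in a 100 GB budget. Any prefill burst that needs more than that forces eviction, queuing or smaller batches.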

Batching Inefficiency

The same GPU pool handling prefill and decode requests also has implications for batching. While the prefill stage thrives on large token batches to maximize parallel execution, the decode phase demands seamless continuous batching for its sequential, single-token generation steps.

In a unified GPU setup, batching strategies must balance the competing requirements of prefill and decode workloads, making it difficult to fully optimize for either stage. Suboptimal batching means lower throughput per GPU, which means higher cost per token.
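
The link between throughput and cost per token is simple arithmetic. The sketch below assumes a hypothetical GPU price of $5 per hour and hypothetical sustained throughputs; none of these numbers are quotes or benchmark results.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_s_per_gpu: float) -> float:
    """USD per one million generated tokens at a sustained per-GPU throughput."""
    tokens_per_hour = tokens_per_s_per_gpu * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


GPU_HOURLY_USD = 5.0  # hypothetical price, not a quote

well_batched = cost_per_million_tokens(GPU_HOURLY_USD, tokens_per_s_per_gpu=2000)
poorly_batched = cost_per_million_tokens(GPU_HOURLY_USD, tokens_per_s_per_gpu=1000)

print(f"well batched:   ${well_batched:.2f} per 1M tokens")
print(f"poorly batched: ${poorly_batched:.2f} per 1M tokens")
```

Halving sustained throughput doubles the cost per million tokens (about $0.69 versus $1.39 here), because the hourly GPU price does not change with utilization.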

SLO (Service Level Objective) Conflicts

The two most important SLOs in LLM serving are:

TTFT SLO: Maximum acceptable time before the first token appears (prefill latency)
TPOT SLO: Maximum acceptable time between each generated token (decode latency)

To improve TTFT, prefill needs to be prioritized whereas decode has to be prioritized to improve the TPOT metric. Satisfying TTFT and TPOT/ITL on the same GPU can be difficult under mixed traffic because optimizing one can hurt the other. However, modern schedulers with chunked prefill and continuous batching can reduce the conflict.

In contrast to unified GPU serving, disaggregated serving allows prefill and decode SLOs to be met independently as each pool can be tuned and scaled for its own latency targets without conflicting with the other.
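
Both SLO metrics can be derived from per-request token timestamps. The following is a minimal sketch: the function names, the example timestamps and the SLO thresholds (0.5 s TTFT, 50 ms TPOT) are hypothetical and should be replaced with values from your own product requirements.

```python
def latency_metrics(request_sent_s: float, token_times_s: list[float]):
    """Return (TTFT, TPOT) in seconds from the arrival time of each output token."""
    ttft = token_times_s[0] - request_sent_s
    if len(token_times_s) > 1:
        # Average gap between consecutive tokens after the first one.
        tpot = (token_times_s[-1] - token_times_s[0]) / (len(token_times_s) - 1)
    else:
        tpot = 0.0
    return ttft, tpot


def meets_slo(ttft: float, tpot: float,
              ttft_slo_s: float = 0.5, tpot_slo_s: float = 0.05) -> bool:
    """Check one request against hypothetical TTFT and TPOT targets."""
    return ttft <= ttft_slo_s and tpot <= tpot_slo_s


# Example: request sent at t=0, five tokens streamed back.
ttft, tpot = latency_metrics(0.0, [0.30, 0.34, 0.38, 0.42, 0.46])
print(f"TTFT={ttft:.2f}s TPOT={tpot * 1e3:.0f}ms ok={meets_slo(ttft, tpot)}")
```

Tracking the two values separately makes it clear which pool to scale when an SLO is missed: prefill capacity for TTFT, decode capacity for TPOT.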

Why NVIDIA B200 GPUs Improve Inference Efficiency

The prefill stage saturates the GPU’s Tensor Cores with dense matrix math, while decode leaves most of the GPU’s compute capacity idle waiting on HBM reads. The NVIDIA B200 GPU, built on the Blackwell architecture, helps accelerate both phases of LLM inference.

The B200 offers 2.25 PFLOPS of dense FP16/BF16 compute and 4.5 PFLOPS of dense FP8 performance per GPU, with FP4 providing an additional boost in throughput for supported workloads.

The GPU’s HBM3e memory bandwidth (up to 8.0 TB/s per GPU, with about 7.7 TB/s quoted for standard HGX form factors) accelerates memory-bound decode. Combined with the compute figures above, this enables faster prompt processing and more efficient token generation.

HGX/DGX B200-class systems commonly expose about 180 GB HBM3e per GPU and around 7.7–8.0 TB/s memory bandwidth, depending on SKU and measurement. For many models, this reduces or removes the need to offload model weights and KV cache data to CPU memory during LLM inference. The NVLink 5.0 interconnect at 1.8 TB/s per GPU also enables faster KV cache sharing across GPUs, making multi-GPU decode more efficient.

Figure 3: KV Cache Eviction [Image created using AI]

The larger HBM capacity minimizes the likelihood of KV cache eviction, a costly failure mode where a KV cache entry is evicted from HBM mid-request and the prefill for that context has to be recomputed from scratch.

Renting an NVIDIA B200 on demand on AceCloud provides access to all of these capabilities, purpose-built for LLM inference and real-time serving.

Benefits of Separate B200 Pools For Prefill And Decode

Unified serving can be inefficient for some large, mixed and latency-sensitive workloads, but it is not always wrong. For small models, low concurrency, short prompts or optimized chunked-prefill schedulers, unified serving may still be simpler and cheaper.

For workloads where unified serving does fall short, separate GPU pools let you optimize each stage for its own resource demands (i.e., compute for prefill and HBM for decode).

Here are some of the salient benefits that separate B200 pools offer to handle prefill and decode with utmost efficiency:

Eliminate Resource Contention

In unified GPU serving, a scheduling decision that benefits one stage often hurts the other. Separate GPU pools remove this conflict, since each pool can be tuned, sized, and scheduled independently.

With this change, compute-bound matrix-matrix operations in prefill run on their dedicated B200 GPU pool. In parallel, the memory-bandwidth-bound matrix-vector operations in decode run on a separate B200 GPU pool.

Execution on the designated B200 pools eliminates the resource contention issue, ensuring that neither stage competes for the other’s critical resources.

FP4 Precision to Prefill Pool

Built on the Blackwell architecture, the NVIDIA B200 has native, hardware-accelerated Tensor Core support for 4-bit floating point (FP4). FP4 roughly doubles peak compute throughput over FP8 for prefill, enabling faster parallel token processing. This helps accelerate inference for models that retain accuracy at FP4 precision.

Since decode is memory-bandwidth-bound, it benefits far less from FP4’s compute gains, although 4-bit weights do reduce the bytes streamed from HBM at each step. Disaggregated serving lets the prefill pool exploit FP4 compute while the decode pool is configured independently, so neither stage constrains the other.

Lower Costs Per Request

Large models (e.g., LLaMA 3 405B, whose weights take ~810 GB in FP16) must be split across multiple GPUs. Whether a model needs ~280 GB, ~810 GB or several TB, it still requires multiple B200 GPUs, but fewer than on the previous-generation H100 with 80 GB of HBM per GPU.

Figure 4: B200 for serving a LLaMA 3 405B model in FP16 [Image created using AI]

The B200’s large HBM3e capacity (up to 192 GB per GPU, with about 180 GB exposed in HGX/DGX systems) not only means fewer GPUs are required to fit the model but also helps reduce NVLink hops and inter-GPU communication overhead.

For decode, the larger HBM keeps long-context KV caches resident. This reduces cache evictions and the costly prefill recomputation they trigger. As seen above, fewer GPUs per replica reduce infrastructure costs, which translates directly into a lower cost per token.
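
The GPUs-per-replica argument can be checked with simple arithmetic. The sketch below uses the ~810 GB FP16 weight size and the ~180 GB per-GPU capacity mentioned in this article, plus an 80 GB GPU (such as the H100) for comparison. The 20% of HBM reserved for KV cache and activations is an assumption, and real deployments usually round up further to a parallelism-friendly GPU count such as 8 or 16.

```python
import math


def gpus_for_weights(weight_gb: float, hbm_gb_per_gpu: float,
                     reserved_fraction: float = 0.2) -> int:
    """Minimum GPUs whose usable HBM can hold the model weights.

    reserved_fraction is the share of HBM kept free for KV cache and
    activations (an illustrative assumption).
    """
    usable_gb = hbm_gb_per_gpu * (1.0 - reserved_fraction)
    return math.ceil(weight_gb / usable_gb)


WEIGHTS_GB = 810  # ~405B parameters at 2 bytes each (FP16)

gpus_180gb = gpus_for_weights(WEIGHTS_GB, hbm_gb_per_gpu=180)
gpus_80gb = gpus_for_weights(WEIGHTS_GB, hbm_gb_per_gpu=80)

print(f"180 GB GPUs needed: {gpus_180gb}")
print(f" 80 GB GPUs needed: {gpus_80gb}")
```

With these assumptions, the weights fit on 6 GPUs at 180 GB each versus 13 GPUs at 80 GB each. Fewer GPUs per replica means fewer devices to pay for and fewer inter-GPU hops on every forward pass.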

Independent GPU Autoscaling

In unified GPU serving, everything scales together regardless of which phase is under load. With separate pools:

Prefill pool: Scales with prompt volume and average prompt length, directly controlling TTFT
Decode pool: Scales with output token demand and concurrent session count, directly controlling TPOT

This reduces GPU idling and over-provisioning, two factors that directly inflate infrastructure costs in a unified setup.
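
A first-order sizing rule for each pool divides demand by per-GPU throughput at a target utilization. The sketch below is illustrative only: the demand figures, per-GPU throughputs and the 70% utilization target are hypothetical and must come from your own benchmarks.

```python
import math


def pool_size(demand_tokens_per_s: float, per_gpu_tokens_per_s: float,
              target_utilization: float = 0.7) -> int:
    """GPUs needed to serve the demand while staying under a utilization target."""
    effective = per_gpu_tokens_per_s * target_utilization
    return max(1, math.ceil(demand_tokens_per_s / effective))


# Hypothetical steady-state traffic and benchmarked per-GPU throughputs.
prefill_gpus = pool_size(demand_tokens_per_s=200_000, per_gpu_tokens_per_s=40_000)
decode_gpus = pool_size(demand_tokens_per_s=30_000, per_gpu_tokens_per_s=5_000)

# Prompt volume doubles while output demand stays flat:
# only the prefill pool needs to grow.
prefill_gpus_surge = pool_size(demand_tokens_per_s=400_000,
                               per_gpu_tokens_per_s=40_000)

print(prefill_gpus, decode_gpus, prefill_gpus_surge)
```

In this example, the prefill pool grows from 8 to 15 GPUs during the surge while the decode pool stays at 9. A unified pool would have to scale every GPU for the combined peak.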

Infrastructure Considerations for Disaggregated LLM Serving Platforms

Disaggregated LLM serving is an inference architecture where the prefill and decode stages run on separate, dedicated GPU pools. Running them on dedicated B200 GPU pools removes the direct contention that arises when both stages compete for the same hardware, in exchange for the KV-cache transfer and routing overhead noted earlier.

Figure 5: Prefill-Decode Disaggregation [Image created using AI]

A standard cloud setup might not meet the needs of disaggregated LLM serving. It requires purpose-built GPU cloud infrastructure that provides:

Independent auto-scaling per pool
Dedicated observability (or monitoring) tools to track TTFT, TPOT, KV cache eviction rates, etc.
Independent latency diagnostics

Here are some of the factors that should be considered when shortlisting disaggregated LLM serving platforms:

Low-Latency KV Cache Handoff

KV caches generated during the prefill phase must be transferred to the decode GPUs before token generation can begin. A high-bandwidth, low-latency network is therefore essential so that slow cache transfers do not add to response times.
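
A quick estimate shows how much the link matters. The sketch below assumes a hypothetical 10 GB KV cache for one long-context request, a link efficiency of 80%, and two illustrative links: 900 GB/s (one direction of a 1.8 TB/s NVLink figure, an assumption) and 50 GB/s (a 400 Gb/s network link). It ignores per-message latency and any overlap of transfer with compute.

```python
def kv_transfer_ms(kv_bytes: float, link_bytes_per_s: float,
                   link_efficiency: float = 0.8) -> float:
    """Time to move a KV cache over a link, in milliseconds (bandwidth term only)."""
    return kv_bytes / (link_bytes_per_s * link_efficiency) * 1e3


KV_BYTES = 10e9  # hypothetical KV cache size for one long-context request

nvlink_ms = kv_transfer_ms(KV_BYTES, link_bytes_per_s=900e9)
network_ms = kv_transfer_ms(KV_BYTES, link_bytes_per_s=50e9)

print(f"NVLink-class link:  {nvlink_ms:.1f} ms")
print(f"400 Gb/s network:   {network_ms:.1f} ms")
```

Under these assumptions, the handoff adds about 14 ms within an NVLink domain and about 250 ms over the network link. The second figure lands directly on TTFT, which is why the placement of prefill and decode pools relative to each other deserves benchmarking.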

Independent B200 Pools

Independent B200 GPU pools isolate the prefill and decode stages, so each can scale with its own demand.

When long input contexts surge, the prefill pool can expand on its own to maintain TTFT, while underutilized decode capacity can shrink during off-peak hours to reduce infrastructure overhead. Together, this prevents wasted resources and keeps GPU utilization high in both pools.

End-to-End Observability

Cloud GPU platforms like AceCloud not only provide GPUs on demand but also offer visibility into key metrics that directly influence inference performance and cost efficiency.

Users should use the dashboards provided by disaggregated LLM serving platforms to monitor vital metrics such as TTFT, TPOT, Inter-Token Latency (ITL), KV cache transfer latency, and GPU utilization. These insights provide visibility into system performance and help teams quickly identify and resolve bottlenecks.

Fast Model Access

Both prefill and decode pools require rapid access to model weights to support efficient inference and elastic scaling. In disaggregated serving, fast model loading can have a significant impact on how quickly scale-up events take effect.

High-performance storage infrastructure by AceCloud ensures that model weights are loaded quickly across B200 pools, thereby minimizing cold-start overhead and maintaining consistent inference performance.

Conclusion

Prefill and decode are two important phases in LLM inference. For large, latency-sensitive and mixed prompt/output workloads, prefill-decode disaggregation can be considered so both phases do not compete for the same GPU resources. For smaller or simpler workloads, unified serving may remain cheaper and easier to operate.

A GPU-on-demand provider like AceCloud makes disaggregated serving accessible by providing dedicated B200 pools, ensuring each stage gets the right hardware without the overhead of managing fixed infrastructure.

Jason Karlin

author

Industry veteran with over 10 years of experience architecting and managing GPU-powered cloud solutions. Specializes in enabling scalable AI/ML and HPC workloads for enterprise and research applications. Former lead solutions architect for top-tier cloud providers and startups in the AI infrastructure space.