
NVIDIA Blackwell Ultra Explained: B300, GB300 NVL72, and Reasoning AI

Jason Karlin
Last Updated: Feb 27, 2026
12 Minute Read

Reasoning models, agentic workflows, and long-context inference do not just want faster answers. They want more thinking, more tokens, and more iterations. NVIDIA’s Blackwell Ultra datasheet states that inference-time scaling, sometimes called long thinking, can demand up to 100x more compute than traditional one-shot inference.

That swing in demand lands in the middle of a historic spending cycle.

  • Gartner forecast global AI spending at about $2.02 trillion in 2026, with AI processing semiconductors projected at about $267.9 billion.
  • Omdia projected the AI data center chip market expanding toward $286 billion by 2030, while noting that AI infrastructure spending is expected to peak as a share of overall data center spending in 2026.

This is precisely the context for the Blackwell Ultra, NVIDIA’s 2025 evolution of Blackwell aimed squarely at AI reasoning. Together, the B300 GPU and the GB300 rack-scale platform define how NVIDIA expects AI infrastructure to look through 2026.

The trajectory points toward fewer isolated GPU servers and more tightly integrated AI factories that treat memory, networking, power, and software as one system. Let’s take a closer look at NVIDIA Blackwell Ultra.

What is NVIDIA Blackwell Ultra?

NVIDIA introduced Blackwell Ultra at GTC in March 2025 as the next step in the Blackwell AI factory platform. Blackwell Ultra is built to boost both training and test-time scaling inference for reasoning, agentic AI, and physical AI use cases.

Blackwell Ultra shows up in two main shapes:

  • B300: The Blackwell Ultra GPU that appears in enterprise and hyperscale systems, including HGX B300 and DGX B300 configurations.
  • GB300: The Grace Blackwell Ultra platform that pairs NVIDIA Grace CPUs with Blackwell Ultra GPUs and scales up into the GB300 NVL72 rack.

Why Did NVIDIA Launch the Blackwell ‘Ultra’?

The typical upgrade story focuses on raw throughput. Blackwell Ultra’s story is about inference no longer being a single pass: it is search, verification, tool use, and branching. That means:

  1. KV cache gets bigger with long context and multi-step thinking (a sizing sketch follows this list).
  2. Interconnect matters more as models and serving stacks spread across more GPUs.
  3. Power delivery and cooling become first order design constraints because dense racks do not behave like traditional CPU clusters.
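To put rough numbers on the KV cache point, here is a minimal sizing sketch in Python. All model dimensions are hypothetical placeholders for a large dense transformer with grouped-query attention, not specs for any particular model or NVIDIA product.

```python
# Back-of-the-envelope KV cache sizing for a dense transformer.
# All model dimensions below are hypothetical placeholders, not
# published specs for any particular model or NVIDIA product.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Two tensors (K and V) per layer, each [batch, kv_heads, seq_len, head_dim]."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Example: a 70B-class dense model with grouped-query attention,
# served in FP16/BF16 (2 bytes per element).
layers, kv_heads, head_dim = 80, 8, 128
for seq_len in (8_192, 128_000):
    gib = kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB of KV cache per sequence")
```

At 128k tokens, the cache for a single sequence is an order of magnitude larger than at 8k, and it scales linearly with concurrent sessions, which is exactly the pressure that 288GB-class HBM is meant to absorb.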

NVIDIA’s Blackwell Ultra datasheet ties the hardware to the workload: it describes inference-time scaling as a third scaling dimension alongside pretraining and post-training. The datasheet also highlights memory capacity for expansive KV caching and long-context inference without offloading.

What is NVIDIA B300?

The B300 is best understood as the Blackwell Ultra GPU that anchors standardized server platforms, from HGX baseboards to DGX systems and partner designs. Exact SKU details vary, but NVIDIA describes the full Blackwell Ultra GPU implementation as containing up to 160 streaming multiprocessors and up to 288GB of HBM3E. Here are the architectural themes that matter in 2026:

Dual die design and faster on package communication

The Blackwell Ultra chip diagram NVIDIA published shows two reticle-sized dies linked by an on-package interface rated at 10 TB/s, which matters because modern models can be sensitive to cross-die latency and bandwidth.

Transformer Engine and 4-bit inference formats

Blackwell Ultra continues the move toward ultra-low-precision inference, with support for NVFP4 delivered via fifth-generation Tensor Cores and a second-generation Transformer Engine, per NVIDIA’s description.

Memory and bandwidth

NVIDIA highlights 288GB HBM3E and up to 8 TB/s memory bandwidth in the chip-level diagram callouts. Larger, faster HBM is one of the most direct enablers of long-context inference because it reduces the need to spill KV cache into slower tiers.
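For intuition on why the bandwidth number matters as much as capacity: autoregressive decode tends to be memory-bandwidth-bound, because each generated token must stream the active weights (plus KV cache) through the GPU. Here is a crude upper-bound sketch, assuming a hypothetical 70B-parameter model with 4-bit weights:

```python
# Crude upper bound on single-sequence decode speed when decode is
# memory-bandwidth-bound: each token requires one full read of the weights.
# All inputs are illustrative assumptions, not measured figures.

hbm_bandwidth_bytes_per_s = 8e12   # the 8 TB/s chip-level figure NVIDIA cites
weight_bytes = 70e9 * 0.5          # hypothetical 70B params at 4 bits (0.5 bytes each)

ceiling = hbm_bandwidth_bytes_per_s / weight_bytes
print(f"~{ceiling:.0f} tokens/s per sequence, before KV reads and overheads")
```

Real systems land well below that ceiling once attention, KV cache traffic, batching, and communication enter the picture, but the proportionality is why a bandwidth bump shows up directly in tokens per second.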

Interconnect for scaling within a node and across a rack

The same Blackwell Ultra diagram calls out:

  • PCIe Gen 6 with 256 GB/s
  • NVLink v5 with 1,800 GB/s to NVSwitch
  • NVLink C2C with 900 GB/s CPU to GPU

These figures map directly to the two scaling patterns AI infrastructure now uses: scale-up for large models and scale-out for large fleets.

How much faster is B300 versus B200?

  • NVIDIA claims that DGX B300 boosts dense FP4 performance by 1.5x and attention performance by 2x over DGX B200.
  • Tom’s Hardware also reported the B300 as 1.5x faster than B200 in dense FP4 terms, with 288GB HBM3E and 15 PFLOPS dense FP4 listed for B300.

Taken together, the intent behind the B300 is consistent: Blackwell Ultra is an ‘Ultra’ refresh that targets the low-precision inference and attention-heavy behavior that reasoning workloads amplify.

What is NVIDIA GB300?

If B300 is the GPU, GB300 is the way NVIDIA wants you to deploy it at scale. NVIDIA describes the Grace Blackwell Ultra Superchip as coupling one Grace CPU with two Blackwell Ultra GPUs via NVLink C2C, and positions this Superchip as the foundational compute block for the GB300 NVL72 rack-scale system.

Key GB300 NVL72 highlights

  • NVIDIA’s GB300 NVL72 comes with a rack configuration of 72 Blackwell Ultra GPUs and 36 Grace CPUs, connected with 130 TB/s of NVLink bandwidth, plus 37 TB of fast memory.
  • Microsoft Azure’s coverage echoes these figures, emphasizing 130 TB/s of intra-rack NVLink bandwidth, 37TB of fast memory, and up to 1,440 PFLOPS of FP4 Tensor Core performance per rack.
  • NVIDIA’s Blackwell Ultra datasheet adds a different angle, describing the rack as a unified NVLink domain with up to 37 TB of high-speed memory coupled with 1.44 exaFLOPS of compute.
  • NVIDIA says the platform delivers up to a 50x overall increase in AI factory output compared with Hopper-based platforms when paired with NVIDIA networking and management software.

NOTE: The datasheet text references 279 GB of HBM3E per Blackwell Ultra chip, while other NVIDIA materials describe ‘up to 288GB’ by SKU. In practice, vendors may be reporting physical capacity, usable capacity, or capacity in different units, depending on ECC and how memory is counted.
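One check worth doing yourself is the decimal-versus-binary unit conversion. As this quick sketch shows, units alone do not reconcile 288 with 279, which is consistent with the note that ECC and usable-capacity accounting also play a role:

```python
# Decimal (GB) vs binary (GiB) readings of the same physical capacity.
GB, GiB = 1e9, 2**30

print(f"288 GB = {288 * GB / GiB:.1f} GiB")  # ~268.2 GiB

# 279 sits between 268.2 and 288, so unit conversion by itself cannot
# explain the datasheet figure; reserved/usable capacity likely does.
```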

If you need help finding the right GPU for your workload, we have your back. Book your free consultation now and connect with our friendly Cloud GPU experts.

Performance and platform claims

From NVIDIA’s March 2025 announcement, the GB300 NVL72 is positioned to deliver 1.5x more AI performance than GB200 NVL72, which NVIDIA frames as a large revenue-opportunity uplift for AI factories compared with Hopper platforms.

NVIDIA also claims that GB300 NVL72 delivers 50x higher AI factory output. This combines 10x better latency, measured as tokens per second per user, with 5x higher throughput per megawatt relative to Hopper platforms.
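The arithmetic behind the headline number is straightforward multiplication of those two claimed gains, under NVIDIA’s own framing of factory output as per-user responsiveness times per-megawatt throughput:

```python
# Composition of NVIDIA's claimed 50x "AI factory output" uplift vs Hopper,
# using the two multipliers NVIDIA itself cites.
per_user_latency_gain = 10   # claimed tokens/s-per-user improvement
per_megawatt_gain = 5        # claimed throughput-per-MW improvement

print(f"{per_user_latency_gain * per_megawatt_gain}x claimed output uplift")
```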

Key NVIDIA Blackwell Ultra Configurations: HGX B300, DGX B300, and Partner Racks

The B300 and GB300 names are easier to reason about when you map them to real products.

HGX B300 NVL16

NVIDIA describes HGX B300 NVL16 as delivering major gains versus Hopper. It claims 11x faster inference on large language models, 7x more compute, and 4x larger memory compared with the Hopper generation.

ServeTheHome’s coverage points to a platform shift as well, citing up to 2.3TB of HBM3E memory onboard and highlighting the NVL16 design approach.

NVIDIA’s HGX platform also claims up to 2.6x higher training performance for models such as DeepSeek R1. It calls out up to 14.4 TB/s of NVLink Switch bandwidth with over 2 TB of high-speed memory.

DGX B300

DGX B300 is the enterprise-friendly version of B300 that still resembles a traditional server. NVIDIA lists DGX B300 with:

  • 8x Blackwell Ultra SXM GPUs
  • 2.1 TB total GPU memory
  • Up to 14.4 TB/s aggregate NVLink bandwidth
  • 144 PFLOPS FP4 Tensor Core performance (dense and sparse are listed as different values)

NVIDIA also states that DGX B300 systems are shipping now, meaning the gating factor is no longer the GPU announcement but rack readiness, power, and network integration.

Partner ramps and volume signals

Supermicro announced it was delivering NVIDIA HGX B300 systems and GB300 NVL72 racks in volume to customers worldwide, noting B300 and GB300 configurations that can draw up to 1,400W per GPU in certain setups.

Those details matter because by 2026, the hard part is often not ordering GPUs. It is qualifying facilities, cooling loops, and power distribution that can tolerate rapid GPU load ramps.

Key Use Cases for NVIDIA Blackwell Ultra

If your roadmap includes long context, agentic behavior, or high concurrency, these are the use cases that most directly benefit from Blackwell Ultra’s design priorities.

Reasoning inference (test-time scaling, long thinking)

Multi-step reasoning runs longer and revisits intermediate states. Extra compute helps maintain quality when models must verify, search, or refine outputs across many iterations.

Agentic AI services

Agents often call tools, branch into sub-tasks, and coordinate multiple steps. That workflow increases GPU utilization variance, so stable performance under bursty demand matters.

Long-context serving

Long prompts and large documents expand KV cache requirements quickly. More on-GPU memory and bandwidth reduce the need to spill to slower tiers that can add latency.

High-concurrency AI factory inference

Production systems care about tokens per second per user, not just peak throughput. Rack-scale designs target predictable behavior when many sessions run at once.

Frontier training and post-training

Training remains important, but post-training can be equally heavy, including fine-tuning and synthetic data loops. Faster iteration shortens the time from model idea to deployed capability.

Multimodal generative AI

Many workloads mix text with other modalities and add reasoning on top. The platform is built to support these blended pipelines without constant architecture changes.

Physical AI and world or video generation

Simulation and generation workloads can be both compute-intensive and latency-sensitive. When outputs must be produced quickly and repeatedly, scale and efficiency become decisive.

Networking with Blackwell Ultra: ConnectX 8, Quantum X800, and Spectrum X

Blackwell Ultra is designed to scale inside a rack and then scale across racks, and NVIDIA is unusually explicit about the network being part of the platform.

  • The March 2025 press release emphasized 800 Gb/s of data throughput per GPU through the ConnectX 8 SuperNIC, plus integration with Spectrum X Ethernet and Quantum X800 InfiniBand fabrics.
  • Microsoft Azure describes 800 Gb/s per GPU of cross-rack scale-out bandwidth via Quantum X800 InfiniBand and a non-blocking fabric approach that scales to tens of thousands of GPUs.

This is where the idea of the “AI factory” becomes concrete. It is a system in which the GPU, NIC, switch, and orchestration layer are designed together, so a fleet behaves predictably under inference bursts.

Power and Cooling in NVIDIA Blackwell Ultra

If you are planning for 2026, the most underestimated part of NVIDIA Blackwell Ultra adoption is not the GPU count. It is the facility.

Supermicro’s GB300 NVL72 datasheet lists an operating power range of 132 kW to 140 kW for the rack, alongside direct liquid cooling options and CDU configurations.

Microsoft described GB300 NVL72 racks delivering over 130 TB/s NVLink bandwidth and up to 136 kW of compute power in a single cabinet.

NVIDIA also talks about rack power management innovations, including ‘power smoothing.’ This is another signal that the platform is designed for the reality of spiky and high-utilization inference workloads.

If you are evaluating B300 versus GB300, this becomes a practical question:

  • DGX B300 and HGX B300 often fit into more familiar deployment patterns, even if power density is still high.
  • GB300 NVL72 pushes you toward liquid cooling, rack scale integration, and facility level planning by default.

Blackwell Ultra’s CUDA Compatibility and NVIDIA Dynamo

Blackwell Ultra’s success depends on whether it is easy to use at scale. NVIDIA stresses complete CUDA compatibility while highlighting optimized support across inference stacks and frameworks, including SGLang, TensorRT-LLM, and vLLM.

The other major software component is NVIDIA Dynamo, an open-source distributed inference and scheduling framework meant to orchestrate inference across many GPUs.

NVIDIA introduced Dynamo alongside Blackwell Ultra, describing it as a way to scale reasoning AI services with throughput and cost improvements. It claims that Dynamo can deliver up to 30x higher throughput for large scale deployments in certain contexts.

For 2026, this is critical because many organizations will not measure success by peak FLOPS. They will measure it by tokens per second per user, cost per million tokens, and cluster utilization under live traffic.
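To see how those metrics fall out of a few operational inputs, here is a toy serving-economics sketch. Every number is a hypothetical placeholder, not a quoted price or measured throughput:

```python
# Toy serving-economics model: cost per million tokens from cluster inputs.
# All inputs are hypothetical placeholders, not quoted prices or specs.

rack_cost_per_hour = 350.0      # $/h: amortized capex + power + facility
rack_tokens_per_sec = 250_000   # aggregate decode throughput under live traffic
utilization = 0.6               # fraction of each hour spent on useful work

tokens_per_hour = rack_tokens_per_sec * 3600 * utilization
cost_per_mtok = rack_cost_per_hour / (tokens_per_hour / 1e6)
print(f"${cost_per_mtok:.3f} per million tokens")
```

The utilization term is the lever Dynamo-style scheduling targets: raising it directly lowers cost per million tokens without touching the hardware.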

Blackwell Ultra Checklist for 2026 Planning

If you are budgeting or designing infrastructure around B300 and GB300, these are the items that tend to decide the timeline more than the GPU itself.

| Factor | What to capture | Key questions |
| --- | --- | --- |
| Facility readiness | Power density, heat rejection, and ability to support liquid cooling at scale | What rack/row power density can you sustain? What is the max heat rejection? Can you deploy and operate liquid cooling at scale (not just pilots)? |
| Network fabric | Choice of fabric and rack-scale topology/cabling approach | Are you standardizing on Quantum X800 InfiniBand or Spectrum X Ethernet? What topology (fat-tree/leaf-spine/dragonfly, etc.) and oversubscription? How are cabling, optics, and rack-scale layout handled? |
| Software stack maturity | Serving stack components and architecture maturity | Are you using Dynamo, TensorRT-LLM, and/or vLLM? How does serving use KV cache? Do you support disaggregated serving (prefill/decode split)? |
| Memory strategy | HBM capacity confirmation vs. workload/model requirements | What is the usable HBM per GPU for your specific SKU? What model sizes and long-context targets fit within that usable HBM, including KV cache growth? (See the fit-check sketch after this table.) |
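For the memory strategy row, the fit check is simple enough to script. Here is a minimal sketch, consistent with the KV sizing math earlier and using hypothetical inputs:

```python
# Does (weights + KV cache) fit in usable HBM? All inputs are hypothetical.
def fits(usable_hbm_gib, weight_gib, kv_gib_per_seq, concurrent_seqs, headroom=0.9):
    needed = weight_gib + kv_gib_per_seq * concurrent_seqs
    # Keep slack for activations, fragmentation, and framework overhead.
    return needed <= usable_hbm_gib * headroom

# e.g. 8 GPUs x ~268 GiB usable, 4-bit 70B weights (~33 GiB),
# ~39 GiB of KV cache per sequence at 128k context, 40 concurrent sessions.
print(fits(usable_hbm_gib=8 * 268, weight_gib=33, kv_gib_per_seq=39, concurrent_seqs=40))
```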

Looking into 2026, NVIDIA Blackwell Ultra is built for the economics of AI factories. The winners will be the teams that can convert power and silicon into reliable, low-latency tokens at scale.

Ready to Use Blackwell Ultra in 2026?

It is tempting to reduce every GPU generation to a single number. More FLOPS, more bandwidth, more memory. Blackwell Ultra does deliver those, and the B300 and GB300 names will be attached to plenty of benchmarks and procurement headlines.

Still, the more important shift is architectural: NVIDIA is treating reasoning-era AI as a data center product, not a server part.

  • B300 brings the core Blackwell Ultra advances into HGX and DGX building blocks.
  • GB300 NVL72 turns those blocks into a rack that behaves like a single giant accelerator.

You get a complete package when you couple all that with NVLink domain design, integrated Grace CPUs, and networking that assumes you will scale across racks from day one.

Finding it challenging to migrate away from hyperscalers? Book your free consultation to see how we make migration simple, helping companies like you save almost 60 percent on cloud operating costs. Connect today!

Frequently Asked Questions

What is inference-time scaling, or ‘long thinking’?
Multi-step inference (search/verify/tool use) that can require up to 100x more compute than one-shot inference.

What is NVIDIA Blackwell Ultra?
NVIDIA’s 2025 Blackwell evolution built for reasoning, agentic workflows, and long-context inference.

What is the difference between B300 and GB300?
B300 is the GPU, while GB300 is the Grace Blackwell Ultra platform that scales into the NVL72 rack.

How much faster is B300 than B200?
NVIDIA claims roughly 1.5x higher dense FP4 and ~2x higher attention performance (DGX B300 vs DGX B200).

What is GB300 NVL72?
A rack-scale system containing 72 Blackwell Ultra GPUs and 36 Grace CPUs in a unified NVLink domain, with approx. 37TB of fast memory and approx. 1.44 EFLOPS of FP4 per rack.

Why do HBM capacity figures differ between NVIDIA materials?
SKU variation and usable-versus-physical reporting. You should confirm the usable HBM for your exact SKU.

What is the biggest non-GPU constraint for 2026 deployments?
Facility readiness: power density, heat rejection, and liquid cooling, plus rack-scale network/topology integration.

Jason Karlin
author
Industry veteran with over 10 years of experience architecting and managing GPU-powered cloud solutions. Specializes in enabling scalable AI/ML and HPC workloads for enterprise and research applications. Former lead solutions architect for top-tier cloud providers and startups in the AI infrastructure space.
