
Monday, June 22, 2026

NVIDIA H200 vs B300: precisions, cores, and what the datasheet numbers mean

Hopper against Blackwell Ultra. This piece covers not only the specs of both accelerators but also the concepts those specs are built on: FP formats, FLOPS, and the difference between CUDA and Tensor cores. No marketing numbers, only real dense figures.

 


Two generations, two philosophies

The H200 and B300 are a full architecture generation apart. The H200 belongs to the Hopper generation and shares its silicon (the GH100 die) with the H100 — it is essentially an H100 with an upgraded memory subsystem. The B300 (officially Blackwell Ultra) is new silicon with a dual-reticle design: two full dies connected by a high-speed NV-HBI interconnect, totaling 208 billion transistors.

The crucial difference, however, is in intent. The H200 remains a general-purpose accelerator for both AI and HPC: it retains full FP64 performance for scientific simulations. The B300, by contrast, is tuned primarily for inference, reasoning models, and MoE architectures; its FP64 performance is deliberately cut to roughly 1.25 TFLOPS, a small fraction of the H200's 34 TFLOPS. With this, NVIDIA clearly signals a shift toward AI workloads at the expense of classic HPC.

Specification comparison

Values are per GPU, SXM variant, in dense mode (without 2:4 structured sparsity, which roughly doubles the figures).

| Parameter | H200 (SXM) | B300 (Blackwell Ultra) |
| --- | --- | --- |
| Architecture | Hopper (GH100 die) | Blackwell Ultra (2 dies) |
| Transistors | ~80 B | 208 B |
| Memory | 141 GB HBM3e | 288 GB HBM3e (12-high) |
| Memory bandwidth | 4.8 TB/s | 8 TB/s |
| FP4 (dense) | no native support | ~15 PFLOPS |
| FP8 (dense) | ~1,979 TFLOPS | ~7,000 TFLOPS |
| FP16 / BF16 (dense) | ~989 TFLOPS | ~3,500 TFLOPS |
| FP64 | 34 TFLOPS (Tensor: 67 TFLOPS) | ~1.25 TFLOPS |
| CUDA cores | 16,896 | 20,480 |
| Tensor cores | 528 (4th gen) | 640 (5th gen) |
| SMs | 132 | 160 |
| TDP | 700 W | 1,400 W |
| NVLink | 4th gen, 900 GB/s | 5th gen, 1.8 TB/s |
| PCIe | Gen5 | Gen6 |
| Focus | general AI + HPC | inference / reasoning |

Note: The H200 also comes in an NVL variant (PCIe card, 600 W, no NVSwitch fabric). The B300 is deployed in practice in the rack-scale GB300 NVL72 system — 72 GPUs and 36 Grace CPUs with liquid cooling.
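To make the table easier to read at a glance, the ratios it implies can be computed directly. The sketch below only re-uses the dense per-GPU figures listed above; the dictionary keys and the helper names `ratios` and `per_watt` are made up for this illustration.

```python
# Dense per-GPU figures copied from the comparison table above.
H200 = {"memory_gb": 141, "bandwidth_tbs": 4.8, "fp8_tflops": 1979,
        "fp16_tflops": 989, "fp64_tflops": 34, "tdp_w": 700}
B300 = {"memory_gb": 288, "bandwidth_tbs": 8.0, "fp8_tflops": 7000,
        "fp16_tflops": 3500, "fp64_tflops": 1.25, "tdp_w": 1400}


def ratios(new, old):
    """Return new/old for every metric that appears in both dicts."""
    return {key: new[key] / old[key] for key in old if key in new}


def per_watt(spec, metric):
    """Divide one metric by the TDP to get a rough per-watt figure."""
    return spec[metric] / spec["tdp_w"]


for key, value in ratios(B300, H200).items():
    print(f"{key:>14}: {value:.2f}x")

# Peak FP8 throughput per watt of TDP (theoretical, not measured).
print(f"H200 FP8 TFLOPS per W: {per_watt(H200, 'fp8_tflops'):.2f}")
print(f"B300 FP8 TFLOPS per W: {per_watt(B300, 'fp8_tflops'):.2f}")
```

These are ratios of theoretical peaks against TDP, so they say nothing about measured efficiency on a real workload.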

FP formats: why performance is listed per precision

FP stands for floating point, and the number indicates how many bits represent a single number. Each format splits the bits into sign, exponent (range), and mantissa (precision). Fewer bits mean lower precision, but also more operations per second and a smaller memory footprint.
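The sign/exponent/mantissa split can be inspected with Python's standard `struct` module. This is a minimal sketch that covers only FP16, FP32, and FP64, because those are the formats `struct` can pack; the `LAYOUTS` table and the helper names `fields` and `round_trip` are assumptions made for this example.

```python
import struct

# format name -> (float pack code, same-width unsigned int code,
#                 exponent bits, mantissa bits)
LAYOUTS = {
    "FP16": ("e", "H", 5, 10),
    "FP32": ("f", "I", 8, 23),
    "FP64": ("d", "Q", 11, 52),
}


def fields(value, fmt):
    """Split a number into (sign, biased exponent, mantissa) bit fields."""
    float_code, int_code, exp_bits, man_bits = LAYOUTS[fmt]
    packed = struct.pack("<" + float_code, value)
    raw = struct.unpack("<" + int_code, packed)[0]
    sign = raw >> (exp_bits + man_bits)
    exponent = (raw >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = raw & ((1 << man_bits) - 1)
    return sign, exponent, mantissa


def round_trip(value, fmt):
    """Store a number in the given format and read it back."""
    float_code = LAYOUTS[fmt][0]
    return struct.unpack("<" + float_code, struct.pack("<" + float_code, value))[0]


for fmt in LAYOUTS:
    print(fmt, fields(-2.0, fmt), round_trip(0.1, fmt))
```

Reading 0.1 back from FP16 gives 0.0999755859375, which shows how the shorter mantissa costs precision.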

| Format | Bits | Typical use |
| --- | --- | --- |
| FP64 | 64 | Scientific simulations, physics, HPC |
| FP32 | 32 | General GPU compute, older training |
| FP16 | 16 | Training and inference, smaller range |
| BF16 | 16 | Training: larger exponent, more stable |
| FP8 | 8 | Inference, increasingly LLM training too |
| FP4 / NVFP4 | 4 | Extreme quantization, the main Blackwell Ultra format |

This explains the rows in the table above: the same hardware (H200) delivers ~34 TFLOPS in FP64 but ~1,979 TFLOPS in FP8 — lower precision means a simpler operation and more parallelism. The process of converting a model from higher precision to lower is called quantization, and it is why the B300, at 15 PFLOPS in FP4, is so powerful for LLM inference: large models tolerate low precision.
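The memory side of quantization is simple arithmetic: bits per value times the number of values. The sketch below assumes a hypothetical 70-billion-parameter model (not a figure from this article) and counts the weights only; it ignores the KV cache, activations, and any scaling metadata that a real quantization scheme stores alongside the values.

```python
BITS_PER_VALUE = {"FP64": 64, "FP32": 32, "FP16": 16, "BF16": 16, "FP8": 8, "FP4": 4}

H200_MEMORY_GB = 141  # from the comparison table
B300_MEMORY_GB = 288  # from the comparison table


def weight_gigabytes(num_params, fmt):
    """Size of the weights alone, in GB (10^9 bytes)."""
    return num_params * BITS_PER_VALUE[fmt] / 8 / 1e9


PARAMS = 70e9  # hypothetical model size, chosen only for illustration

for fmt in ("FP16", "FP8", "FP4"):
    size = weight_gigabytes(PARAMS, fmt)
    print(f"{fmt}: {size:.0f} GB of weights, "
          f"fits on one H200: {size <= H200_MEMORY_GB}, "
          f"fits on one B300: {size <= B300_MEMORY_GB}")
```

Under these assumptions the FP16 weights take 140 GB and only just fit into the H200's 141 GB, while the FP4 version takes 35 GB and leaves most of the memory free for other data.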

What FLOPS means

FLOPS = Floating Point Operations Per Second, the number of floating-point operations performed per second. The prefixes scale by three orders of magnitude each: GFLOPS (gigaflops, 10^9), TFLOPS (teraflops, 10^12), PFLOPS (petaflops, 10^15), EFLOPS (exaflops, 10^18).

Two distinctions matter when reading a datasheet. First, dense vs sparse: sparse figures tend to be roughly 2× higher but require a specific weight pattern (2:4 structured sparsity, in which two of every four weights are zero), so dense is the more realistic number. Second, theoretical peak vs real-world performance: the listed FLOPS are a theoretical maximum, and actual performance is often lower because the GPU waits on data from memory. That is why, for memory-bound inference, memory bandwidth matters as much as FLOPS, and why the H200's advantage over the H100 comes from its 4.8 TB/s memory subsystem rather than from added compute.
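One common way to reason about the peak-versus-memory trade-off is the roofline model: attainable throughput is the smaller of the compute peak and bandwidth multiplied by arithmetic intensity (FLOPs performed per byte moved). The sketch below applies it to the dense figures from the table; the function names and the example intensity of 2 FLOPs per byte are assumptions chosen for illustration, not measurements.

```python
def attainable_tflops(peak_tflops, bandwidth_tbs, flops_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof.

    TB/s multiplied by FLOPs/byte gives TFLOPS, so the units line up.
    """
    return min(peak_tflops, bandwidth_tbs * flops_per_byte)


def ridge_point(peak_tflops, bandwidth_tbs):
    """Arithmetic intensity (FLOPs/byte) above which compute is the limit."""
    return peak_tflops / bandwidth_tbs


# Dense figures from the comparison table.
H200_FP8 = (1979, 4.8)    # (peak TFLOPS, bandwidth TB/s)
B300_FP4 = (15000, 8.0)

LOW_INTENSITY = 2  # assumed FLOPs per byte for a memory-bound kernel

for name, (peak, bandwidth) in (("H200 FP8", H200_FP8), ("B300 FP4", B300_FP4)):
    print(f"{name}: ridge at {ridge_point(peak, bandwidth):.0f} FLOPs/byte, "
          f"bound at intensity {LOW_INTENSITY}: "
          f"{attainable_tflops(peak, bandwidth, LOW_INTENSITY):.1f} TFLOPS")
```

At such a low intensity both GPUs sit far below their compute peak, and the gap between them shrinks to the bandwidth ratio (8 / 4.8, about 1.67×) rather than the ratio of their FLOPS.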

CUDA cores vs Tensor cores

Both are compute units inside the GPU, but they differ in what they compute. CUDA cores are general-purpose: each core works on one scalar operation at a time (at best one per clock), and a GPU has thousands of them. They handle any type of computation and a range of precisions, including FP64. For matrix multiplication, however, they are not the most efficient choice, because every multiply-add in the product is issued as a separate scalar operation.
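A theoretical peak for scalar units is usually estimated as units × FLOPs per clock × clock rate. The sketch below applies that to the H200's CUDA core count. Two inputs are assumptions that do not come from this article: a boost clock of about 1.98 GHz, and FP64 running on half as many lanes as there are CUDA cores. A fused multiply-add is counted as two FLOPs.

```python
def peak_tflops(units, flops_per_unit_per_clock, clock_ghz):
    """Theoretical peak in TFLOPS: units x FLOPs per clock x clock rate."""
    gflops = units * flops_per_unit_per_clock * clock_ghz
    return gflops / 1000.0


CUDA_CORES_H200 = 16_896     # from the comparison table
FLOPS_PER_FMA = 2            # one fused multiply-add counts as two FLOPs
ASSUMED_CLOCK_GHZ = 1.98     # assumption, not a figure from this article
ASSUMED_FP64_FRACTION = 0.5  # assumption: half as many FP64 lanes as cores

fp32_peak = peak_tflops(CUDA_CORES_H200, FLOPS_PER_FMA, ASSUMED_CLOCK_GHZ)
fp64_peak = peak_tflops(CUDA_CORES_H200 * ASSUMED_FP64_FRACTION,
                        FLOPS_PER_FMA, ASSUMED_CLOCK_GHZ)

print(f"Estimated FP32 peak: {fp32_peak:.1f} TFLOPS")
print(f"Estimated FP64 peak: {fp64_peak:.1f} TFLOPS")
```

Under these assumptions the FP64 estimate lands near 33.5 TFLOPS, close to the 34 TFLOPS listed in the table, which suggests the datasheet figure is a peak of this kind rather than a measured result.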

Tensor cores are specialized for one operation — matrix multiply-accumulate (MMA, in the form D = A × B + C). Instead of a scalar, they process an entire small matrix at once, which is precisely the computation that dominates both training and inference of neural networks. A single Tensor core therefore replaces dozens of CUDA cores for this type of task and supports the low precisions FP16/FP8/FP4.
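To see how much scalar work one matrix multiply-accumulate contains, the sketch below computes D = A × B + C in plain Python and counts the scalar multiply-adds it needs. The 4×4 tile size and the function name `mma` are assumptions for illustration; the tile shapes that real Tensor cores operate on depend on the generation and the precision.

```python
def mma(a, b, c):
    """Return (D, fma_count) where D = A x B + C.

    Matrices are lists of rows. fma_count is the number of scalar
    multiply-adds a purely scalar implementation has to perform.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [row[:] for row in c]  # start from a copy of the accumulator C
    fma_count = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                d[i][j] += a[i][k] * b[k][j]
                fma_count += 1
    return d, fma_count


N = 4  # illustrative tile size
identity = [[1 if i == j else 0 for j in range(N)] for i in range(N)]
b = [[i * N + j for j in range(N)] for i in range(N)]
ones = [[1] * N for _ in range(N)]

d, fmas = mma(identity, b, ones)
print(d[0])   # first row of B plus one
print(fmas)   # N * N * N scalar multiply-adds
```

Even this small tile needs 64 scalar multiply-adds, which is the work a matrix unit can take on as a single operation instead of 64 separate ones.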


| | CUDA cores | Tensor cores |
| --- | --- | --- |
| Task | general computation | matrix multiplication |
| Unit of work | scalar | matrix |
| Precisions | FP64, FP32, INT… | FP16, FP8, FP4, BF16… |
| Count per GPU | thousands (H200: 16,896) | hundreds (H200: 528) |
| Use | HPC, graphics, general code | AI training and inference |

This closes the loop. When a datasheet lists “15 PFLOPS FP4,” that is Tensor core performance. The H200's 34 TFLOPS FP64 figure, by contrast, comes from the general-purpose CUDA side of the chip; its FP64 Tensor figure is listed separately at 67 TFLOPS. The B300 therefore has enormous Tensor performance at low precisions but negligible FP64: NVIDIA equipped it with few double-precision units because it targets AI, not scientific simulation. Tensor cores also evolve with each generation: the H200 carries the 4th generation, the B300 the 5th, which adds support for the new NVFP4 format.

Summary

The B300 offers roughly double the memory (288 GB vs 141 GB), about 1.7× the bandwidth (8 TB/s vs 4.8 TB/s), and roughly 3.5× the dense FP8 and FP16 performance, but its FP64 drops to about 1.25 TFLOPS and its power draw doubles to 1,400 W. For large-scale LLM inference and reasoning, it is significantly more efficient. The H200 remains the choice where a mixed AI + HPC workload with full double precision is required.

Values are based on public NVIDIA datasheets and specifications as of 2026. The listed FLOPS are theoretical dense figures and vary in practice depending on the workload.
