My Personal IT Infrastructure Knowledge Base: NVIDIA H200 vs B300: precisions, cores, and what the datasheet numbers mean

Hopper against Blackwell Ultra. This piece covers not just the specs of both accelerators, but also the concepts the specs are built on — FP formats, FLOPS, and the difference between CUDA and Tensor cores. No marketing, real dense figures.

Two generations, two philosophies

The H200 and B300 are a full architecture generation apart. The H200 belongs to the Hopper generation and shares its silicon (the GH100 die) with the H100 — it is essentially an H100 with an upgraded memory subsystem. The B300 (officially Blackwell Ultra) is new silicon with a dual-reticle design: two full dies connected by a high-speed NV-HBI interconnect, totaling 208 billion transistors.

The crucial difference, however, is in intent. The H200 remains a general-purpose accelerator for both AI and HPC: it retains full FP64 performance for scientific simulations. The B300, by contrast, is tuned primarily for inference, reasoning models, and MoE architectures; its FP64 performance is deliberately cut to about 1.25 TFLOPS, a small fraction of the H200's 34 TFLOPS. With this, NVIDIA clearly signals a shift toward AI workloads at the expense of classic HPC.

Specification comparison

Values are per GPU, SXM variant, in dense mode (without 2:4 structured sparsity, which roughly doubles the quoted Tensor core figures).

Parameter	H200 (SXM)	B300 (Blackwell Ultra)
Architecture	Hopper (GH100 die)	Blackwell Ultra (2 dies)
Transistors	~80 B	208 B
Memory	141 GB HBM3e	288 GB HBM3e (12-high)
Memory bandwidth	4.8 TB/s	8 TB/s
FP4 (dense)	no native support	~15 PFLOPS
FP8 (dense)	~1,979 TFLOPS	~7,000 TFLOPS
FP16 / BF16 (dense)	~989 TFLOPS	~3,500 TFLOPS
FP64	34 TFLOPS (Tensor 67)	~1.25 TFLOPS
CUDA cores	16,896	20,480
Tensor cores	528 (4th gen)	640 (5th gen)
SM	132	160
TDP	700 W	1,400 W
NVLink	4th gen, 900 GB/s	5th gen, 1.8 TB/s
PCIe	Gen5	Gen6
Focus	general AI + HPC	inference / reasoning

Note: The H200 also comes in an NVL variant (PCIe card, 600 W, no NVSwitch fabric). The B300 is deployed in practice in the rack-scale GB300 NVL72 system — 72 GPUs and 36 Grace CPUs with liquid cooling.
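
To make the 2:4 structured sparsity excluded from the table concrete: the pattern requires that, in every group of four consecutive weights, at most two are non-zero. The sketch below enforces that pattern by keeping the two largest magnitudes per group. The function name and the magnitude-based selection are illustrative assumptions, not NVIDIA's pruning tooling, and real workflows usually fine-tune the model after pruning.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes; ties go to the earlier index.
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.0]
print(prune_2_of_4(row))
# [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
```

Half of the weights become zero in a position the hardware can predict, which is where the roughly 2× sparse figure comes from. A model that has not been pruned this way only sees the dense number.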

FP formats: why performance is listed per precision

FP stands for floating point, and the number indicates how many bits represent a single number. Each format splits the bits into sign, exponent (range), and mantissa (precision). Fewer bits mean lower precision, but also more operations per second and a smaller memory footprint.

Format	Bits	Typical use
FP64	64	Scientific simulations, physics, HPC
FP32	32	General GPU compute, older training
FP16	16	Training and inference, smaller range
BF16	16	Training — larger exponent, more stable
FP8	8	Inference, increasingly LLM training too
FP4 / NVFP4	4	Extreme quantization, the main Blackwell Ultra format
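
As a rough illustration of the sign/exponent/mantissa split and the memory footprint, the sketch below lists the commonly used bit layouts and computes raw weight storage for a hypothetical 70-billion-parameter model. The model size is an assumption chosen for round numbers, FP8 is shown in its two usual variants (E4M3 and E5M2), and the footprint ignores scale factors and other metadata that block-scaled formats such as NVFP4 add on top.

```python
# (sign, exponent, mantissa) bit counts per format
LAYOUTS = {
    "FP64": (1, 11, 52),
    "FP32": (1, 8, 23),
    "FP16": (1, 5, 10),
    "BF16": (1, 8, 7),
    "FP8 E4M3": (1, 4, 3),
    "FP8 E5M2": (1, 5, 2),
    "FP4 E2M1": (1, 2, 1),
}


def total_bits(name):
    """Total width of one value in the given format."""
    return sum(LAYOUTS[name])


def weights_gb(n_params, name):
    """Raw weight storage in GB (10^9 bytes), ignoring scales and metadata."""
    return n_params * total_bits(name) / 8 / 1e9


N_PARAMS = 70e9  # hypothetical model size, not taken from the article

for name, (s, e, m) in LAYOUTS.items():
    print(f"{name:9} {total_bits(name):2d} bits "
          f"(sign {s}, exp {e}, mantissa {m}) "
          f"{weights_gb(N_PARAMS, name):6.1f} GB")
```

Under these assumptions the same model needs 140 GB in FP16, 70 GB in FP8, and 35 GB in FP4. Note also that FP16 and BF16 have the same width but split it differently: BF16 keeps the 8-bit exponent of FP32 and gives up mantissa bits.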

This explains the rows in the table above: the same hardware (H200) delivers ~34 TFLOPS in FP64 but ~1,979 TFLOPS in FP8. Narrower operands need less silicon and less memory traffic per operation, so far more of them can be processed in parallel. The process of converting a model from higher precision to lower is called quantization, and it is why the B300, at 15 PFLOPS in FP4, is so powerful for LLM inference: large models generally tolerate low precision.
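
A minimal sketch of what quantization to a 4-bit float can look like, assuming the E2M1 layout (1 sign, 2 exponent, 1 mantissa bit), whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6. Each block of values shares one scale so that its largest magnitude maps to 6. This is a simplification for illustration: production schemes such as NVFP4 differ in block size, in how the scale itself is encoded, and in rounding details, and the function names here are hypothetical.

```python
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(values):
    """Return (scale, codes); each code is a signed value on the E2M1 grid."""
    peak = max(abs(v) for v in values)
    scale = peak / 6.0 if peak > 0 else 1.0  # largest magnitude maps to 6
    codes = []
    for v in values:
        target = abs(v) / scale
        # Round to the nearest representable magnitude.
        nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Map grid values back to approximate original values."""
    return [scale * c for c in codes]


block = [0.2, -3.0, 1.3, 0.8]
scale, codes = quantize_block(block)
print(scale, codes)                    # 0.5 [0.5, -6.0, 3.0, 1.5]
print(dequantize_block(scale, codes))  # [0.25, -3.0, 1.5, 0.75]
```

The round trip turns 1.3 into 1.5 and 0.2 into 0.25: each value now needs only four bits plus a share of the block's scale, at the cost of a rounding error that the model has to tolerate.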

What FLOPS means

FLOPS = Floating Point Operations Per Second, the number of floating-point operations performed per second. The prefixes scale by order of magnitude: GFLOPS (10⁹), TFLOPS — teraflops (10¹²), PFLOPS — petaflops (10¹⁵), EFLOPS — exaflops (10¹⁸).
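
Because datasheets mix TFLOPS and PFLOPS, it helps to normalize figures before comparing them. The helper below is a small illustrative sketch (the function name is made up for this example) that prints a raw operations-per-second value with the largest fitting prefix.

```python
PREFIXES = [(1e18, "EFLOPS"), (1e15, "PFLOPS"), (1e12, "TFLOPS"), (1e9, "GFLOPS")]


def format_flops(flops):
    """Format a FLOPS value using the largest prefix that keeps it >= 1."""
    for factor, unit in PREFIXES:
        if flops >= factor:
            return f"{flops / factor:g} {unit}"
    return f"{flops:g} FLOPS"


print(format_flops(1979e12))  # H200 FP8 dense -> 1.979 PFLOPS
print(format_flops(15e15))    # B300 FP4 dense -> 15 PFLOPS
print(format_flops(34e12))    # H200 FP64      -> 34 TFLOPS
```

Written this way, the H200's 1,979 TFLOPS in FP8 is just under 2 PFLOPS, which makes it directly comparable with the B300's 15 PFLOPS in FP4.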

Two distinctions matter when reading a datasheet. First, dense vs sparse: sparse figures tend to be roughly 2× higher but require a specific weight pattern (2:4 structured sparsity, in which two of every four consecutive weights are zero), so dense is the more realistic number. Second, theoretical peak vs real-world performance: the listed FLOPS are a theoretical maximum, and actual performance is often lower because the GPU waits on data from memory. That is why, for memory-bound inference, memory bandwidth matters as much as FLOPS, and why the H200's gain over the H100 comes from its larger, faster memory (4.8 TB/s) rather than from added compute.
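
The gap between peak and real-world performance can be estimated with a simple roofline model: attainable throughput is the lower of peak compute and memory bandwidth multiplied by arithmetic intensity (FLOPs performed per byte moved). The sketch below uses the dense figures from the specification table. The intensity of 2 FLOP/byte is an assumption that roughly corresponds to reading each one-byte weight once and using it for a single multiply-add, as in small-batch token generation.

```python
def attainable_flops(peak_flops, bandwidth_bytes_s, flops_per_byte):
    """Roofline bound: the lower of the compute limit and the memory limit."""
    return min(peak_flops, bandwidth_bytes_s * flops_per_byte)


def ridge_point(peak_flops, bandwidth_bytes_s):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / bandwidth_bytes_s


GPUS = {
    "H200 FP8": (1979e12, 4.8e12),  # dense FLOPS, bytes/s
    "B300 FP4": (15e15, 8e12),
}
INTENSITY = 2.0  # assumed FLOP/byte for a memory-bound workload

for name, (peak, bandwidth) in GPUS.items():
    ridge = ridge_point(peak, bandwidth)
    bound = attainable_flops(peak, bandwidth, INTENSITY)
    print(f"{name}: ridge ~{ridge:.0f} FLOP/byte, "
          f"at {INTENSITY:g} FLOP/byte ~{bound / 1e12:.1f} TFLOPS")
```

Under these assumptions the H200 tops out near 9.6 TFLOPS and the B300 near 16 TFLOPS, far below either datasheet peak, and the ratio between them follows the bandwidth (8 vs 4.8 TB/s), not the FLOPS. A kernel needs roughly 412 FLOP/byte on the H200 and 1,875 FLOP/byte on the B300 before compute becomes the limit, which is why large batches matter so much for utilization.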

CUDA cores vs Tensor cores

Both are compute units inside the GPU, but they differ in what they compute. CUDA cores are general-purpose: each core performs one scalar operation per clock, and a GPU has thousands of them. They handle any type of computation and various precisions, including FP64. For matrix multiplication, however, they are not the most efficient.

Tensor cores are specialized for one operation — matrix multiply-accumulate (MMA, in the form D = A × B + C). Instead of a scalar, they process an entire small matrix at once, which is precisely the computation that dominates both training and inference of neural networks. A single Tensor core therefore replaces dozens of CUDA cores for this type of task and supports the low precisions FP16/FP8/FP4.
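
To see why a matrix unit saves so much work, the sketch below computes D = A × B + C the way scalar cores would, one fused multiply-add at a time, and counts the operations. It is a plain-Python illustration, not a model of any particular Tensor core: real tile shapes depend on the GPU generation and the precision.

```python
def mma_tile(a, b, c):
    """Return (D, fma_count) for D = A x B + C computed with scalar FMAs."""
    m, k, n = len(a), len(b), len(b[0])
    d = [row[:] for row in c]  # start from the accumulator C
    fma_count = 0
    for i in range(m):
        for j in range(n):
            for p in range(k):
                d[i][j] += a[i][p] * b[p][j]  # one scalar multiply-add
                fma_count += 1
    return d, fma_count


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
D, fmas = mma_tile(A, B, C)
print(D, fmas)  # [[20, 22], [43, 51]] 8
```

A 2×2 tile already costs 8 scalar multiply-adds, and the count grows as m × n × k: a 4×4×4 tile needs 64, a 16×16×16 tile 4,096. A Tensor core issues such a tile as a single hardware operation instead of thousands of individual ones.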

	CUDA cores	Tensor cores
Task	general computation	matrix multiplication
Unit of work	scalar	matrix
Precisions	FP64, FP32, INT…	FP16, FP8, FP4, BF16…
Count per GPU	thousands (H200: 16,896)	hundreds (H200: 528)
Use	HPC, graphics, general code	AI training and inference

This closes the loop. When a datasheet lists “15 PFLOPS FP4,” that is Tensor core performance. The H200's 34 TFLOPS of FP64, by contrast, comes from the general-purpose CUDA-core path (the 67 TFLOPS in the table is its separate FP64 Tensor core mode). The B300 therefore has enormous Tensor performance at low precisions but negligible FP64: NVIDIA equipped it with few double-precision units because it targets AI, not scientific simulation. Tensor cores also evolve with each generation: the H200 carries the 4th generation, the B300 the 5th, which adds support for the new NVFP4 format.

Summary

The B300 offers roughly double the memory, about 1.7× the bandwidth, and several times higher performance at low precisions, but practically no FP64, at twice the power draw. For large-scale LLM inference and reasoning, it is significantly more efficient. The H200 remains the choice where a mixed AI + HPC workload with full double precision is required.
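
The headline ratios can be checked directly against the specification table. The snippet below is a small sketch that divides the B300 figures by the H200 figures; the dictionary keys are illustrative names, and the values are copied from the table.

```python
H200 = {"memory_gb": 141, "bandwidth_tbs": 4.8, "fp64_tflops": 34, "tdp_w": 700}
B300 = {"memory_gb": 288, "bandwidth_tbs": 8.0, "fp64_tflops": 1.25, "tdp_w": 1400}


def ratios(new, old):
    """Per-metric ratio new/old for every key in old."""
    return {key: new[key] / old[key] for key in old}


for key, value in ratios(B300, H200).items():
    print(f"{key:14} B300/H200 = {value:.2f}x")
```

This gives about 2.04× the memory, 1.67× the bandwidth, 0.04× the FP64 throughput, and 2.00× the TDP.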

Values are based on public NVIDIA datasheets and specifications as of 2026. The listed FLOPS are theoretical dense figures and vary in practice depending on the workload.

Monday, June 22, 2026
