Hopper against Blackwell Ultra. This piece covers not just the specs of both accelerators, but also the concepts the specs are built on — FP formats, FLOPS, and the difference between CUDA and Tensor cores. No marketing, real dense figures.
Two generations, two philosophies
The H200 and B300 are a full architecture generation apart. The H200 belongs to the Hopper generation and shares its silicon (the GH100 die) with the H100 — it is essentially an H100 with an upgraded memory subsystem. The B300 (officially Blackwell Ultra) is new silicon with a dual-reticle design: two full dies connected by a high-speed NV-HBI interconnect, totaling 208 billion transistors.
The crucial difference, however, is in intent. The H200 remains a general-purpose accelerator for both AI and HPC — it retains full FP64 performance for scientific simulations. The B300, by contrast, is tuned primarily for inference, reasoning models, and MoE architectures; its FP64 performance is deliberately cut to nearly zero. With this, NVIDIA clearly signals a shift toward AI workloads at the expense of classic HPC.
Specification comparison
Values are per GPU, SXM variant, in dense mode (without 2:4 structured sparsity, which roughly doubles the figures).
| Parameter | H200 (SXM) | B300 (Blackwell Ultra) |
|---|---|---|
| Architecture | Hopper (GH100 die) | Blackwell Ultra (2 dies) |
| Transistors | ~80 B | 208 B |
| Memory | 141 GB HBM3e | 288 GB HBM3e (12-high) |
| Memory bandwidth | 4.8 TB/s | 8 TB/s |
| FP4 (dense) | no native support | ~15 PFLOPS |
| FP8 (dense) | ~1,979 TFLOPS | ~7,000 TFLOPS |
| FP16 / BF16 (dense) | ~989 TFLOPS | ~3,500 TFLOPS |
| FP64 | 34 TFLOPS (Tensor 67) | ~1.25 TFLOPS |
| CUDA cores | 16,896 | 20,480 |
| Tensor cores | 528 (4th gen) | 640 (5th gen) |
| SM | 132 | 160 |
| TDP | 700 W | 1,400 W |
| NVLink | 4th gen, 900 GB/s | 5th gen, 1.8 TB/s |
| PCIe | Gen5 | Gen6 |
| Focus | general AI + HPC | inference / reasoning |
Note: The H200 also comes in an NVL variant (PCIe card, 600 W, no NVSwitch fabric). The B300 is deployed in practice in the rack-scale GB300 NVL72 system — 72 GPUs and 36 Grace CPUs with liquid cooling.
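The 2:4 structured sparsity excluded from the table means that in every group of four consecutive weights, at most two are nonzero. As a rough illustration, the sketch below checks that pattern and enforces it by simple magnitude pruning. The function names and the pruning rule are assumptions made for this example, not NVIDIA's tooling, and real workflows typically fine-tune the model after pruning.

```python
def is_2_4_sparse(weights):
    """Return True if every group of 4 consecutive weights has at most 2 nonzeros."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    return all(
        sum(1 for w in weights[i:i + 4] if w != 0) <= 2
        for i in range(0, len(weights), 4)
    )


def prune_2_4(weights):
    """Zero the two smallest-magnitude weights in each group of 4."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest magnitudes; ties go to the earlier index.
        keep = sorted(range(4), key=lambda j: -abs(group[j]))[:2]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
pruned = prune_2_4(row)
print(pruned)                                   # two survivors per group of four
print(is_2_4_sparse(row), is_2_4_sparse(pruned))
```

Half of the multiplications can then be skipped, which is where the roughly 2× sparse figures come from. A model that was not pruned this way only sees the dense numbers.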
FP formats: why performance is listed per precision
FP stands for floating point, and the number indicates how many bits represent a single number. Each format splits the bits into sign, exponent (range), and mantissa (precision). Fewer bits mean lower precision, but also more operations per second and a smaller memory footprint.
| Format | Bits | Typical use |
|---|---|---|
| FP64 | 64 | Scientific simulations, physics, HPC |
| FP32 | 32 | General GPU compute, older training |
| FP16 | 16 | Training and inference, smaller range |
| BF16 | 16 | Training — larger exponent, more stable |
| FP8 | 8 | Inference, increasingly LLM training too |
| FP4 / NVFP4 | 4 | Extreme quantization, the main Blackwell Ultra format |
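To make the sign/exponent/mantissa split concrete, here is a small sketch that lists the bit layout of each format and the memory its weights would need. The FP8 and FP4 rows assume the common E4M3 and E2M1 variants, and the 70-billion-parameter model is a hypothetical size chosen only for illustration.

```python
# (sign, exponent, mantissa) bit counts. FP64/FP32/FP16 follow IEEE 754;
# for FP8 and FP4 the E4M3 and E2M1 variants are assumed here.
FORMATS = {
    "FP64": (1, 11, 52),
    "FP32": (1, 8, 23),
    "FP16": (1, 5, 10),
    "BF16": (1, 8, 7),
    "FP8 (E4M3)": (1, 4, 3),
    "FP4 (E2M1)": (1, 2, 1),
}

PARAMS = 70e9  # hypothetical model size, not a figure from this article


def total_bits(layout):
    """Total width of a format: sign + exponent + mantissa bits."""
    return sum(layout)


def weights_gb(params, bits):
    """Memory for the raw weights only, in GB (10^9 bytes)."""
    return params * bits / 8 / 1e9


for name, layout in FORMATS.items():
    bits = total_bits(layout)
    print(f"{name:<11} sign/exp/mant = {layout}  {bits:>2} bits  "
          f"{weights_gb(PARAMS, bits):6.0f} GB of weights")
```

Under these assumptions the 16-bit weights alone come to 140 GB, which nearly fills an H200's 141 GB, while the same model at 4 bits needs 35 GB. Block-scaled formats such as NVFP4 also store scale factors, so their real footprint is slightly higher than this raw count.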
This explains the rows in the table above: the same hardware (H200) delivers ~34 TFLOPS in FP64 but ~1,979 TFLOPS in FP8 — lower precision means a simpler operation and more parallelism. The process of converting a model from higher precision to lower is called quantization, and it is why the B300, at 15 PFLOPS in FP4, is so powerful for LLM inference: large models tolerate low precision.
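Quantization itself can be sketched in a few lines. The example below scales a block of weights so that the largest magnitude maps to the top of the FP4 E2M1 value grid, then rounds each weight to the nearest representable value. This is plain absmax block scaling written for illustration; it is not NVIDIA's NVFP4 implementation, and the function names and sample weights are made up.

```python
# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(values):
    """Scale a block so its largest magnitude maps to 6.0, then round to the grid.

    Returns (scale, codes), where each code is a signed grid value.
    """
    absmax = max(abs(v) for v in values)
    scale = absmax / E2M1_GRID[-1] if absmax > 0 else 1.0
    codes = []
    for v in values:
        # Nearest grid magnitude to the scaled value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-mag if v < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Map grid values back to approximate original weights."""
    return [c * scale for c in codes]


weights = [0.12, -0.03, 0.30, -0.21]   # illustrative values
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
print(codes)      # [2.0, -0.5, 6.0, -4.0]
print(restored)   # close to the inputs, with visible rounding error
```

Each weight is now one of only 16 codes plus a shared scale, and the rounding error per weight stays within one scale step. Large models tolerate that error surprisingly well, which is the whole bet behind FP4 inference.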
What FLOPS means
FLOPS stands for floating-point operations per second. Each prefix step is a factor of 1,000: GFLOPS (10^9), TFLOPS or teraflops (10^12), PFLOPS or petaflops (10^15), EFLOPS or exaflops (10^18).
Two distinctions matter when reading a datasheet. First, dense vs sparse: sparse figures tend to be roughly 2× higher but require a specific weight pattern (2:4 structured sparsity), so dense is the more realistic number. Second, theoretical peak vs real-world performance: the listed FLOPS are a theoretical maximum, and actual performance is often lower because the GPU waits on data from memory. That is why, for memory-bound inference, memory bandwidth matters as much as FLOPS, and why the H200's advantage over the H100 comes from its larger, faster memory (4.8 TB/s) rather than from extra compute.
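The gap between peak and real-world performance can be estimated with a simple roofline model: attainable throughput is the smaller of the compute peak and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The sketch below uses the dense FP16/BF16 and bandwidth figures from the table. The two intensities are illustrative assumptions: about 1 FLOP/byte approximates batch-1 decoding with 16-bit weights (one multiply-add per 2-byte weight), while 1,000 stands in for a large batched matrix multiplication.

```python
def attainable_tflops(peak_tflops, bandwidth_tbs, intensity):
    """Roofline bound: min(compute peak, bandwidth * FLOPs per byte moved)."""
    return min(peak_tflops, bandwidth_tbs * intensity)


def ridge_point(peak_tflops, bandwidth_tbs):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_tflops / bandwidth_tbs


# Dense FP16/BF16 peak (TFLOPS) and memory bandwidth (TB/s) from the table.
GPUS = {"H200": (989, 4.8), "B300": (3500, 8.0)}

for name, (peak, bw) in GPUS.items():
    print(f"{name}: ridge at {ridge_point(peak, bw):.1f} FLOP/byte")
    for intensity in (1, 1000):   # assumed workloads, see the text above
        print(f"  intensity {intensity:>4}: "
              f"{attainable_tflops(peak, bw, intensity):.1f} TFLOPS attainable")
```

At 1 FLOP/byte both GPUs are limited to a few TFLOPS, a small fraction of their listed peak, and only bandwidth separates them. Batching raises the intensity and moves the workload toward the compute roof.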
CUDA cores vs Tensor cores
Both are compute units inside the GPU, but they differ in what they compute. CUDA cores are general-purpose: each core performs one scalar operation per clock, and a GPU has thousands of them. They handle any type of computation and various precisions, including FP64. For matrix multiplication, however, they are not the most efficient.
Tensor cores are specialized for one operation: matrix multiply-accumulate (MMA, in the form D = A × B + C). Instead of a scalar, they process an entire small matrix at once, which is precisely the computation that dominates both training and inference of neural networks. For this type of task a single Tensor core therefore does the work of dozens of CUDA cores, and it supports low precisions such as FP16, FP8 and, from Blackwell on, FP4.
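To see what a Tensor core replaces, here is D = A × B + C written the way scalar cores would have to do it, one multiply-add at a time, with a counter for those operations. This is a plain Python sketch for illustration; the 16×16×16 tile in the note below is an assumed size, not a documented hardware tile shape.

```python
def mma(a, b, c):
    """Compute D = A x B + C with scalar multiply-adds, counting them."""
    m, k, n = len(a), len(b), len(b[0])
    d = [row[:] for row in c]          # start from the accumulator C
    multiply_adds = 0
    for i in range(m):
        for j in range(n):
            for p in range(k):
                d[i][j] += a[i][p] * b[p][j]   # one scalar multiply-add
                multiply_adds += 1
    return d, multiply_adds


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 0], [0, 1]]
d, count = mma(a, b, c)
print(d)      # [[20, 22], [43, 51]]
print(count)  # 8 multiply-adds for a 2x2x2 problem
```

The count grows as m × n × k, so a 16×16×16 tile already needs 4,096 scalar multiply-adds. A matrix unit issues that work as a single tile operation instead of thousands of scalar instructions.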
| | CUDA cores | Tensor cores |
|---|---|---|
| Task | general computation | matrix multiplication |
| Unit of work | scalar | matrix |
| Precisions | FP64, FP32, INT… | FP16, FP8, FP4, BF16… |
| Count per GPU | thousands (H200: 16,896) | hundreds (H200: 528) |
| Use | HPC, graphics, general code | AI training and inference |
This closes the loop. When a datasheet lists “15 PFLOPS FP4,” that is Tensor core performance. The H200's 34 TFLOPS FP64 figure, by contrast, is the non-Tensor rate from the general-purpose units (its FP64 Tensor rate is 67 TFLOPS). The B300 therefore has enormous Tensor performance at low precisions but negligible FP64: NVIDIA equipped it with few double-precision units because it targets AI, not scientific simulation. Tensor cores also evolve with each generation: the H200 carries the 4th, the B300 the 5th with support for the new NVFP4 format.
Summary
The B300 offers roughly double the memory (288 GB vs 141 GB), about 1.7× the bandwidth, and about 3.5× the dense FP8 and FP16 throughput, but practically zero FP64 and twice the power draw. For large-scale LLM inference and reasoning, that works out to noticeably more low-precision throughput per watt. The H200 remains the choice where a mixed AI + HPC workload with full double precision is required.
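The ratios behind this summary can be recomputed directly from the specification table. The sketch below does that, including a throughput-per-watt comparison based on TDP. The dictionary keys and helper names are invented for this example, and TDP is only a rough proxy for actual power draw.

```python
# Per-GPU dense figures copied from the specification table.
H200 = {"memory_gb": 141, "bandwidth_tbs": 4.8, "fp8_tflops": 1979,
        "fp16_tflops": 989, "fp64_tflops": 34, "tdp_w": 700}
B300 = {"memory_gb": 288, "bandwidth_tbs": 8.0, "fp8_tflops": 7000,
        "fp16_tflops": 3500, "fp64_tflops": 1.25, "tdp_w": 1400}


def ratios(new, old):
    """Ratio new/old for every key in the spec dictionaries."""
    return {key: new[key] / old[key] for key in old}


def per_watt_gain(new, old, key):
    """How much more of `key` the new GPU delivers per watt of TDP."""
    return (new[key] / new["tdp_w"]) / (old[key] / old["tdp_w"])


for key, value in ratios(B300, H200).items():
    print(f"{key:<14} {value:5.2f}x")
print(f"FP8 per watt   {per_watt_gain(B300, H200, 'fp8_tflops'):5.2f}x")
```

On these paper figures the B300 delivers about 1.77× the dense FP8 throughput per watt, while its FP64 rate falls to a few percent of the H200's.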
Values are based on public NVIDIA datasheets and specifications as of 2026. The listed FLOPS are theoretical dense figures and vary in practice depending on the workload.

