The NVIDIA Stack Hyperscalers Buy (2026): Hardware + Networking + Software, Generation by Generation
When people say “hyperscalers buy NVIDIA,” they don’t mean “a few GPUs.”
They mean a full, vertically optimized AI factory stack: GPUs, CPUs, interconnect, networking, DPUs/NICs, and the software layer that turns thousands of devices into one training/inference machine.
Below is the stack from silicon → rack → cluster → software, plus how it evolved from A100 → H100 → GH200 → GB200/GB300 → Rubin, and what each generation is typically used for.
1) The full NVIDIA stack (what’s actually being sold)
Layer 1: Compute (GPU + CPU)
GPUs (the engine): A100 (Ampere), H100 (Hopper), Blackwell (B200; GB200 pairs Blackwell GPUs with a Grace CPU), Blackwell Ultra (GB300), then Rubin (Vera Rubin platform).
CPUs (the feeder/orchestrator): Grace CPU in the Grace Hopper / Grace Blackwell systems; Vera CPU in the Rubin platform.
Layer 2: Scale-up fabric (inside a node / inside a rack)
This is what lets many GPUs behave like “one big GPU”:
NVLink / NVLink Switch System in rack-scale designs like GB200 NVL72 (and successors). Example: GB200 NVL72 highlights a 72-GPU NVLink domain with 130 TB/s of aggregate GPU-to-GPU bandwidth.
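To make "one big GPU" concrete, here is a minimal sketch (assuming a multi-GPU NVIDIA node with PyTorch installed; the tensor size is an arbitrary illustration) that checks GPU peer-to-peer access and times a raw device-to-device copy:

```python
# Minimal sketch: inspect the scale-up fabric from software.
# Assumes a node with 2+ NVIDIA GPUs and PyTorch; purely illustrative,
# real NVLink diagnostics would use nvidia-smi topo or NCCL tests.
import time
import torch

n = torch.cuda.device_count()
print(f"GPUs visible: {n}")

# Which GPU pairs can talk peer-to-peer (NVLink or PCIe P2P)?
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")

if n >= 2:
    x = torch.randn(1024, 1024, 256, device="cuda:0")  # ~1 GiB of FP32
    torch.cuda.synchronize("cuda:0")
    t0 = time.time()
    y = x.to("cuda:1")                                  # raw device-to-device copy
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    gib = x.numel() * x.element_size() / 2**30
    print(f"copied {gib:.1f} GiB in {time.time() - t0:.3f} s")
```

On NVLink-connected GPUs that copy runs far faster than over plain PCIe, which is the whole point of the scale-up fabric.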
Layer 3: Scale-out fabric (between nodes / across racks)
This is what makes 1,000–100,000 GPUs act like one cluster:
InfiniBand (NVIDIA Quantum line) and now Quantum-X800 for 800G-class fabrics.
Ethernet for AI (Spectrum-X), tuned for large GPU clusters (NVIDIA claims improved efficiency and performance at very large scales).
Layer 4: NICs / DPUs (the “infrastructure offload” layer)
ConnectX SuperNICs (cluster endpoint networking; now up to 800 Gb/s in ConnectX-8).
BlueField DPUs (offload networking, storage, security; BlueField-3 supports 400 Gb/s connectivity).
Layer 5: Software stack (the real lock-in)
This is why “NVIDIA performance” often means more than raw FLOPS:
CUDA + libraries (the foundation)
Training & multi-GPU comms (NCCL collectives, optimized kernels, Transformer Engine, etc.)
Inference stack: TensorRT, TensorRT-LLM, Triton Inference Server, NIM microservices for packaged deployment
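As a small, concrete example of that software layer in use, here is a minimal client-side sketch against Triton Inference Server (assumes tritonclient is installed, a server on localhost:8000, and a hypothetical model "my_model" with tensors INPUT0/OUTPUT0; adjust to your own model config):

```python
# Minimal sketch: one inference request to a Triton Inference Server.
# The model name and tensor names below are hypothetical placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request payload for the (hypothetical) model signature.
data = np.random.rand(1, 16).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)
out = httpclient.InferRequestedOutput("OUTPUT0")

# One round trip; Triton handles batching, scheduling, and GPU execution.
result = client.infer(model_name="my_model", inputs=[inp], outputs=[out])
print(result.as_numpy("OUTPUT0"))
```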
2) What hyperscalers use this stack for (practically)
A) Training frontier models (LLMs, multimodal, MoE)
Needs: maximum GPU utilization, fast GPU↔GPU comms, high HBM bandwidth/capacity, efficient scale-out networking.
The stack: NVLink scale-up + InfiniBand/Ethernet scale-out + optimized training software.
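A minimal sketch of what that looks like from the training software side: a gradient-style all-reduce over NCCL via torch.distributed (assumes PyTorch on NVIDIA GPUs, launched with torchrun; the tensor is just a stand-in for a gradient bucket):

```python
# Minimal sketch: multi-GPU collective communication with NCCL.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_file.py
# NCCL uses NVLink inside the node and InfiniBand/Ethernet across nodes.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # NCCL = NVIDIA's collectives library
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # Stand-in for one gradient bucket on this rank's GPU.
    grad = torch.full((1024, 1024), float(rank), device="cuda")

    dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # sum gradients across all ranks
    if rank == 0:
        print("after all_reduce, grad[0, 0] =", grad[0, 0].item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```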
B) Inference at scale (“tokens per dollar”)
Needs: memory capacity (KV cache), high throughput, predictable latency, and strong multi-tenancy.
This is why newer platforms emphasize bigger HBM and “AI factory” inference throughput (e.g., Blackwell Ultra memory scaling, rack-scale systems).
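A back-of-envelope sketch of why that memory emphasis matters: estimating KV-cache size for a hypothetical long-context deployment (every number below is an illustrative assumption, not a published spec):

```python
# Back-of-envelope sketch: KV-cache memory for long-context serving.
# All model/config numbers are hypothetical, chosen only for illustration.
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for the K and V tensors per layer; FP16/BF16 = 2 bytes per element.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch / 2**30

# Hypothetical dense model: 80 layers, 8 KV heads (GQA), head_dim 128,
# serving 8 concurrent 128k-token contexts.
print(kv_cache_gib(layers=80, kv_heads=8, head_dim=128,
                   seq_len=128_000, batch=8))   # ~312 GiB of KV cache alone
```

Hundreds of GiB of KV cache before the weights even enter the picture is why HBM capacity and rack-scale NVLink domains dominate the inference story.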
C) Data analytics + retrieval + recommender systems
Needs: throughput, memory bandwidth, and fast networking to keep pipelines moving.
Often paired with DPUs for storage/network offload.
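A tiny sketch of the GPU analytics piece using RAPIDS cuDF (assumes cuDF is installed on a CUDA machine; the columns and values are made up):

```python
# Minimal sketch: pandas-style analytics that runs on the GPU via cuDF.
import cudf

df = cudf.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "item_id": [10, 11, 10, 12, 11],
    "watch_s": [120.0, 30.0, 45.0, 300.0, 12.0],
})

# GroupBy/aggregation executes on the GPU; the same pattern scales to
# recommender feature pipelines feeding GPU training and inference.
per_user = df.groupby("user_id")["watch_s"].sum().sort_index()
print(per_user)
```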
D) Sovereign AI / private AI clouds
Same stack, but sold as an “AI supercomputer-in-a-box” concept (rack-scale systems + management + security).
3) Generation cheat sheet: what changed each time
Ampere (A100): “modern tensor acceleration becomes standard”
Typical role: training and inference in the "classic" transformer era, plus broad HPC.
Example specs: up to 80 GB of HBM2e and roughly 2 TB/s of memory bandwidth on the SXM variant.
Hopper (H100): “transformer-era acceleration goes hardcore”
Typical role: big jump for transformer workloads, better perf-per-watt, more mature large-model toolchain.
Example: FP8 Tensor Core support via the Transformer Engine, plus higher memory bandwidth (H100 product specs list 3.35–3.9 TB/s depending on the variant).
Grace Hopper (GH200): “CPU+GPU tight coupling for memory-heavy AI”
Typical role: models/workloads that are memory-capacity and memory-bandwidth constrained.
Example: GH200 NVL2 positioning highlights 288 GB of HBM3e and ~10 TB/s of memory bandwidth in that dual-superchip configuration.
Blackwell rack-scale (GB200 NVL72): “the rack becomes the computer”
Typical role: real-time trillion-parameter inference and large training clusters, where scale-up matters a lot.
Example: 72 Blackwell GPUs with the NVLink Switch System providing 130 TB/s of aggregate GPU-to-GPU bandwidth.
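A quick sanity check on that headline number, assuming NVLink 5's commonly cited 1.8 TB/s of bandwidth per GPU:

```python
# 72 GPUs x 1.8 TB/s per GPU lands right at the quoted 130 TB/s figure.
gpus, tb_per_gpu = 72, 1.8
print(f"{gpus * tb_per_gpu:.1f} TB/s aggregate NVLink bandwidth")  # 129.6 TB/s
```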
Blackwell Ultra (GB300 NVL72): “more memory + better reasoning throughput”
Typical role: long-context inference and high-concurrency serving (“AI reasoning throughput” emphasis).
In the GB300 NVL72 context, NVIDIA positions Blackwell Ultra as offering 1.5× the HBM3e capacity, plus additional AI compute, over its predecessor.
Rubin platform (Vera Rubin NVL144): "6 co-designed chips = one AI supercomputer"
This is NVIDIA making the “platform” explicit:
Vera CPU
Rubin GPU
NVLink 6
ConnectX-9 SuperNICs
BlueField-4 DPUs
(plus the ecosystem around it)
NVIDIA describes Vera Rubin NVL144 as a single, unified system combining those parts.
4) Networking: why it’s half the story now
Once you scale beyond a single node/rack, networking determines whether you get:
90%+ utilization
or a very expensive space heater.
Key pieces hyperscalers buy:
Quantum InfiniBand (the Quantum-2 400G-class generation, now Quantum-X800 with 800 Gb/s connectivity and high switch radix)
Spectrum-X Ethernet (standards-based Ethernet tuned/validated for AI clouds; NVIDIA claims improved efficiency at huge scale)
ConnectX SuperNICs at the endpoints (up to 800 Gb/s in ConnectX-8)
BlueField DPUs to offload networking/storage/security (BlueField-3 supports 400 Gb/s)
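To see why these parts decide utilization, here is a back-of-envelope sketch (all numbers are illustrative assumptions, and real clusters use hierarchical, overlapped collectives) comparing ring-all-reduce time with per-step compute at two NIC line rates:

```python
# Back-of-envelope sketch: how NIC bandwidth turns into cluster utilization.
# Numbers are hypothetical; real systems overlap comms with compute.
def ring_all_reduce_seconds(grad_bytes, n_gpus, link_gb_per_s):
    # Classic ring all-reduce: each GPU sends/receives ~2*(N-1)/N of the data.
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return traffic / (link_gb_per_s * 1e9)

grad_bytes = 20e9        # ~20 GB of BF16 gradients (roughly a 10B-parameter model)
step_compute_s = 1.0     # assumed forward+backward time per step

for line_rate in (400, 800):   # NIC line rate in Gb/s (400G vs 800G class)
    t_comm = ring_all_reduce_seconds(grad_bytes, n_gpus=1024,
                                     link_gb_per_s=line_rate / 8)
    worst_case_util = step_compute_s / (step_compute_s + t_comm)
    print(f"{line_rate} Gb/s NIC: all-reduce ~{t_comm:.2f} s, "
          f"no-overlap utilization ~{worst_case_util:.0%}")
```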
5) “Performance” in 2026 is not one number
Hyperscalers care less about peak TFLOPS and more about:
time-to-train (weeks vs months)
tokens per dollar
tokens per watt (a back-of-envelope sketch follows this list)
latency at P99 (tail latency)
cluster efficiency at scale (10k–100k GPU behavior)
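To make "tokens per dollar" and "tokens per watt" concrete, a purely illustrative calculation (hypothetical throughput, power, and cost; not vendor benchmarks):

```python
# Illustrative arithmetic only: how throughput, power, and hourly cost
# turn into tokens-per-watt and tokens-per-dollar. All inputs are assumptions.
tokens_per_second = 30_000    # assumed aggregate decode throughput
power_watts = 8_000           # assumed power draw of the serving instance
cost_per_hour_usd = 40.0      # assumed all-in hourly cost

print(f"tokens per watt:   {tokens_per_second / power_watts:.2f} tokens/s per W")
print(f"tokens per dollar: {tokens_per_second * 3600 / cost_per_hour_usd:,.0f}")
# A generation that doubles throughput at similar power and cost roughly
# doubles both metrics, which is what the spec-sheet deltas are chasing.
```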
That’s why each generation push looks like:
more/better Tensor Core modes (FP8/FP4 focus),
more HBM and better memory bandwidth,
better scale-up (NVLink/NVSwitch),
better scale-out (InfiniBand/Ethernet + NICs),
more packaged software (TensorRT/Triton/NIM) to turn hardware into usable production systems.
6) The bottom line
NVIDIA isn’t selling “a GPU.”
NVIDIA is selling a full AI data center architecture, where each part exists to keep the GPUs productive at insane scale.
That’s why hyperscalers buy:
rack-scale systems (NVL72 class),
networking fabrics (InfiniBand/Ethernet),
NICs/DPUs,
and the inference/training software layer.
Because in 2026, the winning product is not a chip.
It’s the AI factory.