The NVIDIA Stack Hyperscalers Buy (2026): Hardware + Networking + Software, Generation by Generation

When people say “hyperscalers buy NVIDIA,” they don’t mean “a few GPUs.”

They mean a full, vertically-optimized AI factory stack: GPUs, CPUs, interconnect, networking, DPUs/NICs, and the software layer that turns thousands of devices into one training/inference machine.

Below is the stack from silicon → rack → cluster → software, plus how it evolved from A100 → H100 → GH200 → GB200/GB300 → Rubin, and what each generation is typically used for.

1) The full NVIDIA stack (what’s actually being sold)

Layer 1: Compute (GPU + CPU)

  • GPUs (the engine): A100 (Ampere), H100 (Hopper), Blackwell (the B200 GPU, plus the GB200 Grace Blackwell superchip), Blackwell Ultra (GB300), then Rubin (the Vera Rubin platform).

  • CPUs (the feeder/orchestrator): Grace CPU in the Grace Hopper / Grace Blackwell systems; Vera CPU in the Rubin platform.

Layer 2: Scale-up fabric (inside a node / inside a rack)

This is what lets many GPUs behave like “one big GPU”:

  • NVLink / NVLink Switch System in rack-scale designs like GB200 NVL72 (and successors). Example: GB200 NVL72 links 72 Blackwell GPUs into a single NVLink domain with 130 TB/s of aggregate GPU-to-GPU bandwidth.
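A quick way to see the scale-up layer from software (a minimal sketch, assuming a multi-GPU node with CUDA-enabled PyTorch): check which GPUs can reach each other peer-to-peer. Peer access is the capability NVLink exposes to frameworks, though this check alone doesn't distinguish NVLink from plain PCIe P2P.

```python
# Minimal sketch: inspect GPU-to-GPU peer access inside one node with PyTorch.
# Peer access is what scale-up fabrics like NVLink expose to software; this
# check does not tell you NVLink vs PCIe, only what the framework can see.
# Assumes a multi-GPU node with CUDA-enabled PyTorch installed.
import torch

def peer_access_matrix() -> None:
    n = torch.cuda.device_count()
    for src in range(n):
        peers = [dst for dst in range(n)
                 if dst != src and torch.cuda.can_device_access_peer(src, dst)]
        print(f"GPU {src} ({torch.cuda.get_device_name(src)}) "
              f"has peer access to GPUs: {peers}")

if __name__ == "__main__":
    peer_access_matrix()
```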

Layer 3: Scale-out fabric (between nodes / across racks)

This is what makes 1,000–100,000 GPUs act like one cluster:

  • InfiniBand (NVIDIA Quantum line) and now Quantum-X800 for 800G-class fabrics.

  • Ethernet for AI (Spectrum-X), tuned for large GPU clusters (NVIDIA claims improved efficiency and performance at very large scales).

Layer 4: NICs / DPUs (the “infrastructure offload” layer)

  • ConnectX SuperNICs (cluster endpoint networking; now up to 800 Gb/s in ConnectX-8).

  • BlueField DPUs (offload networking, storage, and security; BlueField-3 supports 400 Gb/s connectivity).

Layer 5: Software stack (the real lock-in)

This is why “NVIDIA performance” often means more than raw FLOPS:

  • CUDA + libraries (the foundation)

  • Training & multi-GPU communication: NCCL collectives, optimized kernels (cuBLAS/cuDNN), Transformer Engine for FP8 training, etc. (see the sketch after this list)

  • Inference stack: TensorRT, TensorRT-LLM, Triton Inference Server, NIM microservices for packaged deployment
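To make the "collectives" bullet concrete, here is a minimal sketch (not NVIDIA's own example) of the pattern every large training job leans on: an all-reduce across GPUs using PyTorch's distributed API over the NCCL backend. The tensor and sizes are illustrative; a real job wraps this inside DDP/FSDP.

```python
# Minimal sketch of the collective pattern large-scale training depends on:
# an all-reduce across GPUs using the NCCL backend via torch.distributed.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_file.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    # torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; all_reduce sums it across every GPU.
    # Intra-node this traffic rides NVLink; inter-node it rides IB/Ethernet.
    x = torch.ones(4, device="cuda") * (dist.get_rank() + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```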

2) What hyperscalers use this stack for (practically)

A) Training frontier models (LLMs, multimodal, MoE)

  • Needs: maximum GPU utilization, fast GPU↔GPU comms, high HBM bandwidth/capacity, efficient scale-out networking.

  • The stack: NVLink scale-up + InfiniBand/Ethernet scale-out + optimized training software.

B) Inference at scale (“tokens per dollar”)

  • Needs: memory capacity (KV cache), high throughput, predictable latency, and strong multi-tenancy.

  • This is why newer platforms emphasize bigger HBM and “AI factory” inference throughput (e.g., Blackwell Ultra memory scaling, rack-scale systems).
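A rough back-of-envelope (a sketch with illustrative model dimensions, not vendor figures) shows why KV-cache capacity dominates serving economics: bytes per token scale with layers × KV heads × head dimension, and long contexts multiply that by the full sequence length.

```python
# Rough sketch: why HBM capacity gates concurrent long-context serving.
# KV cache per token ≈ 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes.
# The model shape below is illustrative (roughly a 70B-class GQA model).
def kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128,
                bytes_per_elem=2, context_len=128_000) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1e9

per_seq = kv_cache_gb()
print(f"KV cache per 128k-token sequence: ~{per_seq:.1f} GB")
print(f"Sequences that fit in 80 GB of free HBM: ~{int(80 // per_seq)}")
```

Under those assumptions a single 128k-token sequence eats tens of gigabytes of HBM, which is exactly why bigger memory per rack translates directly into concurrency and tokens per dollar.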

C) Data analytics + retrieval + recommender systems

  • Needs: throughput, memory bandwidth, and fast networking to keep pipelines moving.

  • Often paired with DPUs for storage/network offload.

D) Sovereign AI / private AI clouds

  • Same stack, but sold as an “AI supercomputer-in-a-box” concept (rack-scale systems + management + security).

3) Generation cheat sheet: what changed each time

Ampere (A100): “modern tensor acceleration becomes standard”

  • Typical role: training and inference for “classic” transformer era, broad HPC.

  • Example specs: up to 80 GB of HBM2e and roughly 2 TB/s of memory bandwidth on the SXM variant.

Hopper (H100): “transformer-era acceleration goes hardcore”

  • Typical role: big jump for transformer workloads, better perf-per-watt, more mature large-model toolchain.

  • Example: FP8 Tensor Core support (via the Transformer Engine) and higher memory bandwidth; H100 product specs list 3.35–3.9 TB/s depending on the variant (SXM vs. NVL).

Grace Hopper (GH200): “CPU+GPU tight coupling for memory-heavy AI”

  • Typical role: models/workloads that are memory-capacity and memory-bandwidth constrained.

  • Example: the dual-superchip GH200 NVL2 configuration highlights 288 GB of HBM and ~10 TB/s of aggregate memory bandwidth.

Blackwell rack-scale (GB200 NVL72): “the rack becomes the computer”

  • Typical role: real-time trillion-parameter inference and large training clusters, where scale-up matters a lot.

  • Example: 72 Blackwell GPUs in one NVLink domain, with the NVLink Switch System providing 130 TB/s of aggregate GPU-to-GPU bandwidth.

Blackwell Ultra (GB300 NVL72): “more memory + better reasoning throughput”

  • Typical role: long-context inference and high-concurrency serving (“AI reasoning throughput” emphasis).

  • NVIDIA positions Blackwell Ultra (in the GB300 NVL72) as 1.5× the HBM3e capacity plus added AI compute relative to its predecessor.

Rubin platform (Vera Rubin NVL72): “6 co-designed chips = one AI supercomputer”

This is NVIDIA making the “platform” explicit:

  • Vera CPU

  • Rubin GPU

  • NVLink 6

  • ConnectX-9 SuperNICs

  • BlueField-4 DPUs

  • (plus the rest of the co-designed networking/switch silicon and the software ecosystem around it)

NVIDIA describes Vera Rubin NVL72 as a unified system combining those parts.

4) Networking: why it’s half the story now

Once you scale beyond a single node/rack, networking determines whether you get:

  • 90%+ utilization, or

  • a very expensive space heater.

Key pieces hyperscalers buy:

  • Quantum / InfiniBand (e.g., Quantum-2 400G-class era, now Quantum-X800 with 800 Gb/s connectivity and high radix)

  • Spectrum-X Ethernet (standards-based Ethernet tuned/validated for AI clouds; NVIDIA claims improved efficiency at huge scale)

  • ConnectX SuperNICs at the endpoints (up to 800 Gb/s in ConnectX-8)

  • BlueField DPUs to offload networking/storage/security (BlueField-3 supports 400 Gb/s)
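To see why fabric bandwidth is half the story, here is a crude back-of-envelope (all numbers below are illustrative assumptions; it ignores overlap with compute and in-network reduction such as SHARP): a ring all-reduce pushes roughly 2·(N−1)/N of the gradient bytes over each GPU's network link, so per-GPU link speed directly bounds how fast gradients can be exchanged.

```python
# Back-of-envelope: why scale-out bandwidth decides utilization.
# A ring all-reduce moves ~2*(N-1)/N of the gradient bytes through each link,
# so communication time per step is roughly that volume / link bandwidth.
# All numbers below are illustrative assumptions, not measured figures.
def allreduce_seconds(grad_bytes: float, n_ranks: int, link_gbps: float) -> float:
    volume = 2 * (n_ranks - 1) / n_ranks * grad_bytes   # bytes over each link
    return volume / (link_gbps * 1e9 / 8)               # Gb/s -> bytes/s

grad_bytes = 70e9 * 2          # e.g. 70B parameters' gradients in BF16
for gbps in (200, 400, 800):   # per-GPU network bandwidth classes
    t = allreduce_seconds(grad_bytes, n_ranks=1024, link_gbps=gbps)
    print(f"{gbps} Gb/s per GPU -> ~{t:.2f} s of pure comms per full all-reduce")
```

Doubling the per-GPU link speed roughly halves that communication time, which is the whole argument for 800G-class fabrics once clusters reach tens of thousands of GPUs.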

5) “Performance” in 2026 is not one number

Hyperscalers care less about peak TFLOPS and more about:

  • time-to-train (weeks vs months)

  • tokens per dollar

  • tokens per watt

  • latency at P99 (tail latency)

  • cluster efficiency at scale (10k–100k GPU behavior)
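For concreteness, a minimal sketch of how these unit economics fall out of measured numbers (every input below is a hypothetical placeholder, not a benchmark): "tokens per watt" in practice means throughput per watt, i.e., tokens per joule.

```python
# Minimal sketch of the unit economics in the list above. Every input is a
# hypothetical placeholder; plug in measured throughput, power, and cost.
def tokens_per_dollar(tokens_per_sec: float, cost_per_hour: float) -> float:
    return tokens_per_sec * 3600 / cost_per_hour

def tokens_per_joule(tokens_per_sec: float, avg_watts: float) -> float:
    # "tokens per watt" colloquially = throughput per watt = tokens per joule
    return tokens_per_sec / avg_watts

# Hypothetical single-rack serving numbers:
tps, usd_per_hr, watts = 12_000, 45.0, 8_000
print(f"tokens per dollar: {tokens_per_dollar(tps, usd_per_hr):,.0f}")
print(f"tokens per joule:  {tokens_per_joule(tps, watts):.2f}")
```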

That’s why each generation push looks like:

  1. more/better Tensor Core modes (FP8/FP4 focus),

  2. more HBM and better memory bandwidth,

  3. better scale-up (NVLink/NVSwitch),

  4. better scale-out (InfiniBand/Ethernet + NICs),

  5. more packaged software (TensorRT/Triton/NIM) to turn hardware into usable production systems.

6) The bottom line

NVIDIA isn’t selling “a GPU.”
NVIDIA is selling a full AI data center architecture, where each part exists to keep the GPUs productive at insane scale.

That’s why hyperscalers buy:

  • rack-scale systems (NVL72 class),

  • networking fabrics (InfiniBand/Ethernet),

  • NICs/DPUs,

  • and the inference/training software layer.

Because in 2026, the winning product is not a chip.

It’s the AI factory.

Sorca Marian

Founder, CEO & CTO of Self-Manager.net & abZGlobal.net | Senior Software Engineer

https://self-manager.net/