The NVIDIA Stack Hyperscalers Buy (2026): Hardware + Networking + Software, Generation by Generation

When people say “hyperscalers buy NVIDIA,” they don’t mean “a few GPUs.”

They mean a full, vertically-optimized AI factory stack: GPUs, CPUs, interconnect, networking, DPUs/NICs, and the software layer that turns thousands of devices into one training/inference machine.

Below is the stack from silicon → rack → cluster → software, plus how it evolved from A100 → H100 → GH200 → GB200/GB300 → Rubin, and what each generation is typically used for.

1) The full NVIDIA stack (what’s actually being sold)

Layer 1: Compute (GPU + CPU)

  • GPUs (the engine): A100 (Ampere), H100 (Hopper), Blackwell (the B200 GPU, plus the GB200 Grace Blackwell superchip), Blackwell Ultra (GB300), then Rubin (the Vera Rubin platform).

  • CPUs (the feeder/orchestrator): Grace CPU in the Grace Hopper / Grace Blackwell systems; Vera CPU in the Rubin platform.

Layer 2: Scale-up fabric (inside a node / inside a rack)

This is what lets many GPUs behave like “one big GPU”:

  • NVLink / NVLink Switch System in rack-scale designs like GB200 NVL72 (and successors). Example: GB200 NVL72 links 72 Blackwell GPUs into a single NVLink domain with 130 TB/s of aggregate GPU-to-GPU bandwidth.
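A quick way to see the scale-up layer from software (a minimal sketch, assuming a multi-GPU node with CUDA-enabled PyTorch): check which GPUs can reach each other peer-to-peer. Peer access is the capability NVLink exposes to frameworks, though this check alone doesn't distinguish NVLink from plain PCIe P2P.

```python
# Minimal sketch: inspect GPU-to-GPU peer access inside one node with PyTorch.
# Peer access is what scale-up fabrics like NVLink expose to software; this
# check does not tell you NVLink vs PCIe, only what the framework can see.
# Assumes a multi-GPU node with CUDA-enabled PyTorch installed.
import torch

def peer_access_matrix() -> None:
    n = torch.cuda.device_count()
    for src in range(n):
        peers = [dst for dst in range(n)
                 if dst != src and torch.cuda.can_device_access_peer(src, dst)]
        print(f"GPU {src} ({torch.cuda.get_device_name(src)}) "
              f"has peer access to GPUs: {peers}")

if __name__ == "__main__":
    peer_access_matrix()
```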

Layer 3: Scale-out fabric (between nodes / across racks)

This is what makes 1,000–100,000 GPUs act like one cluster:

  • InfiniBand (NVIDIA Quantum line) and now Quantum-X800 for 800G-class fabrics.

  • Ethernet for AI (Spectrum-X), tuned for large GPU clusters (NVIDIA claims improved efficiency and performance at very large scales).

Layer 4: NICs / DPUs (the “infrastructure offload” layer)

  • ConnectX SuperNICs (cluster endpoint networking; now up to 800 Gb/s in ConnectX-8).

  • BlueField DPUs (offload networking, storage, and security; BlueField-3 supports 400 Gb/s connectivity).

Layer 5: Software stack (the real lock-in)

This is why “NVIDIA performance” often means more than raw FLOPS:

  • CUDA + libraries (the foundation)

  • Training & multi-GPU communication: NCCL collectives, optimized kernels (cuBLAS/cuDNN), Transformer Engine for FP8 training, etc. (see the sketch after this list)

  • Inference stack: TensorRT, TensorRT-LLM, Triton Inference Server, NIM microservices for packaged deployment
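To make the "collectives" bullet concrete, here is a minimal sketch (not NVIDIA's own example) of the pattern every large training job leans on: an all-reduce across GPUs using PyTorch's distributed API over the NCCL backend. The tensor and sizes are illustrative; a real job wraps this inside DDP/FSDP.

```python
# Minimal sketch of the collective pattern large-scale training depends on:
# an all-reduce across GPUs using the NCCL backend via torch.distributed.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_file.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    # torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; all_reduce sums it across every GPU.
    # Intra-node this traffic rides NVLink; inter-node it rides IB/Ethernet.
    x = torch.ones(4, device="cuda") * (dist.get_rank() + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```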

2) What hyperscalers use this stack for (practically)

A) Training frontier models (LLMs, multimodal, MoE)

  • Needs: maximum GPU utilization, fast GPU↔GPU comms, high HBM bandwidth/capacity, efficient scale-out networking.

  • The stack: NVLink scale-up + InfiniBand/Ethernet scale-out + optimized training software.

B) Inference at scale (“tokens per dollar”)

  • Needs: memory capacity (KV cache), high throughput, predictable latency, and strong multi-tenancy.

  • This is why newer platforms emphasize bigger HBM and “AI factory” inference throughput (e.g., Blackwell Ultra memory scaling, rack-scale systems).
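A rough back-of-envelope (a sketch with illustrative model dimensions, not vendor figures) shows why KV-cache capacity dominates serving economics: bytes per token scale with layers × KV heads × head dimension, and long contexts multiply that by the full sequence length.

```python
# Rough sketch: why HBM capacity gates concurrent long-context serving.
# KV cache per token ≈ 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes.
# The model shape below is illustrative (roughly a 70B-class GQA model).
def kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128,
                bytes_per_elem=2, context_len=128_000) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len / 1e9

per_seq = kv_cache_gb()
print(f"KV cache per 128k-token sequence: ~{per_seq:.1f} GB")
print(f"Sequences that fit in 80 GB of free HBM: ~{int(80 // per_seq)}")
```

Under those assumptions a single 128k-token sequence eats tens of gigabytes of HBM, which is exactly why bigger memory per rack translates directly into concurrency and tokens per dollar.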

C) Data analytics + retrieval + recommender systems

  • Needs: throughput, memory bandwidth, and fast networking to keep pipelines moving.

  • Often paired with DPUs for storage/network offload.

D) Sovereign AI / private AI clouds

  • Same stack, but sold as an “AI supercomputer-in-a-box” concept (rack-scale systems + management + security).

3) Generation cheat sheet: what changed each time

Ampere (A100): “modern tensor acceleration becomes standard”

  • Typical role: training and inference for “classic” transformer era, broad HPC.

  • Example specs: up to 80 GB of HBM2e and roughly 2 TB/s of memory bandwidth on the SXM variant.

Hopper (H100): “transformer-era acceleration goes hardcore”

  • Typical role: big jump for transformer workloads, better perf-per-watt, more mature large-model toolchain.

  • Example: FP8 Tensor Core support (via the Transformer Engine) and higher memory bandwidth; H100 product specs list 3.35–3.9 TB/s depending on the variant (SXM vs. NVL).

Grace Hopper (GH200): “CPU+GPU tight coupling for memory-heavy AI”

  • Typical role: models/workloads that are memory-capacity and memory-bandwidth constrained.

  • Example: the dual-superchip GH200 NVL2 configuration highlights 288 GB of HBM and ~10 TB/s of aggregate memory bandwidth.

Blackwell rack-scale (GB200 NVL72): “the rack becomes the computer”

  • Typical role: real-time trillion-parameter inference and large training clusters, where scale-up matters a lot.

  • Example: 72 Blackwell GPUs in one NVLink domain, with the NVLink Switch System providing 130 TB/s of aggregate GPU-to-GPU bandwidth.

Blackwell Ultra (GB300 NVL72): “more memory + better reasoning throughput”

  • Typical role: long-context inference and high-concurrency serving (“AI reasoning throughput” emphasis).

  • NVIDIA positions Blackwell Ultra (in the GB300 NVL72) as 1.5× the HBM3e capacity plus added AI compute relative to its predecessor.

Rubin platform (Vera Rubin NVL72): “6 co-designed chips = one AI supercomputer”

This is NVIDIA making the “platform” explicit:

  • Vera CPU

  • Rubin GPU

  • NVLink 6

  • ConnectX-9 SuperNICs

  • BlueField-4 DPUs

  • (plus the rest of the co-designed networking/switch silicon and the software ecosystem around it)

NVIDIA describes Vera Rubin NVL72 as a unified system combining those parts.

4) Networking: why it’s half the story now

Once you scale beyond a single node/rack, networking determines whether you get:

  • 90%+ utilization, or

  • a very expensive space heater.

Key pieces hyperscalers buy:

  • Quantum / InfiniBand (e.g., Quantum-2 400G-class era, now Quantum-X800 with 800 Gb/s connectivity and high radix)

  • Spectrum-X Ethernet (standards-based Ethernet tuned/validated for AI clouds; NVIDIA claims improved efficiency at huge scale)

  • ConnectX SuperNICs at the endpoints (up to 800 Gb/s in ConnectX-8)

  • BlueField DPUs to offload networking/storage/security (BlueField-3 supports 400 Gb/s)
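To see why fabric bandwidth is half the story, here is a crude back-of-envelope (all numbers below are illustrative assumptions; it ignores overlap with compute and in-network reduction such as SHARP): a ring all-reduce pushes roughly 2·(N−1)/N of the gradient bytes over each GPU's network link, so per-GPU link speed directly bounds how fast gradients can be exchanged.

```python
# Back-of-envelope: why scale-out bandwidth decides utilization.
# A ring all-reduce moves ~2*(N-1)/N of the gradient bytes through each link,
# so communication time per step is roughly that volume / link bandwidth.
# All numbers below are illustrative assumptions, not measured figures.
def allreduce_seconds(grad_bytes: float, n_ranks: int, link_gbps: float) -> float:
    volume = 2 * (n_ranks - 1) / n_ranks * grad_bytes   # bytes over each link
    return volume / (link_gbps * 1e9 / 8)               # Gb/s -> bytes/s

grad_bytes = 70e9 * 2          # e.g. 70B parameters' gradients in BF16
for gbps in (200, 400, 800):   # per-GPU network bandwidth classes
    t = allreduce_seconds(grad_bytes, n_ranks=1024, link_gbps=gbps)
    print(f"{gbps} Gb/s per GPU -> ~{t:.2f} s of pure comms per full all-reduce")
```

Doubling the per-GPU link speed roughly halves that communication time, which is the whole argument for 800G-class fabrics once clusters reach tens of thousands of GPUs.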

5) “Performance” in 2026 is not one number

Hyperscalers care less about peak TFLOPS and more about:

  • time-to-train (weeks vs months)

  • tokens per dollar

  • tokens per watt

  • latency at P99 (tail latency)

  • cluster efficiency at scale (10k–100k GPU behavior)
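For concreteness, a minimal sketch of how these unit economics fall out of measured numbers (every input below is a hypothetical placeholder, not a benchmark): "tokens per watt" in practice means throughput per watt, i.e., tokens per joule.

```python
# Minimal sketch of the unit economics in the list above. Every input is a
# hypothetical placeholder; plug in measured throughput, power, and cost.
def tokens_per_dollar(tokens_per_sec: float, cost_per_hour: float) -> float:
    return tokens_per_sec * 3600 / cost_per_hour

def tokens_per_joule(tokens_per_sec: float, avg_watts: float) -> float:
    # "tokens per watt" colloquially = throughput per watt = tokens per joule
    return tokens_per_sec / avg_watts

# Hypothetical single-rack serving numbers:
tps, usd_per_hr, watts = 12_000, 45.0, 8_000
print(f"tokens per dollar: {tokens_per_dollar(tps, usd_per_hr):,.0f}")
print(f"tokens per joule:  {tokens_per_joule(tps, watts):.2f}")
```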

That’s why each generation push looks like:

  1. more/better Tensor Core modes (FP8/FP4 focus),

  2. more HBM and better memory bandwidth,

  3. better scale-up (NVLink/NVSwitch),

  4. better scale-out (InfiniBand/Ethernet + NICs),

  5. more packaged software (TensorRT/Triton/NIM) to turn hardware into usable production systems.

6) The bottom line

NVIDIA isn’t selling “a GPU.”
NVIDIA is selling a full AI data center architecture, where each part exists to keep the GPUs productive at insane scale.

That’s why hyperscalers buy:

  • rack-scale systems (NVL72 class),

  • networking fabrics (InfiniBand/Ethernet),

  • NICs/DPUs,

  • and the inference/training software layer.

Because in 2026, the winning product is not a chip.

It’s the AI factory.

Sorca Marian

Founder, CEO & CTO of Self-Manager.net & abZGlobal.net | Senior Software Engineer

https://self-manager.net/