Guide · datacenter

Single-rack production inference blueprint (2026)

Spec a single 19-inch rack for production LLM inference. HGX B200 in OEM 4U as the default, DGX B200 if you want the appliance, GB200 NVL72 if your facility is liquid-ready.

Job-to-be-done · Spec a single rack for company-tier LLM inference. Decide air vs liquid; own vs rent.

The job

You are an infra lead. The CFO asked why the inference bill is what it is, and you have to answer with a spreadsheet. You serve a production LLM workload to internal users or paying customers. Latency targets are written down. Utilization is high enough that renting Hopper-class instances on demand stopped being clever about a quarter ago.

This guide picks parts for one 19-inch rack. The question it answers: do you colocate your own Blackwell-class gear, or keep renting? The short version is that owning wins above roughly 70% utilization on a horizon longer than a year. Below that, rent.

This guide is not for you if:

  • You need multi-rack scale-out. NVLink is rack-bounded; spanning racks is a different blueprint.
  • You're pretraining a frontier model. This rack serves; it does not pretrain.
  • Your facility can't deliver 35+ kW per cabinet with rear-door cooling. Different conversation.

The build

| Item | Pick | Why |
| --- | --- | --- |
| Compute | 2-3x OEM 4U servers with NVIDIA HGX B200 baseboard | 8x B200 per node, 144 PFLOPS FP4 sparse, air-cooled, OEM choice |
| Memory | 1,400 GB HBM3e per baseboard; 2 TB system DRAM | Fits trillion-parameter weights resident; enough host RAM for KV offload |
| Networking | ConnectX-7 NICs at 400 Gbps; BlueField-3 DPUs | East-west fabric for tensor parallel; DPU offloads storage and security |
| Storage | NVMe all-flash tier, separate from compute nodes | Weights and logs survive a node swap; keeps the GPU chassis simple |
| Power | 2x 30A 208V three-phase PDUs, A+B feeds | One HGX node draws ~10 kW; three nodes plus fabric clears 35 kW |
| Cooling | Rear-door heat exchanger or hot-aisle containment | Air-cooled HGX/DGX B200 lets you skip facility water plumbing |
| Software | Triton, vLLM or TensorRT-LLM, Kubernetes, MIG | Open serving stack; MIG partitions a B200 for smaller models |
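
The power row is worth checking with arithmetic before you sign a colo quote. The sketch below is a rough check, not a sizing tool. It assumes unity power factor and an 80% continuous-load derating, and it uses a hypothetical 5 kW allowance for switches, DPUs, and storage, which is a placeholder and not a figure from this guide. The function names are illustrative.

```python
import math

def three_phase_kw(amps: float, volts_line_to_line: float, derate: float = 0.8) -> float:
    """Continuous kW from one three-phase circuit, assuming unity power factor."""
    return math.sqrt(3) * volts_line_to_line * amps * derate / 1000.0

def rack_load_kw(nodes: int, kw_per_node: float, fabric_kw: float) -> float:
    """Total rack draw: GPU nodes plus switches, DPUs, and storage headroom."""
    return nodes * kw_per_node + fabric_kw

# Inputs from the build table, except fabric_kw, which is a hypothetical placeholder.
circuit_kw = three_phase_kw(amps=30, volts_line_to_line=208)
load_kw = rack_load_kw(nodes=3, kw_per_node=10.0, fabric_kw=5.0)

# With A+B redundancy, each feed alone must carry the whole rack.
circuits_per_feed = math.ceil(load_kw / circuit_kw)

print(f"one circuit: {circuit_kw:.1f} kW continuous")
print(f"rack load:   {load_kw:.1f} kW")
print(f"circuits needed per feed: {circuits_per_feed}")
```

Under these assumptions one 30 A, 208 V three-phase circuit supplies about 8.6 kW continuous, so a 35 kW rack needs five such circuits on each feed if either feed must carry the load alone. Substitute the rating and derating your colo actually quotes.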

Numbers

  • HGX B200 baseboard — 8 GPUs, 1,400 GB total HBM3e, 14.4 TB/s NVLink, 144 PFLOPS FP4 sparse.
  • DGX B200 appliance — 10U, 14.3 kW max, 2 TB system memory, 4 OSFP ports serving 8 ConnectX-7 NICs at up to 400 Gbps each, 2 BlueField-3 DPUs.
  • GB200 NVL72 — 72 Blackwell GPUs and 36 Grace CPUs in one rack, 13.4 TB HBM3e, ~120 kW per rack, liquid-cooled.
  • DGX H200 — 8 GPUs, 1,128 GB HBM3e, 10.2 kW max. The non-liquid Hopper fallback.
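
The HBM totals above decide which models stay resident. The sketch below estimates the weight footprint at a given precision and the HBM left over on each system. It counts weights only: KV cache, activations, quantization scales, and runtime overhead are not modeled, so treat the headroom as an upper bound. The function and dictionary names are illustrative.

```python
# Total HBM3e per system in GB, taken from the list above.
HBM_GB = {"HGX B200": 1400, "DGX H200": 1128, "GB200 NVL72": 13400}

def weights_gb(params_billion: float, bits_per_param: int) -> float:
    """Weight footprint in decimal GB: 1e9 params at 8 bits is 1 GB."""
    return params_billion * bits_per_param / 8

def headroom_gb(system: str, params_billion: float, bits_per_param: int) -> float:
    """HBM left after weights are loaded. Negative means the weights do not fit."""
    return HBM_GB[system] - weights_gb(params_billion, bits_per_param)

# A trillion-parameter model (1000 billion) at three common precisions.
for bits in (4, 8, 16):
    left = headroom_gb("HGX B200", 1000, bits)
    status = "fits" if left >= 0 else "does not fit"
    print(f"{bits:>2}-bit: {weights_gb(1000, bits):6.0f} GB weights, {left:6.0f} GB left, {status}")
```

On one HGX B200 baseboard, a trillion-parameter model leaves about 900 GB at 4-bit and 400 GB at 8-bit, and does not fit at 16-bit. The precision you serve at therefore matters as much as the parameter count.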

Tradeoffs

  • GB200 NVL72 instead — Treat the whole rack as one platform. Top throughput per rack-U and the right call if you train as well as serve. The catch is liquid: ~120 kW and facility water are non-negotiable. If your colo cannot deliver chilled water to the cabinet, this is a non-starter regardless of price.
  • Hopper (DGX H200) instead — 10.2 kW per node fits a standard air-cooled cabinet without rear-door heat exchangers. Memory per system is 1,128 GB HBM3e, which is enough for most production serving workloads today. You give up the FP4 throughput Blackwell brings; if your workload is INT8 or FP8, the gap is smaller than the spec sheet suggests.
  • Cloud rental instead — Reserved Blackwell capacity from a hyperscaler is the right call below ~70% utilization or under a one-year horizon. Above that, the rack pays for itself and the following year is gross margin. Run the math with your actual contract pricing, your actual colo quote, and a three-year depreciation schedule. If the answer is close, rent. Owning is only worth it when the answer is obvious.
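
The own-versus-rent math can be sketched in a few lines. This is a simplified model under stated assumptions: straight-line depreciation with no residual value, no cost of capital, and a rental bill that scales with GPU-hours actually used. Every dollar figure below is a placeholder chosen for illustration, not a market price or a quote. Replace them with your contract pricing and colo quote.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760

@dataclass
class OwnCosts:
    capex_usd: float            # hardware purchase for the whole rack
    colo_usd_per_month: float   # space, power, cooling, cross-connects
    ops_usd_per_year: float     # support contracts, spares, staff share
    depreciation_years: int = 3

    def annual_usd(self) -> float:
        """Yearly cost of owning: depreciation plus colo plus operations."""
        return (self.capex_usd / self.depreciation_years
                + 12 * self.colo_usd_per_month
                + self.ops_usd_per_year)

def rent_annual_usd(gpus: int, usd_per_gpu_hour: float, utilization: float) -> float:
    """Yearly rental bill, assuming you pay only for GPU-hours used."""
    return gpus * usd_per_gpu_hour * HOURS_PER_YEAR * utilization

def breakeven_utilization(own: OwnCosts, gpus: int, usd_per_gpu_hour: float) -> float:
    """Utilization at which renting costs the same as owning."""
    return own.annual_usd() / (gpus * usd_per_gpu_hour * HOURS_PER_YEAR)

# Placeholder inputs: three 8-GPU nodes, hypothetical prices.
own = OwnCosts(capex_usd=1_200_000, colo_usd_per_month=10_000, ops_usd_per_year=80_000)
breakeven = breakeven_utilization(own, gpus=24, usd_per_gpu_hour=4.0)

print(f"own:  ${own.annual_usd():,.0f} per year")
print(f"rent: ${rent_annual_usd(24, 4.0, 0.5):,.0f} per year at 50% utilization")
print(f"break-even utilization: {breakeven:.0%}")
```

With these placeholder inputs the break-even lands near 71%, but the result is sensitive to the GPU-hour rate and the depreciation horizon. Rerun it with a few realistic price scenarios before drawing a conclusion.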

What this doesn't get you

  • Multi-rack scale-out. NVLink is rack-bounded; spanning racks means InfiniBand or Ethernet and a different blueprint.
  • Training a frontier model from scratch. This rack serves; it does not pretrain.
  • A solved facility problem. Power density and heat rejection are the gating decisions, not the GPU SKU.