The job
You are an infra lead. The CFO asked why the inference bill is what it is, and you have to answer with a spreadsheet. You serve a production LLM workload to internal users or paying customers. Latency targets are written down. Utilization is high enough that renting Hopper-class instances on demand stopped being clever about a quarter ago.
This guide picks parts for one 19-inch rack. The question it answers: do you colocate your own Blackwell-class gear, or keep renting? The short version is that owning wins above roughly 70% utilization on a horizon longer than a year. Below that, rent.
This guide is not for you if:
- You need multi-rack scale-out. NVLink does not span racks; in this build it stops at each node's 8-GPU baseboard. Spanning racks is a different blueprint.
- You're pretraining a frontier model. This rack serves; it does not pretrain.
- Your facility can't deliver 35+ kW per cabinet with rear-door cooling. Different conversation.
The build
| Item | Pick | Why |
|---|---|---|
| Compute | 2-3x OEM 4U servers with NVIDIA HGX B200 baseboard | 8x B200 per node, 144 PFLOPS FP4 sparse, air-cooled, OEM choice |
| Memory | 1,400 GB HBM3e per baseboard; 2 TB system DRAM | Fits trillion-parameter weights resident at FP4 (~500 GB) or FP8 (~1 TB); enough host RAM for KV offload |
| Networking | ConnectX-7 NICs at 400 Gbps; BlueField-3 DPUs | East-west fabric for cross-node traffic (tensor parallel within a node stays on NVLink); DPU offloads storage and security |
| Storage | NVMe all-flash tier, separate from compute nodes | Weights and logs survive a node swap; keeps the GPU chassis simple |
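| Power | A+B three-phase PDU feeds, each side sized to carry the full 35+ kW load | One HGX node draws ~10 kW; three nodes plus fabric clears 35 kW, well beyond the ~8.6 kW continuous rating of a single 30A 208V three-phase circuit |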
| Cooling | Rear-door heat exchanger or hot-aisle containment | Air-cooled HGX/DGX B200 lets you skip direct-to-chip liquid cooling |
| Software | Triton, vLLM or TensorRT-LLM, Kubernetes, MIG | Open serving stack; MIG partitions a B200 for smaller models |
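
The power row is the one most worth checking by hand. The sketch below estimates rack draw from the table's ~10 kW per node and compares it with what a three-phase circuit can deliver continuously. The fabric overhead, the 80% derating, and the power factor are assumptions for illustration, not figures from a vendor or a colo quote; swap in your own.

```python
import math

# Figures taken from the build table above.
KW_PER_NODE = 10.0      # "~10 kW" per HGX node
NODES = 3

# Assumptions for illustration only; replace with your own numbers.
FABRIC_KW = 5.0         # hypothetical draw for switches, DPUs, storage heads
DERATE = 0.8            # common continuous-load derating for a breaker
POWER_FACTOR = 1.0      # assumed; check your PSU and PDU documentation


def three_phase_kw(volts: float, amps: float) -> float:
    """Continuous kW available from one three-phase circuit (line-to-line volts)."""
    return math.sqrt(3) * volts * amps * DERATE * POWER_FACTOR / 1000.0


def rack_load_kw(nodes: int = NODES) -> float:
    """Estimated rack draw: compute nodes plus the assumed fabric overhead."""
    return nodes * KW_PER_NODE + FABRIC_KW


def circuits_per_side(load_kw: float, circuit_kw: float) -> int:
    """Circuits needed on EACH of the A and B sides so one side alone carries the load."""
    return math.ceil(load_kw / circuit_kw)


if __name__ == "__main__":
    load = rack_load_kw()
    for volts, amps in [(208, 30), (208, 60), (415, 60)]:
        per_circuit = three_phase_kw(volts, amps)
        print(f"{volts} V / {amps} A: {per_circuit:.1f} kW per circuit, "
              f"{circuits_per_side(load, per_circuit)} per side for {load:.0f} kW")
```

Under these assumptions a single 30A 208V three-phase circuit supplies about 8.6 kW continuous, so a 35 kW rack needs several circuits per side or a higher-amperage or higher-voltage feed. Sizing each side for the full load is what makes the A+B pair redundant and not merely split.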
Numbers
- HGX B200 baseboard — 8 GPUs, 1,400 GB total HBM3e, 14.4 TB/s NVLink, 144 PFLOPS FP4 sparse.
- DGX B200 appliance — 10U, 14.3 kW max, 2 TB system memory, 4 OSFP ports at 400 Gbps each, 2 BlueField-3 DPUs.
- GB200 NVL72 — 72 Blackwell GPUs and 36 Grace CPUs in one rack, 13.4 TB HBM3e, ~120 kW per rack, liquid-cooled.
- DGX H200 — 8 GPUs, 1,128 GB HBM3e, 10.2 kW max. The non-liquid Hopper fallback.
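
The HBM totals above translate into a quick fit check for model weights. The sketch below multiplies parameter count by bytes per parameter and compares the result against a system's HBM, holding back a slice for KV cache and activations. The 20% reserve is an assumed placeholder, not a measured figure, and the check ignores whether a given GPU supports the precision natively.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0, "fp16": 2.0}

# HBM totals from the list above, in GB.
HBM_GB = {"HGX B200": 1400, "DGX H200": 1128}

# Assumption: fraction of HBM held back for KV cache and activations.
KV_RESERVE_FRACTION = 0.2


def weights_gb(params_billions: float, precision: str) -> float:
    """Weight footprint in GB: 1e9 params * bytes per param / 1e9 bytes per GB."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, hbm_gb: float,
         reserve: float = KV_RESERVE_FRACTION) -> bool:
    """True if the weights fit in HBM after the reserve is set aside."""
    usable_gb = hbm_gb * (1.0 - reserve)
    return weights_gb(params_billions, precision) <= usable_gb


if __name__ == "__main__":
    for system, hbm in HBM_GB.items():
        for precision in BYTES_PER_PARAM:
            verdict = "fits" if fits(1000, precision, hbm) else "does not fit"
            print(f"1T params at {precision} on {system}: "
                  f"{weights_gb(1000, precision):.0f} GB, {verdict}")
```

The reserve is the parameter to argue about: long contexts and high concurrency push KV cache well past a fixed fraction, so treat a marginal "fits" as a "does not fit".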
Tradeoffs
- GB200 NVL72 instead — Treat the whole rack as one platform. Top throughput per rack-U and the right call if you train as well as serve. The catch is liquid: ~120 kW and facility water are non-negotiable. If your colo cannot deliver chilled water to the cabinet, this is a non-starter regardless of price.
- Hopper (DGX H200) instead — 10.2 kW per node fits a standard air-cooled cabinet without rear-door heat exchangers. Memory per system is 1,128 GB HBM3e, which is enough for most production serving workloads today. You give up the FP4 throughput Blackwell brings; if your workload is INT8 or FP8, the gap is smaller than the spec sheet suggests.
- Cloud rental instead — Reserved Blackwell capacity from a hyperscaler is the right call below ~70% utilization or under a one-year horizon. Above that, the rack pays for itself and the next year is gross margin. Run the math with your actual contract pricing, your actual colo quote, and a three-year depreciation schedule. If the answer is close, rent; owning is only worth it when the answer is obvious.
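
The own-versus-rent math in the last bullet can be sketched in a few lines. This is a simplified model, assuming rental cost scales linearly with utilization while owning is a fixed monthly cost (straight-line depreciation plus colo). Every dollar figure is a hypothetical placeholder chosen for illustration, not a real quote, and the 10-point safety margin is an assumed way to encode "if the answer is close, rent".

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_own_cost(capex_usd: float, depreciation_months: int,
                     colo_usd_per_month: float) -> float:
    """Fixed monthly cost of owning: straight-line depreciation plus colo fees."""
    return capex_usd / depreciation_months + colo_usd_per_month


def monthly_rent_cost(gpus: int, usd_per_gpu_hour: float, utilization: float) -> float:
    """Rental cost, assuming you pay only for the GPU-hours you actually use."""
    return gpus * usd_per_gpu_hour * HOURS_PER_MONTH * utilization


def break_even_utilization(own_monthly: float, gpus: int,
                           usd_per_gpu_hour: float) -> float:
    """Utilization at which renting costs the same as owning."""
    return own_monthly / (gpus * usd_per_gpu_hour * HOURS_PER_MONTH)


def decide(utilization: float, break_even: float, margin: float = 0.10) -> str:
    """Own only when utilization clears break-even by a margin; otherwise rent."""
    return "own" if utilization >= break_even + margin else "rent"


if __name__ == "__main__":
    # Hypothetical placeholders; substitute your contract and colo numbers.
    own = monthly_own_cost(capex_usd=1_800_000, depreciation_months=36,
                           colo_usd_per_month=12_000)
    be = break_even_utilization(own, gpus=24, usd_per_gpu_hour=5.00)
    print(f"own: ${own:,.0f}/month, break-even utilization: {be:.0%}")
    for u in (0.5, 0.75, 0.9):
        print(f"utilization {u:.0%}: {decide(u, be)}")
```

The model leaves out staffing, spares, financing cost, and resale value, all of which move the break-even point. Reserved contracts that bill regardless of use would also change the rental side, so check how your contract actually meters.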
What this doesn't get you
- Multi-rack scale-out. NVLink does not span racks, and in this build it stops at each node's 8-GPU baseboard; spanning racks means InfiniBand or Ethernet and a different blueprint.
- Training a frontier model from scratch. This rack serves; it does not pretrain.
- A solved facility problem. Power density and heat rejection are the gating decisions, not the GPU SKU.