The job
You want to run sizeable local models at home for development, research, or writing — at usable speed, with a context window that doesn't force you to chunk everything. You're allergic to the monthly cloud bill. You have ~$4k to spend and you want the rig to still feel fast in eighteen months.
This guide is not for:
- Full fine-tuning or training from scratch (you need more VRAM or a multi-GPU rig).
- Pure image/video generation (different tradeoffs, covered in a separate guide).
- Production inference serving (this is a workstation, not a datacenter node).
The build
| Part | Pick | Why |
|---|---|---|
| GPU | NVIDIA RTX 5090 (32 GB) | 32B-class at Q8 with full context; 70B at IQ3. |
| CPU | AMD Ryzen 9 9950X or similar 16-core | The CPU-side bottlenecks are single-thread speed and PCIe lanes. |
| RAM | 64 GB DDR5-6000 (2×32) | Leaves room for KV-cache spill + tooling. |
| Storage | 2 TB PCIe 4.0 NVMe | Model weights + datasets + Docker images. |
| PSU | 1000 W 80+ Gold, single rail | The 5090 alone is rated at 575 W; don't be clever here. |
| Case | Airflow-first mid-tower; 3× intake / 2× exhaust | Sustained loads run for hours. |
| OS | Windows 11 Pro or Ubuntu 24.04 | Your call. Both work; drivers are mature. |
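The PSU pick in the table can be sanity-checked with a rough power budget. This is a minimal sketch, not a measurement: the GPU and CPU figures are the vendors' rated numbers, the allowance for everything else is an assumption, and `psu_headroom` is a hypothetical helper name.

```python
# Rough sustained-load power budget for the build above.
# Replace these with the figures for the parts you actually buy.
PARTS_W = {
    "gpu_rtx_5090": 575,        # rated total graphics power
    "cpu_ryzen_9_9950x": 170,   # rated TDP; short bursts can exceed this
    "board_ram_ssd_fans": 100,  # assumed allowance for everything else
}


def psu_headroom(psu_watts: float, parts: dict[str, float]) -> tuple[float, float]:
    """Return (total draw in watts, fraction of PSU capacity in use)."""
    total = sum(parts.values())
    return total, total / psu_watts


total_w, load = psu_headroom(1000, PARTS_W)
print(f"Estimated sustained draw: {total_w:.0f} W ({load:.1%} of a 1000 W PSU)")
```

Transient spikes are not modeled here, which is one more reason to leave margin rather than size the PSU to the sum.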
Numbers
Approximate inference throughput on this build with llama.cpp, short prompt:
- 32B-class at Q8: ~28–34 tok/s; a full 32k context fits.
- 70B-class at IQ3_M: ~9–14 tok/s, with ~8k context before the KV cache starts crowding VRAM.
- Cold start: dominated by model load from NVMe (roughly 4–5 seconds for a 32B Q8 on a PCIe 4.0 drive).
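Two back-of-envelope bounds help put figures like these in context: best-case load time is file size over disk read speed, and single-stream decode speed is capped by memory bandwidth over weight size, since every generated token reads every weight once. The sketch below assumes a ~34 GB file (about what 32B parameters cost at roughly 8.5 bits per weight), a 7 GB/s drive, and the 5090's rated 1792 GB/s memory bandwidth; the function names are illustrative.

```python
def load_seconds(file_gb: float, read_gb_per_s: float) -> float:
    """Best-case time to stream a model file from disk."""
    return file_gb / read_gb_per_s


def decode_ceiling_tok_s(weights_gb: float, mem_bw_gb_per_s: float) -> float:
    """Upper bound on single-stream decode speed.

    Each generated token reads all weights once, so
    tokens/s cannot exceed bandwidth / weight bytes.
    """
    return mem_bw_gb_per_s / weights_gb


# Assumed inputs; substitute your real file size and drive speed.
MODEL_FILE_GB = 34.0     # assumption: ~32B parameters at ~8.5 bits/weight
NVME_READ_GB_S = 7.0     # assumption: fast PCIe 4.0 drive, sequential read
GPU_MEM_BW_GB_S = 1792   # RTX 5090 rated memory bandwidth

print(f"Best-case load: {load_seconds(MODEL_FILE_GB, NVME_READ_GB_S):.1f} s")
print(f"Decode ceiling: {decode_ceiling_tok_s(MODEL_FILE_GB, GPU_MEM_BW_GB_S):.0f} tok/s")
```

Real throughput lands below the ceiling because of kernel overhead, KV-cache reads, and sampling, so treat the bound as a plausibility check rather than a prediction.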
Your mileage will vary with prompt shape and sampler choice. The 32B-class sweet spot is where this rig shines; 70B-class is doable but tight.
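Whether a given quant and context length fit on the card comes down to weights plus KV cache plus some runtime overhead. The estimator below is a rough sketch under stated assumptions: the bits-per-weight values are approximate (Q8_0 stores about 8.5 bits per weight), the layer and head counts describe a hypothetical 32B-class model with grouped-query attention, and the overhead figure is a guess. Real GGUF files vary by architecture, so check the numbers before committing to a download.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size: one key and one value vector per layer per token."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


def vram_needed_gb(params_billion: float, bits_per_weight: float,
                   n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, overhead_gb: float = 1.5) -> float:
    """Weights + fp16 KV cache + an assumed allowance for buffers."""
    return (weights_gb(params_billion, bits_per_weight)
            + kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens)
            + overhead_gb)


# Hypothetical 32B-class shape: 64 layers, 8 KV heads, head dimension 128.
CARD_GB = 32
for name, bpw in [("Q8_0", 8.5), ("Q6_K", 6.56), ("Q4_K_M", 4.85)]:
    need = vram_needed_gb(32, bpw, 64, 8, 128, context_tokens=32_768)
    verdict = "fits" if need <= CARD_GB else "needs offload or less context"
    print(f"{name}: ~{need:.1f} GB -> {verdict}")
```

Quantizing the KV cache or shortening the context shrinks the second term, which is usually the cheapest lever when a model is a few GB over.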
Tradeoffs
- Dual 4090 instead of a single 5090. More aggregate VRAM (48 GB), but you lose the clean single-card setup: splitting a model across two cards takes extra configuration, and not all local-inference tooling handles it well.
- Threadripper instead of Ryzen 9. More PCIe lanes, more cores, more money. If you'll add a second GPU in year two, worth it. If not, skip.
- Cloud on-demand. Breaks even with this rig after roughly 18 months of heavy use, depending on your cloud tier.
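The cloud break-even point depends on your own rate and usage, so it is worth computing rather than taking on faith. A minimal sketch follows; the hourly cloud rate, monthly hours, power draw, and electricity price are all hypothetical values chosen for illustration, and only the ~$4k rig cost comes from this guide.

```python
def breakeven_months(rig_cost: float, cloud_rate_per_hour: float,
                     hours_per_month: float, power_kw: float = 0.0,
                     electricity_per_kwh: float = 0.0) -> float:
    """Months until cumulative cloud spend matches rig cost plus its electricity."""
    cloud_monthly = cloud_rate_per_hour * hours_per_month
    rig_monthly = power_kw * electricity_per_kwh * hours_per_month
    saving = cloud_monthly - rig_monthly
    if saving <= 0:
        raise ValueError("cloud costs less per month; the rig never breaks even")
    return rig_cost / saving


months = breakeven_months(
    rig_cost=4000,             # budget from this guide
    cloud_rate_per_hour=1.50,  # hypothetical on-demand GPU rate
    hours_per_month=160,       # hypothetical "heavy use"
    power_kw=0.7,              # hypothetical average draw under load
    electricity_per_kwh=0.15,  # hypothetical electricity price
)
print(f"Break-even after about {months:.1f} months")
```

Halving the monthly hours roughly doubles the break-even time, so light users should run this with honest numbers before buying.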
What this doesn't get you
- Multi-GPU training. You need NVLink, more lanes, more PSU headroom.
- Proper datacenter-style serving (batching, multi-user concurrency).
- A good excuse. Buy the rig.