Skip to content

Guide · under-4k

Local LLM rig under $4k (2026)

The minimum-viable workstation for serious local inference: single 5090, 64 GB system RAM, fast NVMe, and a case that holds up under sustained load.

Job-to-be-done · Run 70B-class models at home, without the cloud bill.

Measured

32b-class-q8 · tokens_per_second28–34 tokens_per_second
32b-class-q8 · usable_context_tokens_k32 usable_context_tokens_k
32b-class-q8 · cold_start_seconds4 cold_start_seconds
70b-class-iq3-m · tokens_per_second9–14 tokens_per_second
70b-class-iq3-m · usable_context_tokens_k8 usable_context_tokens_k

Bars scaled to largest value in set

The job

You want to run sizeable local models at home for development, research, or writing — at usable speed, with a context window that doesn't force you to chunk everything. You're allergic to the monthly cloud bill. You have ~$4k to spend and you want the rig to still feel fast in eighteen months.

This guide is not for:

  • Fine-tuning from scratch (you need more VRAM or a multi-GPU rig).
  • Pure image/video generation (different tradeoffs, covered in a separate guide).
  • Production inference serving (this is a workstation, not a datacenter node).

The build

PartPickWhy
GPUNVIDIA RTX 5090 (32 GB)32B-class at Q8 with full context; 70B at IQ3.
CPUAMD Ryzen 9 9950X or similar 16-coreYou'll bottleneck on single-thread + some lanes.
RAM64 GB DDR5-6000 (2×32)Leaves room for KV-cache spill + tooling.
Storage2 TB PCIe 4.0 NVMeModel weights + datasets + Docker images.
PSU1000 W 80+ Gold, single rail5090 is serious; don't be clever here.
CaseAirflow-first mid-tower; 3× intake / 2× exh.Sustained loads run for hours.
OSWindows 11 Pro or Ubuntu 24.04Your call. Both work; drivers are mature.

Numbers

Approximate inference throughput on this build with llama.cpp, short prompt:

  • 32B-class at Q8 — ~28–34 tok/s, full 32k context fits.
  • 70B-class at IQ3_M — ~9–14 tok/s, ~8k context before KV pressure.
  • Cold start dominated by model load from NVMe (~4 seconds for a 32B Q8).

Your mileage will vary with prompt shape and sampler choice. The 32B-class sweet spot is where this rig shines; 70B-class is doable but tight.

Tradeoffs

  • Dual 4090 instead of a single 5090. Higher aggregate VRAM (48 GB), but you lose the clean single-card setup, and a lot of local-inference tooling doesn't cleanly split across two cards without effort.
  • Threadripper instead of Ryzen 9. More PCIe lanes, more cores, more money. If you'll add a second GPU in year two, worth it. If not, skip.
  • Cloud on-demand. Breaks even with this rig around ~18 months of heavy use, depending on your cloud tier.

What this doesn't get you

  • Multi-GPU training. You need NVLink, more lanes, more PSU headroom.
  • Proper datacenter-style serving (batching, multi-user concurrency).
  • A good excuse. Buy the rig.