Serious local LLM workstation under $10k (2026)

Two RTX 5090s, a Threadripper for PCIe lanes, and 128 GB of DDR5. 64 GB of aggregate VRAM at the lowest cost-per-GB on this tier — and it sits on your desk.

Job-to-be-done · Run 70B+ models comfortably, multi-GPU agentic workflows, or LoRA fine-tunes — all locally.

Measured

  • 70B, 4-bit, 32k context, tensor-parallel 2: 30–45 tokens/s at 80% GPU utilization
  • 70B, 8-bit, 32k context, tensor-parallel 2: 18–25 tokens/s
  • 123B-class, 4-bit: 12–18 tokens/s
  • LoRA on a 13B base: batch size 4–8

The job

You want to run 70B-class models at home with room to breathe. A 70B at 4-bit fits in roughly 40 GB of VRAM with a 32k context; at 8-bit it wants closer to 75 GB once you load KV cache. You also want headroom for agentic workflows that pin two or three smaller models in memory at once, or LoRA fine-tunes on mid-scale bases. You're not chasing datacenter throughput. You're not pretraining. You want a single tower that earns its keep when the cloud bill stops being a rounding error.
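
The VRAM figures above can be roughly reproduced with back-of-envelope arithmetic: weight memory is parameters times bits per weight, and the KV cache grows linearly with context length. The sketch below is an estimate, not a measurement. The model shape (80 layers, 8 KV heads, head dimension 128) and the 1-byte (FP8) KV cache are assumptions for illustration, not details stated in this guide.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 1) -> float:
    """KV cache in GB for one sequence: a key and a value per layer, head and token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


# Assumed shape (hypothetical, not from this guide): 80 layers, 8 KV heads,
# head_dim 128, with the KV cache stored at 1 byte per value (FP8).
kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, context_tokens=32_768)

for bits in (4, 8):
    total = weights_gb(70, bits) + kv
    print(f"70B at {bits}-bit, 32k context: ~{total:.0f} GB")
```

Under these assumptions the totals land near 40 GB and 75 GB. Passing `bytes_per_value=2` (an FP16 cache) doubles the cache term, and activation buffers and framework overhead add more on top, so compare the result against the 64 GB of aggregate VRAM with some margin.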

This guide is not for you if:

  • You need 405B-class models. Multi-node problem, different blueprint.
  • You want pretraining capacity. Different scale of compute entirely.
  • You need a quiet office. Two 5090s are loud.

The build

Part | Pick | Why
GPU | 2x NVIDIA RTX 5090 (32 GB each) | 64 GB aggregate VRAM at $4,000. 1,792 GB/s per card. Tensor-split is mature in vLLM and llama.cpp.
CPU | AMD Threadripper 7970X (32-core) | 48 PCIe 5.0 lanes feed two x16 GPU slots without bifurcation tricks. ~$2,500 street.
RAM | 128 GB DDR5-5600 ECC RDIMM (4x32) | Quad-channel matches the platform. ECC because long fine-tune runs deserve it.
Storage | 2 TB Samsung 990 Pro NVMe + 4 TB secondary | Hot models on the fast drive; weights, datasets, checkpoints on the bulk.
PSU | Corsair AX1600i (1600 W, Titanium) | Two 5090s pull a peak 1,150 W under transient spikes. 1500 W is the floor; 1600 W with margin is the answer.
Case | Fractal Define 7 XL or Phanteks Enthoo Pro 2 | E-ATX, 8+ slots, airflow for 1,150 W of GPU heat. Two 5090s need real space.
OS | Ubuntu 24.04 LTS | CUDA 13 lands cleanly. NVIDIA's open driver is the default for Blackwell. WSL2 if Windows is non-negotiable.
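
The PSU sizing in the table comes down to simple addition, sketched below. The per-GPU figure is half of the 1,150 W quoted above. The 350 W CPU figure is an assumption (the part's rated TDP), not a number from this guide, and drives, fans, RAM and short transients above the rated draw are not modeled.

```python
def psu_headroom(psu_watts: float, loads_watts: dict[str, float]) -> tuple[float, float]:
    """Return (total load in W, spare capacity as a fraction of the PSU rating)."""
    total = sum(loads_watts.values())
    return total, (psu_watts - total) / psu_watts


loads = {
    "gpu_0": 575.0,  # half of the guide's 1,150 W two-card figure
    "gpu_1": 575.0,
    "cpu": 350.0,    # assumption: rated TDP, not a number from this guide
}

total, spare = psu_headroom(1600.0, loads)
print(f"load {total:.0f} W, spare {spare:.1%} of a 1600 W unit")
```

With these inputs the GPUs and CPU alone reach the 1500 W floor, which is why the table treats a 1600 W unit as the practical minimum rather than a luxury. Add entries to `loads` for anything else you plan to install.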

Numbers

  • 70B at 4-bit, 32k context — ~30-45 tok/s with tensor-parallel-2 in vLLM. Both cards loaded ~80%.
  • 70B at 8-bit, 32k context — ~18-25 tok/s. KV cache fits across the two cards.
  • 123B-class at 4-bit — runs at 64 GB but tight; expect ~12-18 tok/s and no room for long context.
  • LoRA on a 13B base — comfortable headroom, batch size 4-8 depending on sequence length.
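
One way to sanity-check these throughput numbers is a memory-bandwidth ceiling: during single-stream decoding each generated token has to read roughly all of the weights once, so tokens per second cannot exceed aggregate bandwidth divided by weight size. The sketch below is a simplified upper bound under that assumption. It ignores KV cache reads, PCIe synchronization between the two cards, and kernel overhead, so real results should sit well below it.

```python
def decode_ceiling_tok_s(params_billion: float, bits_per_weight: float,
                         bandwidth_gb_s: float, num_devices: int = 1) -> float:
    """Upper bound on decode tokens/s when every token reads all weights once.

    With tensor parallelism each device reads its own shard, so the
    per-device bandwidths add up.
    """
    weight_gb = params_billion * bits_per_weight / 8
    return num_devices * bandwidth_gb_s / weight_gb


RTX_5090_GB_S = 1792.0  # per-card bandwidth quoted in the build table

configs = [
    ("70B, 4-bit, 2x 5090", 70, 4),
    ("70B, 8-bit, 2x 5090", 70, 8),
    ("123B, 4-bit, 2x 5090", 123, 4),
]
for name, params, bits in configs:
    ceiling = decode_ceiling_tok_s(params, bits, RTX_5090_GB_S, num_devices=2)
    print(f"{name}: at most ~{ceiling:.0f} tok/s")
```

The measured ranges above fall below these ceilings, as expected for two cards that communicate over PCIe. The same function also shows why a platform with far lower memory bandwidth trails on token throughput even when the model fits.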

Tradeoffs

  • DGX Spark instead — 128 GB of unified LPDDR5X at 273 GB/s for $4,699. Twice the addressable memory, but roughly a sixth of the memory bandwidth of a single 5090 (273 vs. 1,792 GB/s). Wins on the largest models that won't fit in 64 GB; loses on token throughput for everything that does.
  • Single RTX PRO 6000 Blackwell — 96 GB of GDDR7 with ECC on one card at 1,792 GB/s. No multi-GPU plumbing. Partner pricing typically $7,000-$9,000, which leaves a thin budget for the rest of the rig. Right answer if you hate tensor-parallel debugging or need ECC VRAM specifically.
  • Cloud H200/B200 instances — burstable for occasional 405B work. Above ~200 hours of monthly use, renting costs more than owning, and the latency is never local.
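
The rent-versus-own break-even in the last bullet can be sketched as a one-line formula: the monthly hours at which cloud spend equals the amortized hardware cost plus electricity. Every input below is a hypothetical placeholder chosen for illustration (build cost, amortization period, hourly cloud rate, power draw, electricity price). The guide does not state the assumptions behind its ~200-hour figure, so substitute your own quotes.

```python
def breakeven_hours_per_month(build_cost_usd: float, amortize_months: int,
                              cloud_usd_per_hour: float,
                              local_kw: float, usd_per_kwh: float) -> float:
    """Monthly usage above which owning is cheaper than renting.

    Solves: amortized_cost + hours * electricity == hours * cloud_rate.
    Returns infinity if the cloud rate does not exceed local electricity cost.
    """
    monthly_amortized = build_cost_usd / amortize_months
    electricity_per_hour = local_kw * usd_per_kwh
    saving_per_hour = cloud_usd_per_hour - electricity_per_hour
    if saving_per_hour <= 0:
        return float("inf")
    return monthly_amortized / saving_per_hour


# All values are hypothetical placeholders, not figures from this guide.
hours = breakeven_hours_per_month(
    build_cost_usd=10_000,    # budget ceiling named in the title
    amortize_months=24,       # assumed useful life for the comparison
    cloud_usd_per_hour=2.30,  # assumed rental rate
    local_kw=1.5,             # assumed draw under load
    usd_per_kwh=0.15,         # assumed electricity price
)
print(f"break-even at about {hours:.0f} hours per month")
```

The result is sensitive to the amortization period and the hourly rate: a longer useful life or a pricier instance pulls the break-even point down, while cheap spot pricing pushes it up.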

What this doesn't get you

  • 405B-class models at full precision. That's a multi-node problem.
  • Pretraining anything serious. Different scale of compute.
  • Quiet operation. Two 5090s under load are loud, and 1,150 W of heat has to go somewhere.
  • A path to NVLink. Consumer Blackwell skipped it; the cards talk over PCIe 5.0.