What we tested
An RTX 5090 in a single-GPU workstation running local inference on llama.cpp, vLLM, and ComfyUI. Workloads were picked to reflect what a practitioner actually does, not what a leaderboard cares about:
- 32B-class at Q8 on llama.cpp with 32k context, streaming.
- 70B-class at IQ3_M on llama.cpp with 8k context, streaming (a tight fit in 32 GB).
- SDXL + Flux image generation through ComfyUI with a LoRA stack.
- Wan 2.2 video generation via Wan2GP — sustained compute and VRAM.
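For readers who want to reproduce the two llama.cpp workloads, the sketch below assembles a `llama-server` command line for each. The GGUF filenames are hypothetical placeholders, and offloading every layer to the GPU is an assumption about the setup rather than a documented detail of these runs. `--model`, `--ctx-size`, and `--n-gpu-layers` are standard llama.cpp server flags.

```python
import shlex


def llama_server_args(model_path: str, context_tokens: int, gpu_layers: int = 99) -> list[str]:
    """Build an argv list for llama.cpp's llama-server (nothing is launched here)."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(context_tokens),
        # A value larger than the model's layer count offloads all layers to the GPU.
        "--n-gpu-layers", str(gpu_layers),
    ]


# Hypothetical filenames; substitute the GGUF files you actually downloaded.
WORKLOADS = {
    "32B-class, Q8, 32k context": ("models/example-32b-q8_0.gguf", 32 * 1024),
    "70B-class, IQ3_M, 8k context": ("models/example-70b-iq3_m.gguf", 8 * 1024),
}

if __name__ == "__main__":
    for label, (path, ctx) in WORKLOADS.items():
        print(f"{label}:")
        print("  " + shlex.join(llama_server_args(path, ctx)))
```

The function only builds the command, so it is safe to run on a machine without llama.cpp installed; pass the resulting list to `subprocess.run` once the paths are real.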
What you'll feel
The shift versus a 4090 is the size of model that fits before you have to start thinking about it. The 4090's 24 GB forces decisions on quant tier and KV cache budget; 32 GB lets a 32B-class model run at Q8 with full context, without juggling either. 70B-class still requires aggressive quantization (IQ3 / IQ2) on a single card, and the extra 8 GB does not change that.
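The quant-tier versus KV-cache trade-off can be estimated with simple arithmetic: weights take roughly parameters times bits per weight, and the KV cache grows linearly with context. The sketch below is a rough estimate under stated assumptions (a dense model with grouped-query attention, an fp16 cache, and a flat overhead allowance that is a guess). The layer count, KV head count, head dimension, and the 5-bit quant are hypothetical inputs to replace with your model's real values.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """Approximate KV cache size in GB; the factor 2 covers one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9


def max_context_tokens(vram_gb: float, n_params: float, bits_per_weight: float,
                       n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: float = 2.0, overhead_gb: float = 1.5) -> int:
    """Tokens of KV cache that fit after weights and a flat overhead (assumed) are subtracted."""
    free_gb = vram_gb - overhead_gb - weights_gb(n_params, bits_per_weight)
    if free_gb <= 0:
        return 0  # the weights alone do not fit; layers would have to be offloaded
    per_token_gb = kv_cache_gb(n_layers, n_kv_heads, head_dim, 1, bytes_per_elem)
    return int(free_gb / per_token_gb)


if __name__ == "__main__":
    # Hypothetical 32B-class dense model at a mid-tier ~5-bit quant.
    model = dict(n_params=32e9, bits_per_weight=5.0,
                 n_layers=64, n_kv_heads=8, head_dim=128)
    for vram in (24, 32):
        tokens = max_context_tokens(vram, **model)
        print(f"{vram} GB card: room for roughly {tokens} tokens of fp16 KV cache")
```

For reference, llama.cpp's Q8_0 format stores 8.5 bits per weight (32 int8 values plus one fp16 scale per block), so use that figure when checking a Q8 model against your card. If the result comes back as zero, the usual levers are a lower quant tier, a quantized KV cache, or a shorter context.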
On pure throughput for short prompts, the gap is smaller than the spec sheet suggests. Both cards are memory-bandwidth-bound on most real workloads. The 5090 earns its premium where context grows long, batches fatten, or the model brushes the ceiling.
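Being memory-bandwidth-bound has a simple back-of-envelope form: each generated token has to stream the weights (plus the KV cache so far) from VRAM once, so bandwidth divided by bytes read gives a ceiling on decode speed. The sketch below is a rough model under that assumption, not a benchmark. The bandwidth figures are the published spec-sheet values for each card, and the weight and cache sizes are hypothetical inputs.

```python
# Published peak memory bandwidth in GB/s (spec-sheet values, not measurements).
SPEC_BANDWIDTH_GBS = {"RTX 4090": 1008.0, "RTX 5090": 1792.0}


def decode_ceiling_tps(bandwidth_gbs: float, weights_gb: float,
                       kv_cache_gb: float = 0.0) -> float:
    """Upper bound on single-stream decode tokens/s, assuming every token
    streams the full weights and the current KV cache from VRAM exactly once."""
    gb_read_per_token = weights_gb + kv_cache_gb
    if gb_read_per_token <= 0:
        raise ValueError("bytes read per token must be positive")
    return bandwidth_gbs / gb_read_per_token


if __name__ == "__main__":
    # Hypothetical workload: 20 GB of quantized weights, short vs. long context.
    for kv_gb in (0.5, 8.0):
        for card, bandwidth in SPEC_BANDWIDTH_GBS.items():
            tps = decode_ceiling_tps(bandwidth, weights_gb=20.0, kv_cache_gb=kv_gb)
            print(f"{card}, KV cache {kv_gb} GB: at most ~{tps:.0f} tok/s")
```

Measured throughput lands below this ceiling because of kernel overhead, sampling, and prompt processing, which is largely compute-bound rather than bandwidth-bound.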
Setup notes (if you're upgrading)
- CUDA 13.2 + cuDNN 9.20 is the current known-good combo. Don't mix it with a CUDA 12 install; dependency resolution gets ugly.
- Skip xformers. Install SageAttention 2.2.0 instead: it doesn't force a torch downgrade, and performance is within noise.
- cu128 or cu130 PyTorch builds are mandatory for sm_120.
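A quick way to confirm that an installed PyTorch build matches the last point is to compare its CUDA version and compiled architecture list against the GPU's compute capability. The sketch below is one possible check, not an official tool: the helper names are made up for illustration, while `torch.version.cuda`, `torch.cuda.get_arch_list()`, and `torch.cuda.get_device_capability()` are existing PyTorch calls.

```python
from typing import Optional, Sequence, Tuple

# sm_120 needs a PyTorch build compiled against CUDA 12.8 or newer (cu128 / cu130).
MIN_CUDA_FOR_SM120 = (12, 8)


def parse_cuda_version(version: str) -> Tuple[int, int]:
    """Turn a string such as '12.8' into (12, 8)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def build_supports_device(cuda_version: Optional[str],
                          arch_list: Sequence[str],
                          capability: Tuple[int, int]) -> bool:
    """Return True if this PyTorch build ships kernels for the given GPU."""
    if cuda_version is None:
        return False  # CPU-only PyTorch build
    if capability >= (12, 0) and parse_cuda_version(cuda_version) < MIN_CUDA_FOR_SM120:
        return False
    return f"sm_{capability[0]}{capability[1]}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        if not torch.cuda.is_available():
            print("No CUDA device is visible to PyTorch.")
        else:
            cap = torch.cuda.get_device_capability(0)
            ok = build_supports_device(torch.version.cuda,
                                       torch.cuda.get_arch_list(), cap)
            print(f"CUDA build {torch.version.cuda}, device sm_{cap[0]}{cap[1]}: "
                  f"{'supported' if ok else 'NOT supported by this build'}")
```

Run it inside the same virtual environment that ComfyUI or vLLM uses, since each tool may carry its own PyTorch install.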
Who should buy
- 32B-class local at Q8 with full context — that's the sweet spot.
- 70B-class with IQ3 quants and modest context, if a Mac Studio's first-token latency would kill the workflow.
- Image / video generation pipelines that lived inside 24 GB but bumped the ceiling on long videos or large LoRA stacks.
Who should skip
- Anyone for whom 24 GB has been enough. Bandwidth and thermals improve, but the model-class ceiling barely shifts, so the upgrade only pays off if you were already hitting it.
- Cloud-only workflows. A 5090 is a workstation buy, not a datacenter play.
Bottom line
If you're buying a new workstation for local AI work in 2026, this is the default. 32 GB eases the quant-vs-context squeeze by a tier: not enough to make 70B-class trivial, but enough that most workloads no longer need a VRAM budget.
