Item: NVIDIA GeForce RTX 5090
Rating: 9
Author: MadCoolStuff

32 GB of VRAM on Blackwell (sm_120) is enough to run 32B-class models at high quants without paging, or 70B-class at IQ3 with care. Worth the jump from a 4090 if you live in llama.cpp.

What we tested

A 5090 in a single-GPU workstation running local inference on llama.cpp, vLLM, and ComfyUI. Workloads picked to reflect what a practitioner actually does, not what a leaderboard cares about:

32B-class at Q8 on llama.cpp with 32k context, streaming.
70B-class at IQ3_M on llama.cpp with 8k context, streaming — tight.
SDXL + Flux image generation through ComfyUI with a LoRA stack.
Wan 2.2 video generation via Wan2GP — sustained compute and VRAM.

What you'll feel

The shift versus a 4090 is the size of model that fits before you have to start thinking about it. 24 GB forces decisions on quant tier and KV cache budget; 32 GB lets a 32B-class model run at Q8 with full context, no juggling. 70B-class still requires aggressive quantization (IQ3 / IQ2) on a single card; 32 GB does not change that.
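The quant-versus-KV-cache trade-off can be checked with back-of-envelope arithmetic before downloading anything. The sketch below is a rough estimate, not a measurement: the model shape, the 20 GiB weight size, and the 1.5 GiB runtime overhead are all placeholder assumptions, and card capacity is treated as GiB. Swap in your own model's layer count, KV head count, head dimension, and GGUF file size.

```python
GIB = 2**30

def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Size of the K and V caches across all layers at full context, in GiB."""
    # The factor of 2 covers the separate K and V tensors in each layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / GIB

def fits(vram_gib, weights_gib, kv_gib, overhead_gib=1.5):
    """True if weights + KV cache + a guessed runtime overhead fit in VRAM."""
    return weights_gib + kv_gib + overhead_gib <= vram_gib

# Hypothetical GQA model shape (an assumption, not one of the tested models).
kv = kv_cache_gib(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=32 * 1024)
print(f"KV cache at 32k context, fp16: {kv:.1f} GiB")

# Placeholder weight size; substitute the size of your own GGUF file.
weights = 20.0
for vram in (24, 32):
    print(f"{vram} GB card: fits = {fits(vram, weights, kv)}")
```

With these placeholder numbers the 24 GB card falls short and the 32 GB card has room to spare. Quantizing the KV cache (a smaller bytes_per_elem) or trimming context moves the line, which is the juggling described above.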

On pure throughput for short prompts, the gap is smaller than the spec sheet suggests. Both cards are memory-bandwidth-bound on most real workloads. The 5090 earns its premium where context grows long, batches fatten, or the model brushes the ceiling.
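A common rule of thumb for bandwidth-bound decoding is that each generated token has to read the full set of weights from VRAM once, so bandwidth divided by model size gives a ceiling on single-stream tokens per second. The sketch below applies that rule. The bandwidth figures come from the public spec sheets rather than from this review's testing, and the 20 GB model size is a placeholder. Real throughput lands below this ceiling.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed, in tokens per second.

    Assumes every generated token reads all weights from VRAM once and that
    nothing else (KV cache reads, kernel launch overhead) costs time.
    """
    return bandwidth_gb_s / weights_gb

# Spec-sheet memory bandwidth in GB/s (assumed, not measured in this review).
CARDS = {"RTX 4090": 1008.0, "RTX 5090": 1792.0}
weights_gb = 20.0  # placeholder model size, not a tested model

for name, bw in CARDS.items():
    print(f"{name}: at most {decode_ceiling_tps(bw, weights_gb):.0f} tok/s")
```

Treat the output as a sanity bound only. Prompt processing, batching, and long contexts shift the balance toward compute and KV cache traffic, which this estimate ignores.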

Setup notes (if you're upgrading)

CUDA 13.2 + cuDNN 9.20 is the current known-good combo. Don't mix it with a CUDA 12 install; dependency resolution gets ugly.
Skip xformers. Install SageAttention 2.2.0 instead: it doesn't force a torch downgrade, and perf is within noise of xformers.
cu128 or cu130 PyTorch builds are mandatory for sm_120.
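A quick way to confirm that an installed PyTorch build actually targets the card is to compare the device's compute capability against the architectures the build was compiled for. The sketch below uses torch.cuda.get_device_capability and torch.cuda.get_arch_list; the helper names are made up for illustration. It is a plain membership check, so a build that ships only PTX for the architecture would be reported as unsupported.

```python
def arch_tag(capability):
    """Turn a (major, minor) compute capability into a tag such as 'sm_120'."""
    major, minor = capability
    return f"sm_{major}{minor}"

def build_supports(arch_list, capability):
    """True if the PyTorch build lists kernels for this compute capability."""
    return arch_tag(capability) in arch_list

def check_installed_torch():
    """Report whether the local PyTorch build targets the first visible GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device visible to PyTorch"
    cap = torch.cuda.get_device_capability(0)
    ok = build_supports(torch.cuda.get_arch_list(), cap)
    return f"{arch_tag(cap)} supported by torch {torch.__version__}: {ok}"

if __name__ == "__main__":
    print(check_installed_torch())
```

On a 5090 the capability should come back as (12, 0), so the tag to look for is sm_120. If the check reports False, the installed wheel was most likely built against an older CUDA toolkit.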

Who should buy

32B-class local at Q8 with full context — that's the sweet spot.
70B-class with IQ3 quants and modest context, if a Mac Studio's first-token latency would kill the workflow.
Image / video generation pipelines that lived inside 24 GB but bumped the ceiling on long videos or large LoRA stacks.

Who should skip

Anyone for whom 24 GB has been enough. Bandwidth and thermals improve, but the model-class ceiling barely shifts unless you were already hitting it.
Cloud-only workflows. A 5090 is a workstation buy, not a datacenter play.

Bottom line

If you're buying a new workstation for local AI work in 2026, this is the default. 32 GB lifts the quant-vs-context squeeze by a tier: not enough to make 70B-class trivial, but enough that most workloads no longer need a VRAM budget.

RTX 5090 for local LLM inference: the new watermark