NVIDIA · gpu

Verdict · buy

RTX 5090 for local LLM inference: the new high-water mark

32 GB of VRAM and Blackwell sm_120: enough to run 32B-class models at high quants without paging, or 70B at IQ3 with care. Worth the jump from a 4090 if you live in llama.cpp.

Product
NVIDIA GeForce RTX 5090
Published
2026-04-24
Price
$1,999
Score
9 / 10
Stylized line drawing of the MSI GeForce RTX 5090 Suprim Liquid

Pros

  • 32 GB clears 32B-class models at Q8 with full context, or 70B at IQ3 quants with a modest KV cache
  • Memory bandwidth shows up on long-context decode, not just headline FLOPS
  • sm_120 enables newer CUDA paths that Ada cards leave on the table

Cons

  • Needs CUDA 12.8 or newer (13.x preferred) + cuDNN 9; older pipelines complain until you update
  • xformers on PyPI still forces a PyTorch downgrade; SageAttention 2.2.0 is the workaround
  • 575 W TGP, 1000 W recommended PSU; plan thermals before the PO

Verified numbers

verified 2026-05-01

  • VRAM: 32 GB
  • TDP: 575 W
  • MSRP: $1,999
  • Context window: 32k tokens
  • Compute capability: sm_120
  • Compute capability digits: 120

What we tested

We ran a 5090 in a single-GPU workstation doing local inference on llama.cpp, vLLM, and ComfyUI. The workloads were picked to reflect what a practitioner actually does, not what a leaderboard rewards:

  • 32B-class at Q8 on llama.cpp with 32k context, streaming.
  • 70B-class at IQ3_M on llama.cpp with 8k context, streaming. It fits, but VRAM is tight.
  • SDXL + Flux image generation through ComfyUI with a LoRA stack.
  • Wan 2.2 video generation via Wan2GP, to stress sustained compute and VRAM.

What you'll feel

The shift versus a 4090 is the size of model that fits before you have to start thinking about it. 24 GB forces decisions on quant tier and KV cache budget; 32 GB lets a 32B-class model run at Q8 with full context, no juggling required. 70B-class still requires aggressive quantization (IQ3 / IQ2) on a single card; 32 GB does not change that.
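The quant-tier versus KV-cache trade-off can be sketched with back-of-envelope arithmetic. This is a rough estimator, not a measurement from this review. The 70B-shaped layer and head counts, the 3.5 bits-per-weight stand-in for an IQ3-tier quant, and the 1 GiB overhead allowance are all illustrative assumptions. Read real values from the model's config, and prefer the GGUF file's actual size for the weight term.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per GiB


@dataclass(frozen=True)
class ModelShape:
    """Architecture numbers; read real values from the model's config or GGUF metadata."""
    n_params: float   # total parameter count
    n_layers: int     # transformer blocks
    n_kv_heads: int   # key/value heads (fewer than query heads under GQA)
    head_dim: int     # dimension per attention head


def weight_bytes(shape: ModelShape, bits_per_weight: float) -> float:
    """Approximate weight footprint at an average bits-per-weight for the quant."""
    return shape.n_params * bits_per_weight / 8


def kv_cache_bytes(shape: ModelShape, n_tokens: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    return 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_elem * n_tokens


def vram_needed_gib(shape: ModelShape, bits_per_weight: float, n_tokens: int,
                    overhead_gib: float = 1.0, kv_bytes_per_elem: float = 2.0) -> float:
    """Weights + KV cache + a guessed allowance for buffers and the CUDA context."""
    total = weight_bytes(shape, bits_per_weight) + kv_cache_bytes(shape, n_tokens, kv_bytes_per_elem)
    return total / GIB + overhead_gib


# Hypothetical 70B-shaped dense model with grouped-query attention (assumed numbers).
example = ModelShape(n_params=70e9, n_layers=80, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (4096, 8192, 16384):
        need = vram_needed_gib(example, bits_per_weight=3.5, n_tokens=ctx)
        print(f"ctx={ctx:>6}: ~{need:.1f} GiB needed vs 32 GiB on the card")
```

The KV term grows linearly with context, so at the 70B tier a longer window tends to cost a quant step. Passing a smaller `kv_bytes_per_elem` approximates a quantized KV cache.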

On pure throughput for short prompts, the gap is smaller than the headline compute figures suggest, because both cards are memory-bandwidth-bound on most real workloads. The 5090 earns its premium where context grows long, batches fatten, or the model brushes the ceiling.
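A minimal sketch of why bandwidth, not FLOPS, caps decode speed: in single-stream generation each new token has to stream the weights and the KV cache from VRAM once, so bandwidth divided by bytes read gives a ceiling on tokens per second. The bandwidth values below are NVIDIA's published spec figures (1,792 GB/s for the 5090, 1,008 GB/s for the 4090), not measurements from this review. The 20 GB model size is a hypothetical placeholder.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float, kv_gb: float = 0.0) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model.

    Assumes every generated token reads all weights plus the current KV cache
    exactly once; real throughput lands below this ceiling.
    """
    bytes_per_token = (weight_gb + kv_gb) * 1e9
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Spec-sheet bandwidths; the model and KV sizes are illustrative assumptions.
    for name, bw in (("RTX 5090", 1792.0), ("RTX 4090", 1008.0)):
        short = decode_ceiling_tok_s(bw, weight_gb=20.0)
        long_ctx = decode_ceiling_tok_s(bw, weight_gb=20.0, kv_gb=4.0)
        print(f"{name}: ceiling ~{short:.0f} tok/s short context, ~{long_ctx:.0f} tok/s with 4 GB of KV")
```

The ceiling ignores kernel launch overhead, sampling, and prompt processing, which is one reason measured gaps on short prompts can come in under the raw bandwidth ratio.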

Setup notes (if you're upgrading)

  • CUDA 13.2 + cuDNN 9.20 is the current known-good combo. Don't mix it with a CUDA 12 install; dependency resolution gets ugly.
  • Skip xformers. Install SageAttention 2.2.0 — it doesn't force a torch downgrade and perf is within noise.
  • cu128 or cu130 PyTorch builds are mandatory for sm_120; wheels built against older CUDA ship no kernels for it.
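One way to sanity-check the notes above after an install is to ask PyTorch what it was built for. The sketch below keeps the logic in a pure function so it can be tried without a GPU. `check_blackwell_ready` is a hypothetical helper name, and the 12.8 floor is taken from the cu128 requirement above. The `torch` calls it uses (`torch.version.cuda`, `torch.cuda.get_arch_list()`, `torch.cuda.get_device_capability()`) are standard PyTorch APIs.

```python
from typing import List, Optional, Sequence, Tuple


def check_blackwell_ready(cuda_version: Optional[str],
                          arch_list: Sequence[str],
                          capability: Tuple[int, int]) -> List[str]:
    """Return a list of problems; an empty list means the build looks compatible."""
    problems: List[str] = []
    if cuda_version is None:
        problems.append("CPU-only PyTorch build: install a CUDA wheel (cu128 or cu130)")
    else:
        parts = cuda_version.split(".")
        major = int(parts[0])
        minor = int(parts[1]) if len(parts) > 1 else 0
        if (major, minor) < (12, 8):
            problems.append(f"PyTorch built against CUDA {cuda_version}; need 12.8 or newer")
    wanted = f"sm_{capability[0]}{capability[1]}"  # (12, 0) -> "sm_120"
    if wanted not in arch_list:
        problems.append(f"{wanted} missing from compiled architectures: {list(arch_list)}")
    return problems


def main() -> None:
    try:
        import torch  # imported lazily so the checker itself has no hard dependency
    except ImportError:
        print("PyTorch is not installed")
        return
    if not torch.cuda.is_available():
        print("No CUDA device visible to PyTorch")
        return
    problems = check_blackwell_ready(torch.version.cuda,
                                     torch.cuda.get_arch_list(),
                                     torch.cuda.get_device_capability(0))
    print("\n".join(problems) if problems else "PyTorch build looks compatible with this GPU")


if __name__ == "__main__":
    main()
```

An empty result only says the wheel was compiled for the card. It does not validate cuDNN or any attention backend layered on top.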

Who should buy

  • 32B-class local at Q8 with full context — that's the sweet spot.
  • 70B-class with IQ3 quants and modest context, if a Mac Studio's first-token latency would kill the workflow.
  • Image / video generation pipelines that lived inside 24 GB but bumped the ceiling on long videos or large LoRA stacks.

Who should skip

  • Anyone for whom 24 GB has been enough. Bandwidth and thermals improve, but the model-class ceiling barely shifts unless you were already at it.
  • Cloud-only workflows. A 5090 is a workstation buy, not a datacenter play.

Bottom line

If you're buying a new workstation for local AI work in 2026, this is the default. 32 GB relaxes the quant-vs-context squeeze by a tier: not enough to make 70B-class trivial, but enough that you stop budgeting VRAM on most workloads.