The job
You generate images and short videos locally. Flux for stills you actually ship. SDXL when speed matters more than fidelity. Wan 2.2 when the brief calls for motion. You iterate in dozens-to-hundreds per session, not ones, and you're tired of waiting on a shared cloud queue at 9pm. You have ~$4k and you'd like to spend it once.
The shape of this workload is different from an LLM rig. VRAM ceilings are softer: Flux fits in 24 GB, SDXL fits in 12 GB, and Wan 2.2 scales with what you give it. What hurts is everything around the model: the checkpoint stack, the LoRA library, the VAE intermediates, and the sustained 100% GPU draw across a multi-hour session.
This guide is not for you if:
- LLM inference is the primary load. Different math, different rig.
- You need real-time video. Wan 2.2 is minutes per clip.
- You're training a foundation model. This is an inference + small-LoRA box.
The build
| Part | Pick | Why |
|---|---|---|
| GPU | NVIDIA RTX 5090 (32 GB) | 32 GB GDDR7, 1,792 GB/s memory bandwidth. Flux + LoRA + ControlNet stack fits with room to spare. |
| CPU | AMD Ryzen 9 9950X | 16 cores soak VAE decode, image preprocessing, and ffmpeg encode without choking the GPU pipeline. |
| RAM | 64 GB DDR5-6000 (2x32) | VAE tiles, model swaps, and Wan 2.2 intermediates spill into system RAM. 32 GB runs out the moment you queue a batch. |
| Storage | 4 TB Samsung 990 Pro NVMe (Gen 4) | A working checkpoint + LoRA library is 1-2 TB before you notice. Cold-loading models from a slow disk wastes session time. |
| PSU | 1000 W 80+ Gold | RTX 5090 draws 575 W TGP. Headroom for transient spikes and a sustained-load duty cycle. |
| Case | Fractal Define 7 / Lian Li O11D EVO | Three intake fans minimum. Sustained compute is the workload — burst-tuned cases thermal-throttle by hour two. |
| OS | Windows 11 + WSL2, or Ubuntu 24.04 | ComfyUI, Wan2GP, Forge all run on either. Pick what your toolchain already targets. |
Approximate total: $3,800. GPU is $1,999 of that.
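The PSU row can be sanity-checked with a rough power budget. The sketch below is a back-of-envelope estimate, not a measurement. The 575 W GPU figure and the 1000 W PSU rating come from the table; the 170 W CPU figure (AMD's listed TDP for the 9950X, which all-core package power can exceed) and the 75 W allowance for board, RAM, NVMe, and fans are assumptions you should replace with your own numbers.

```python
# Rough sustained-load power budget for the build in the table.
GPU_TGP_W = 575         # from the table
CPU_TDP_W = 170         # assumption: listed TDP; real package power can run higher
REST_OF_SYSTEM_W = 75   # assumption: motherboard, RAM, NVMe, fans
PSU_RATED_W = 1000      # from the table


def psu_headroom(psu_w, *loads_w):
    """Return (total draw, watts left over, fraction of PSU rating in use)."""
    total = sum(loads_w)
    return total, psu_w - total, total / psu_w


total, headroom, load_fraction = psu_headroom(
    PSU_RATED_W, GPU_TGP_W, CPU_TDP_W, REST_OF_SYSTEM_W
)
print(f"Estimated draw: {total} W, headroom: {headroom} W "
      f"({load_fraction:.0%} of PSU rating)")
```

Under these assumptions the estimate is 820 W with 180 W to spare. Short transient spikes are not modeled here, so treat the headroom figure as a floor for planning rather than a guarantee.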
Numbers
- SDXL, 1024x1024: 4-7 seconds per image.
- Flux.1 dev, 1024x1024: 12-20 seconds per image.
- Wan 2.2, short clip: minutes per clip; varies widely with length, resolution, and steps.
- SDXL character LoRA training: under an hour on a small dataset.
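To see what the per-image times mean for a session measured in dozens to hundreds of images, it helps to convert them to images per hour. This is a minimal sketch that assumes back-to-back generation with no time lost to model loads, swaps, or queue gaps, so real sessions will land below these figures. The function name and dictionary are illustrative only.

```python
def images_per_hour(fast_sec, slow_sec):
    """Convert a seconds-per-image range into a (low, high) images-per-hour range."""
    return 3600 / slow_sec, 3600 / fast_sec


# Seconds-per-image ranges taken from the list above.
RANGES = {
    "SDXL 1024x1024": (4, 7),
    "Flux 1024x1024": (12, 20),
}

for name, (fast, slow) in RANGES.items():
    low, high = images_per_hour(fast, slow)
    print(f"{name}: {low:.0f}-{high:.0f} images/hour")
# SDXL 1024x1024: 514-900 images/hour
# Flux 1024x1024: 180-300 images/hour
```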
Tradeoffs
- Drop to a 4090 (24 GB), spend the savings on storage. You lose the 32 GB Flux-plus-everything-loaded headroom and the GDDR7 bandwidth, but you keep most of the throughput. Reasonable if you found a deal.
- Drop the GPU to a 5080 (16 GB). Don't. SDXL is fine; Flux gets tight; Wan 2.2 starts forcing offloads. Rigs you have to fight aren't fun rigs.
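- Add a second 5090 later. Stock ComfyUI runs a workflow on one GPU per process, so the clean way to use two cards is one instance per GPU with the batch queue split between them. That doubles batch throughput; it does not make a single generation faster. Two 5090s draw 1,150 W on their own, so plan for a larger PSU than the 1000 W unit above and keep a free PCIe slot if this is the plan.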
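One common way to put two cards to work on batch jobs is to run one worker process per GPU and deal the queue out between them. The sketch below shows only that split. It assumes each worker is a separate process pinned to a card through the `CUDA_VISIBLE_DEVICES` environment variable; the prompt list and both function names are hypothetical, and launching the actual generation process is left out.

```python
import os


def split_round_robin(jobs, num_gpus):
    """Deal jobs out like cards: job i goes to GPU (i % num_gpus)."""
    buckets = [[] for _ in range(num_gpus)]
    for i, job in enumerate(jobs):
        buckets[i % num_gpus].append(job)
    return buckets


def worker_env(gpu_index):
    """Copy the current environment and restrict the worker to one GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


# Hypothetical queue of five jobs split across two cards.
prompts = [f"prompt-{n}" for n in range(5)]
for gpu, bucket in enumerate(split_round_robin(prompts, 2)):
    print(gpu, bucket)
# 0 ['prompt-0', 'prompt-2', 'prompt-4']
# 1 ['prompt-1', 'prompt-3']
```

The dictionary from `worker_env` can be passed as the `env` argument to `subprocess.Popen` when starting each worker. Round-robin is the simplest split; if job lengths vary a lot, a shared queue that idle workers pull from will balance better.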
What this doesn't get you
- Real-time video generation. Wan 2.2 is minutes per clip, not frames per second.
- Training a foundation model. This is an inference + small-LoRA rig, not an H100 substitute.
- A quiet room. 575 W of sustained GPU draw is going to be audible.