NVIDIA · workstation

Verdict · buy-if

DGX Spark: NVIDIA in a desk-sized box

GB10 Grace Blackwell in a 1.2 kg chassis with 128 GB unified memory. Capacity wins, bandwidth loses. A development box, not a throughput box.

Product
NVIDIA DGX Spark
Published
2026-05-01
Price
$4,699
Score
7 / 10

Pros

  • 128 GB coherent unified memory fits 70B-class at Q8 with room for KV cache
  • 150 x 150 x 50.5 mm, 1.2 kg, 240 W PSU — sits on a desk, runs off a normal outlet
  • Native CUDA on the same software stack as the rest of the NVIDIA fleet

Cons

  • 273 GB/s memory bandwidth bottlenecks long-context decode versus HBM parts
  • $4,699 buys a 5090 tower with more raw FLOPS for the same power envelope
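
The bandwidth complaint is easy to sanity-check with napkin math. A minimal sketch, assuming Q8 weights at roughly one byte per parameter and that decode streams every weight byte per generated token (it ignores KV-cache reads and compute overlap, so real numbers land below these ceilings):

```python
def decode_ceiling_tok_s(params_b: float, bytes_per_param: float, bw_gb_s: float) -> float:
    # Weights in GB (billions of params x bytes per param); decode must
    # stream all of them per token, so bandwidth / size bounds tokens/s.
    model_gb = params_b * bytes_per_param
    return bw_gb_s / model_gb

spark = decode_ceiling_tok_s(70, 1.0, 273)  # DGX Spark's LPDDR5X pool
ultra = decode_ceiling_tok_s(70, 1.0, 800)  # M3 Ultra, same 70B @ Q8
print(f"ceilings: Spark {spark:.1f} tok/s, M3 Ultra {ultra:.1f} tok/s")
# -> ceilings: Spark 3.9 tok/s, M3 Ultra 11.4 tok/s
```

Batching and speculative decoding shift the picture, but for single-stream chat the ratio of memory bandwidths is roughly the ratio of per-token latencies.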
What we tested

Stock DGX Spark unit, 4 TB NVMe, on the bench against an RTX 5090 desktop and a Mac Studio M3 Ultra (256 GB) for the capacity-class comparison. Inference via llama.cpp and vLLM. We did not benchmark the ConnectX-7 NIC in a two-box cluster — that's the marquee multi-Spark workflow and we'll cover it separately.

What you'll feel

A 70B-class model at Q8 loads and runs. That alone is the headline: 128 GB of coherent LPDDR5X means you stop thinking about quant tier and start thinking about context length. The 5090's 32 GB forces IQ3 or IQ2 on the same model class, every time.

Then you watch the tokens come out and remember the other number. 273 GB/s is the bandwidth ceiling, and decode is bandwidth-bound. Per-token latency on a 70B at long context is slower than a quantized fit on a 5090, and noticeably slower than an M3 Ultra running its 800 GB/s pool. Capacity beats compression; bandwidth beats both.

Multi-model concurrent serving is where the unified pool earns its keep. Two 13B-class models plus an embedding model resident at once, no swap pressure, no pinned-memory choreography. Small-model LoRA fine-tunes complete on overnight runs — tenable, not fast.

The 1 PFLOPS FP4 sparse number is the marketing peak. We did not see it.

Setup notes

Out of the box on DGX OS, then standard CUDA tooling. The 140 W chip TDP and 240 W PSU mean fan noise stays below "in the same room as a phone call" under sustained inference. Wi-Fi 7 and 10 GbE on the same chassis is unusual at this size — the 10 GbE saves a Thunderbolt-NAS workaround if you're shuttling weights.

Who should buy

  • Solo researchers and small teams who need a development box for 70B-class work without leasing cloud hours, and who value the desk-sized form factor over absolute throughput.
  • People prototyping multi-agent systems where two or three resident models is the workflow.
  • Anyone who has to demo native CUDA off a plane.

Who should skip

  • Anyone whose workload is one model, one user, throughput-bound. A 5090 tower at the same money decodes faster on what fits.
  • Anyone weighing a used A6000 build: it trades MSRP for HBM-class bandwidth — better value for single-model throughput.
  • Datacenter buyers stopped reading three paragraphs ago.

Bottom line

DGX Spark is a development workstation that happens to fit a 70B model. Buy it for the form factor and the unified pool. Don't buy it expecting H-class bandwidth — the chassis is 1.2 kg for a reason.
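
A closing footnote on the "room for KV cache" claim in the verdict card. A rough sizing sketch, assuming a hypothetical Llama-70B-like shape (80 layers, 8 grouped KV heads of head dim 128, fp16 cache values); these are illustrative architecture numbers, not anything measured on the Spark:

```python
def kv_cache_gib(tokens: int, layers: int = 80, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_val: int = 2) -> float:
    # Per token the cache stores a K and a V vector per layer per KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
    return tokens * per_token / 2**30

print(f"{kv_cache_gib(131072):.1f} GiB of KV at 128k context")
# -> 40.0 GiB of KV at 128k context
```

Roughly 70 GB of Q8 weights plus 40 GiB of cache still fits under the 128 GB pool, which is why "stop thinking about quant tier, start thinking about context length" holds on this box.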