The Lattice.

The latest models, plotted in benchmark space.

Pick any benchmark for each of three axes. Drag to orbit. 14 cloud + 11 open-weight models, every number traced to its source.

Snapshot · verified 20h ago · 2026-06-16

[Interactive 3D plot. Axes as shown: X, higher is better; Y in tok/s, higher is better; Z in $/M, lower is better.]

The 3D lattice needs a desktop browser with WebGL. The full matrix is in the table below.

18 of 25 plotted · drag to orbit · scroll to zoom · click a node

Hover a node to read it; click to pin. Pick any benchmark for each of the three axes — the cube re-plots live.

Show: Alibaba (Qwen), Anthropic, DeepSeek, Google DeepMind, MiniMax, Moonshot AI, OpenAI, xAI, Xiaomi, Z.AI (Zhipu)

Reading the axes

Benchmarks lie if you don't read the fine print. Each axis, what it means, and where it bites:

Artificial Analysis Intelligence Index ↑ better: Composite 0–100 score averaging ten hard evaluations across agents, coding, general capability, and scientific reasoning. Versioned composite (v3, 10 evals). Only comparable within one index version, so pin the date.
MMLU-Pro ↑ better: Multiple-choice accuracy across 14 academic and professional domains. A harder successor to MMLU, with 10 answer options per question.
GPQA Diamond ↑ better: Accuracy on 198 graduate-level physics/chemistry/biology questions that PhD-level experts answer correctly only ~65% of the time. Nearing saturation (~94% at the top), which compresses the high end.
Humanity's Last Exam ↑ better: ~3,000 expert-crowdsourced, deliberately frontier-breaking questions across math, sciences, and the humanities. Heavily dependent on reasoning effort and tool use. Values here are no-tools, max-reasoning.
SWE-bench Verified ↑ better: Percent of 500 human-validated real GitHub issues resolved with a patch that passes the repo’s hidden tests. Swings 5–15 pts on the same model depending on the harness (single-shot vs. agentic loop). Compare like with like.
LiveCodeBench ↑ better: Pass rate on competitive-programming problems released after a model’s training cutoff, which keeps the coding test contamination-free.
AIME 2025 ↑ better: Fraction of the 30 problems from the 2025 American Invitational Mathematics Examination solved correctly. Saturated: multiple frontier models hit 100%, so it stops discriminating at the top.
LMArena Elo ↑ better: Bradley-Terry rating from millions of blind, pairwise human preference votes on head-to-head chat responses. Ratings are relative and re-anchor as new models enter. Turning style control on or off shifts rankings.
Output speed ↑ better: Median output throughput in tokens generated per second during a single request. A property of the provider endpoint, not the weights. The same model runs 5–20× faster on Cerebras/Groq.
Price (blended) ↓ better: Cost per million tokens, blended by Artificial Analysis at a 7:2:1 cache:input:output ratio. The blend ratio is an editorial choice, so reconcile against provider pages.
Context window ↑ better: Maximum input tokens a model can attend to in a single request, as advertised. Advertised ≠ effective. RULER-style tests put usable context at ~50–65% of the label.
VRAM at Q4 ↓ better: Approximate GPU memory to run the model locally at Q4_K_M quantization (weights + modest context). Open-weight only. Estimate ≈ params(B) × 0.55 + overhead.
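
Three of the caveats above are plain arithmetic, so a minimal Python sketch may help when reconciling numbers by hand. It assumes the 7:2:1 weights are normalized by their sum, treats the VRAM overhead as a free parameter (the page does not state a value, so the 2 GB default is a placeholder), and applies the ~50–65% usable-context band as a flat multiplier. The function names and the example prices are hypothetical, not taken from the page's code or from any provider.

```python
def blended_price(cache_price, input_price, output_price, weights=(7, 2, 1)):
    """Blend per-million-token prices at a cache:input:output ratio.

    Assumption: the blend is a weighted average, i.e. the weights are
    divided by their sum (10 for the 7:2:1 ratio quoted above).
    """
    w_cache, w_in, w_out = weights
    total = w_cache + w_in + w_out
    return (w_cache * cache_price + w_in * input_price + w_out * output_price) / total


def vram_q4_gb(params_b, overhead_gb=2.0):
    """Estimate VRAM in GB at Q4: params(B) x 0.55 + overhead.

    Assumption: overhead_gb=2.0 is only a placeholder default.
    """
    return params_b * 0.55 + overhead_gb


def usable_context(advertised_tokens, low=0.50, high=0.65):
    """Return the (low, high) usable-token band for an advertised window."""
    return round(advertised_tokens * low), round(advertised_tokens * high)


if __name__ == "__main__":
    # Hypothetical prices in $/M tokens, for illustration only.
    print(blended_price(cache_price=0.30, input_price=3.00, output_price=15.00))
    print(vram_q4_gb(params_b=20))
    print(usable_context(1_000_000))
```

Changing `weights` to another ratio shows how much of a blended price is an editorial choice rather than a property of the model.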

The full matrix · 25 models

Every model, every metric. Cells brighten toward the leader in each column; the leader is ringed. Each number links to its source.
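
One way to read "brighten toward the leader" is a per-column min-max scale that respects each column's direction. The sketch below assumes linear scaling and treats missing cells as `None`. The page does not say how brightness is actually computed, so this is an illustration, not the page's method. `brightness` and `leader_index` are hypothetical names.

```python
from typing import List, Optional, Sequence


def brightness(values: Sequence[Optional[float]],
               higher_is_better: bool = True) -> List[Optional[float]]:
    """Map a column to 0..1, where 1.0 is the leader. Missing cells stay None."""
    present = [v for v in values if v is not None]
    if not present:
        return [None] * len(values)
    lo, hi = min(present), max(present)
    span = hi - lo
    scaled: List[Optional[float]] = []
    for v in values:
        if v is None:
            scaled.append(None)
        elif span == 0:
            scaled.append(1.0)  # every published value ties for the lead
        else:
            frac = (v - lo) / span
            scaled.append(frac if higher_is_better else 1.0 - frac)
    return scaled


def leader_index(values: Sequence[Optional[float]],
                 higher_is_better: bool = True) -> Optional[int]:
    """Index of the first cell holding the best value, or None if all are missing."""
    present = [(v, i) for i, v in enumerate(values) if v is not None]
    if not present:
        return None
    pick = max if higher_is_better else min
    best = pick(v for v, _ in present)
    return next(i for v, i in present if v == best)


if __name__ == "__main__":
    # Price column for four rows of the table: Gemini 3.1 Pro, MiniMax-M3,
    # DeepSeek V4 Flash, Muse Spark (no published price).
    prices = [1.74, 0.22, 0.06, None]
    print(brightness(prices, higher_is_better=False))
    print(leader_index(prices, higher_is_better=False))
```

Passing `higher_is_better=False` for Price and VRAM keeps "brighter" meaning "better" in every column.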

Model	Type	AA Index	GPQA	SWE-bench	Arena	Speed	Price	Context	VRAM
Claude Fable 5	cloud	65	—	95%	1508	60 tok/s	$8.20/M	1M	—
Claude Opus 4.8	cloud	61	93.6%	88.6%	1482	58 tok/s	$4.10/M	1M	—
GPT-5.5	cloud	60	93.5%	88.7%	1482	62 tok/s	$4.35/M	922K	—
Gemini 3.1 Pro	cloud	57	94.1%	80.6%	1488	140 tok/s	$1.74/M	1M	—
Claude Opus 4.7	cloud	57	94.2%	87.6%	1501	48 tok/s	$4.10/M	1M	—
Qwen3.7 Max	cloud	57	92.4%	—	1460	105 tok/s	$1.43/M	1M	—
Gemini 3.5 Flash	cloud	55	—	—	1473	177 tok/s	$1.31/M	1M	—
MiniMax-M3	cloud	55	92.9%	—	—	41 tok/s	$0.22/M	1M	—
Kimi K2.6	open	54	90.5%	80.2%	1422	44 tok/s	$0.70/M	256K	—
MiMo-V2.5-Pro	open	54	—	—	1425	44 tok/s	$0.18/M	1M	—
GPT-5.3 Codex	cloud	54	—	—	—	86 tok/s	$1.87/M	400K	—
Qwen3.7 Plus	cloud	53	—	—	—	53 tok/s	$0.25/M	1M	—
Grok 4.3	cloud	53	—	—	—	198 tok/s	$0.64/M	1M	—
Claude Sonnet 4.6	cloud	52	—	80.2%	1442	43 tok/s	$2.46/M	1M	—
DeepSeek V4 Pro	open	52	90.1%	80.6%	—	53 tok/s	$0.18/M	1M	517 GB
Muse Spark	cloud	52	—	—	1489	—	—	262K	—
GLM-5.1	open	51	86.2%	—	1467	63 tok/s	$0.90/M	200K	—
GPT-5.4 mini	cloud	49	—	—	—	165 tok/s	$0.65/M	400K	—
DeepSeek V4 Flash	open	47	88.1%	79%	—	122 tok/s	$0.06/M	1M	95 GB
Gemma 3 27B	open	—	41.4%	—	—	—	—	131K	16 GB
gpt-oss-120b	open	—	80.1%	—	—	—	—	131K	63 GB
gpt-oss-20b	open	—	71.5%	—	—	—	—	131K	13 GB
Mistral Small 3.2 24B	open	—	46.1%	—	—	—	—	131K	14 GB
Phi-4	open	—	56.1%	—	—	—	—	16K	9 GB
Qwen3-32B	open	—	—	—	—	—	—	41K	20 GB

Method

A hand-curated snapshot, verified 2026-06-06. Standardized cross-model metrics (Intelligence Index, output speed, blended price, context) come from Artificial Analysis; human-preference Elo from LMArena; SWE-bench Verified, GPQA, HLE, MMLU-Pro and LiveCodeBench from vendor model cards and announcement posts; open-weight parameter counts from the HuggingFace safetensors index.

Every number on this page carries a source link — hover a node, open the detail card, or click any cell. Where a model hasn't published a number, the cell reads "—" rather than a guess. The model writes none of these figures; the catalog does.
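
Anyone copying the matrix into their own analysis needs to carry the "—" convention through, so that an unpublished number stays missing instead of turning into zero. The sketch below is one hedged way to parse a single cell. `parse_cell` is a hypothetical helper, and reading K and M as decimal thousands and millions is an assumption.

```python
import re
from typing import Optional

MISSING = "—"  # the page's marker for "no published number"
_SCALE = {"K": 1_000, "M": 1_000_000}  # assumption: decimal, not 1024-based


def parse_cell(cell: str) -> Optional[float]:
    """Turn a matrix cell such as '95%', '$8.20/M', '60 tok/s' or '922K' into a float.

    Returns None for the missing marker so that gaps are never read as zero.
    """
    text = cell.strip()
    if not text or text == MISSING:
        return None
    # Context sizes are a bare number followed by K or M.
    sized = re.fullmatch(r"(\d+(?:\.\d+)?)([KM])", text)
    if sized:
        return float(sized.group(1)) * _SCALE[sized.group(2)]
    # Everything else: optional '$', a number, then a unit we can ignore.
    plain = re.match(r"\$?(\d+(?:\.\d+)?)", text)
    if plain is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return float(plain.group(1))


if __name__ == "__main__":
    for raw in ["65", "—", "95%", "1508", "60 tok/s", "$8.20/M", "1M", "517 GB"]:
        print(repr(raw), "->", parse_cell(raw))
```

Returning `None` rather than `0.0` means a downstream average or ranking has to decide explicitly what to do about models with no published figure.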