The Lattice.
The latest models, plotted in benchmark space.
Pick any benchmark for each of three axes. Drag to orbit. 14 cloud + 11 open-weight models, every number traced to its source.
Snapshot · verified 20h ago · 2026-06-16
The 3D lattice needs a desktop browser with WebGL. The full matrix is in the table below.
Reading the axes
Benchmarks lie if you don't read the fine print. Each axis, what it means, and where it bites:
- Artificial Analysis Intelligence Index ↑ better
- Composite 0–100 score averaging ten hard evaluations across agents, coding, general capability, and scientific reasoning.
- Versioned composite (v3, 10 evals). Only comparable within one index version — pin the date.
- MMLU-Pro ↑ better
- Multiple-choice accuracy across 14 academic and professional domains — a harder MMLU successor with 10 answer options.
- GPQA Diamond ↑ better
- Accuracy on 198 graduate-level physics, chemistry, and biology questions that PhD experts answer correctly ~65% of the time.
- Nearing saturation (~94% at the top), compressing the high end.
- Humanity's Last Exam ↑ better
- ~3,000 expert-crowdsourced, deliberately frontier-breaking questions across math, sciences, and the humanities.
- Heavily dependent on reasoning effort and tool use. Values shown are without tools, at maximum reasoning effort.
- SWE-bench Verified ↑ better
- Percent of 500 human-validated real GitHub issues resolved with a patch that passes the repo’s hidden tests.
- Scores for the same model swing 5–15 points depending on the harness (single-shot vs. agentic loop). Compare like with like.
- LiveCodeBench ↑ better
- Pass rate on competitive-programming problems released after a model’s training cutoff — contamination-free coding.
- AIME 2025 ↑ better
- Fraction of the 30 problems from the 2025 American Invitational Mathematics Examination solved correctly.
- Saturated — multiple frontier models hit 100%. Stops discriminating at the top.
- LMArena Elo ↑ better
- Bradley-Terry rating from millions of blind, pairwise human preference votes on head-to-head chat responses.
- Ratings are relative and re-anchor as new models enter. Toggling style control shifts rankings.
- Output speed ↑ better
- Median output throughput in tokens generated per second during a single request.
- A property of the provider endpoint, not the weights. The same model runs 5–20× faster on Cerebras/Groq.
- Price (blended) ↓ better
- Cost per million tokens, blended by Artificial Analysis at a 7:2:1 cache:input:output ratio.
- Lower is better. The blend ratio is an editorial choice — reconcile against provider pages.
- Context window ↑ better
- Maximum input tokens a model can attend to in a single request, as advertised.
- Advertised ≠ effective. RULER-style tests put usable context at ~50–65% of the label.
- VRAM at Q4 ↓ better
- Approximate GPU memory to run the model locally at Q4_K_M quantization (weights + modest context).
- Lower is better. Open-weight only. Estimate in GB ≈ parameters (in billions) × 0.55 + overhead.
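Three of the caveats above reduce to simple arithmetic: the Q4 VRAM estimate, the 7:2:1 blended price, and the usable share of an advertised context window. The Python sketch below is illustrative only. The function names are hypothetical, the 2 GB default overhead is a placeholder assumption (the page does not give an overhead figure), and the 50–65% band is taken directly from the context-window note.

```python
from typing import Tuple


def vram_q4_gb(params_b: float, overhead_gb: float = 2.0) -> float:
    """Rough VRAM in GB at Q4: params (billions) x 0.55 + overhead.

    overhead_gb is an assumed placeholder, not a figure from this page.
    """
    return params_b * 0.55 + overhead_gb


def blended_price(cache: float, inp: float, out: float,
                  weights: Tuple[int, int, int] = (7, 2, 1)) -> float:
    """Weighted average price per million tokens.

    weights follow the cache:input:output order of the 7:2:1 blend.
    """
    w_cache, w_in, w_out = weights
    total = w_cache + w_in + w_out
    return (w_cache * cache + w_in * inp + w_out * out) / total


def usable_context(advertised_tokens: int,
                   low: float = 0.50, high: float = 0.65) -> Tuple[int, int]:
    """Usable-context band as (low, high) token counts for an advertised window."""
    return round(advertised_tokens * low), round(advertised_tokens * high)


if __name__ == "__main__":
    # Illustrative inputs only; none of these are figures from the matrix.
    print(vram_q4_gb(70))
    print(blended_price(cache=0.3, inp=3.0, out=15.0))
    print(usable_context(1_000_000))
```

Changing the `weights` argument shows how much a blended price depends on the ratio chosen, which is why the note recommends reconciling against provider pages.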
The full matrix · 25 models
Every model, every metric. Cells brighten toward the leader in each column; the leader is ringed. Each number links to its source.
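One way to get cells that brighten toward the leader is a per-column min-max scale that respects each metric's direction and skips unpublished cells. The sketch below assumes linear scaling. The page does not say how brightness is actually computed, and `column_brightness` and `leader_index` are hypothetical names.

```python
from typing import List, Optional, Sequence


def column_brightness(values: Sequence[Optional[float]],
                      higher_is_better: bool = True) -> List[Optional[float]]:
    """Map one column to brightness in [0, 1], where 1.0 marks the leader.

    None marks an unpublished cell and stays None (rendered as a dash).
    """
    present = [v for v in values if v is not None]
    if not present:
        return [None] * len(values)
    lo, hi = min(present), max(present)
    span = hi - lo
    out: List[Optional[float]] = []
    for v in values:
        if v is None:
            out.append(None)
        elif span == 0:
            out.append(1.0)  # every published value ties for the lead
        else:
            t = (v - lo) / span
            out.append(t if higher_is_better else 1.0 - t)
    return out


def leader_index(values: Sequence[Optional[float]],
                 higher_is_better: bool = True) -> Optional[int]:
    """Index of the cell to ring, or None if the column has no published values."""
    candidates = [(v, i) for i, v in enumerate(values) if v is not None]
    if not candidates:
        return None
    pick = max if higher_is_better else min
    return pick(candidates, key=lambda pair: pair[0])[1]
```

Columns marked "↓ better", such as blended price and VRAM at Q4, would pass `higher_is_better=False` so the cheapest or smallest entry gets the brightest cell.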
Method
A hand-curated snapshot, verified 2026-06-06. Standardized cross-model metrics (Intelligence Index, output speed, blended price, context) come from Artificial Analysis; human-preference Elo from LMArena; SWE-bench Verified, GPQA, HLE, MMLU-Pro and LiveCodeBench from vendor model cards and announcement posts; open-weight parameter counts from the Hugging Face safetensors index.
Every number on this page carries a source link — hover a node, open the detail card, or click any cell. Where a model hasn't published a number, the cell reads "—" rather than a guess. The model writes none of these figures; the catalog does.