
The Lattice.

The latest models, plotted in benchmark space.

Pick any benchmark for each of three axes. Drag to orbit. 14 cloud + 11 open-weight models, every number traced to its source.

Snapshot · verified 20h ago · 2026-06-16

X axis: ↑ higher is better
Y axis: ↑ higher is better · tok/s
Z axis: ↓ lower is better · $/M

The 3D lattice needs a desktop browser with WebGL. The full matrix is in the table below.

18 of 25 plotted · drag to orbit · scroll to zoom · click a node
Hover a node to read it; click to pin. Pick any benchmark for each of the three axes — the cube re-plots live.
Show: Alibaba (Qwen) · Anthropic · DeepSeek · Google DeepMind · MiniMax · Moonshot AI · OpenAI · xAI · Xiaomi · Z.AI (Zhipu)

Reading the axes

Benchmarks lie if you don't read the fine print. Each axis, what it means, and where it bites:

Artificial Analysis Intelligence Index ↑ better
Composite 0–100 score averaging ten hard evaluations across agents, coding, general capability, and scientific reasoning.
Versioned composite (v3, 10 evals). Only comparable within one index version — pin the date.
MMLU-Pro ↑ better
Multiple-choice accuracy across 14 academic and professional domains — a harder MMLU successor with 10 answer options.
GPQA Diamond ↑ better
Accuracy on 198 graduate-level physics, chemistry, and biology questions that PhD experts answer correctly ~65% of the time.
Nearing saturation (~94% at the top), compressing the high end.
Humanity's Last Exam ↑ better
~3,000 expert-crowdsourced, deliberately frontier-breaking questions across math, sciences, and the humanities.
Scores depend heavily on reasoning effort and tool use. Values shown are no-tools, max-reasoning.
SWE-bench Verified ↑ better
Percent of 500 human-validated real GitHub issues resolved with a patch that passes the repo’s hidden tests.
The same model swings 5–15 points depending on the harness (single-shot vs. agentic loop). Compare like with like.
LiveCodeBench ↑ better
Pass rate on competitive-programming problems released after a model’s training cutoff, so the problems cannot have appeared in its training data.
AIME 2025 ↑ better
Fraction of the 30 problems from the 2025 American Invitational Mathematics Examination solved correctly.
Saturated — multiple frontier models hit 100%. Stops discriminating at the top.
LMArena Elo ↑ better
Bradley-Terry rating from millions of blind, pairwise human preference votes on head-to-head chat responses.
Ratings are relative and re-anchor as new models enter. Turning style control on or off shifts rankings.
Output speed ↑ better
Median output throughput in tokens generated per second during a single request.
A property of the provider endpoint, not the weights. The same model runs 5–20× faster on Cerebras/Groq than on other endpoints.
Price (blended) ↓ better
Cost per million tokens, blended by Artificial Analysis at a 7:2:1 cache:input:output ratio.
Lower is better. The blend ratio is an editorial choice — reconcile against provider pages.
Context window ↑ better
Maximum input tokens a model can attend to in a single request, as advertised.
Advertised ≠ effective. RULER-style tests put usable context at ~50–65% of the label.
VRAM at Q4 ↓ better
Approximate GPU memory to run the model locally at Q4_K_M quantization (weights + modest context).
Lower is better. Open-weight only. Estimate (GB) ≈ parameters in billions × 0.55 + overhead.
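
The arithmetic behind three of the caveats above (blended price, VRAM at Q4, and usable context) can be sketched in a few lines of Python. This is a minimal illustration, not the catalog's own code: the function names and the 2 GB default overhead are assumptions, and the page itself gives only the 7:2:1 ratio, the 0.55 multiplier, and the ~50–65% range.

```python
def blended_price(cache: float, inp: float, out: float,
                  weights: tuple[float, float, float] = (7, 2, 1)) -> float:
    """Weighted mean of $/M-token prices at a cache:input:output ratio."""
    w_cache, w_in, w_out = weights
    total = w_cache + w_in + w_out
    return (w_cache * cache + w_in * inp + w_out * out) / total


def vram_q4_gb(params_b: float, overhead_gb: float = 2.0) -> float:
    """Rough GB of GPU memory at Q4_K_M: params(B) x 0.55 + overhead.

    overhead_gb is an assumed placeholder; the page does not give a value.
    """
    return params_b * 0.55 + overhead_gb


def usable_context(advertised_tokens: int) -> tuple[int, int]:
    """Low and high estimates of usable tokens at 50% and 65% of the label."""
    # Integer arithmetic avoids floating-point rounding on large windows.
    return advertised_tokens * 50 // 100, advertised_tokens * 65 // 100


if __name__ == "__main__":
    # Hypothetical prices in $/M tokens, for illustration only.
    print(blended_price(cache=0.5, inp=5.0, out=15.0))
    print(vram_q4_gb(70))
    print(usable_context(128_000))
```

Changing the `weights` tuple shows how much the blended figure moves with the ratio, which is why the blend is worth reconciling against provider pages.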

The full matrix · 25 models

Every model, every metric. Cells brighten toward the leader in each column; the leader is ringed. Each number links to its source.
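
One way to produce this kind of column shading is direction-aware min-max scaling, sketched below in Python. The page does not say how brightness is actually computed, so the scaling rule, the function names, and the sample values are all assumptions; missing cells are passed through as `None`.

```python
from typing import Optional


def column_brightness(values: list[Optional[float]],
                      higher_is_better: bool = True) -> list[Optional[float]]:
    """Scale one column to 0..1, where 1.0 is the leader. None stays None."""
    present = [v for v in values if v is not None]
    if not present:
        return [None] * len(values)
    lo, hi = min(present), max(present)
    span = hi - lo
    scaled: list[Optional[float]] = []
    for v in values:
        if v is None:
            scaled.append(None)      # unpublished number: no shading
        elif span == 0:
            scaled.append(1.0)       # every model ties for the lead
        else:
            score = (v - lo) / span
            scaled.append(score if higher_is_better else 1.0 - score)
    return scaled


def leader_index(values: list[Optional[float]],
                 higher_is_better: bool = True) -> Optional[int]:
    """Index of the cell to ring, or None if the column is empty."""
    candidates = [(v, i) for i, v in enumerate(values) if v is not None]
    if not candidates:
        return None
    pick = max if higher_is_better else min
    return pick(candidates, key=lambda pair: pair[0])[1]
```

For a lower-is-better column such as price or VRAM, pass `higher_is_better=False` so the cheapest or smallest entry gets the brightest cell.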

Method

A hand-curated snapshot, verified 2026-06-06. Standardized cross-model metrics (Intelligence Index, output speed, blended price, context) come from Artificial Analysis; human-preference Elo from LMArena; SWE-bench Verified, GPQA, HLE, MMLU-Pro and LiveCodeBench from vendor model cards and announcement posts; open-weight parameter counts from the HuggingFace safetensors index.

Every number on this page carries a source link — hover a node, open the detail card, or click any cell. Where a model hasn't published a number, the cell reads "—" rather than a guess. The model writes none of these figures; the catalog does.
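
A catalog that follows this rule could store each figure together with its source and render a dash when the figure is absent. The Python sketch below is hypothetical: the class names, metric keys, and example URL are invented for illustration, and the `plottable` check (a model needs a value on all three chosen axes) is an assumption about why only some models appear in the cube.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Measurement:
    value: float
    source_url: str  # every stored number carries its source link


@dataclass
class ModelEntry:
    name: str
    metrics: dict[str, Measurement] = field(default_factory=dict)


def render_cell(entry: ModelEntry, metric: str) -> str:
    """Show the published value, or a dash when none exists (never a guess)."""
    m = entry.metrics.get(metric)
    return "—" if m is None else f"{m.value:g}"


def plottable(entry: ModelEntry, axes: tuple[str, str, str]) -> bool:
    """Assumed rule: a model is plotted only if it has all three axis values."""
    return all(axis in entry.metrics for axis in axes)


if __name__ == "__main__":
    # Hypothetical entry; the value and URL are placeholders.
    entry = ModelEntry(
        name="example-model",
        metrics={"gpqa_diamond": Measurement(81.5, "https://example.com/model-card")},
    )
    print(render_cell(entry, "gpqa_diamond"))
    print(render_cell(entry, "swe_bench_verified"))
```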

Snapshot, not a live feed. Benchmarks saturate, leaderboards re-anchor, and the frontier moves weekly — re-verify against the linked sources before you cut a PO against them.