The Open-Model Cost Chart Everyone's Sharing Is API Prices. Here's What Self-Hosting Actually Gets You (Measured)

The intelligence-vs-cost chart making the rounds shows open models winning the value quadrant. True, but the x-axis is API token price. The cheap open winners are 100B-to-1T MoEs you can't run on a desktop GPU. Here's what you can actually self-host on an 11GB and a 24GB card, measured, and where the real ceiling is.



There's a chart going around: intelligence on the y-axis, cost to run on the x-axis, and a green "most attractive" quadrant in the upper left where high intelligence meets low cost. The takeaway everyone's posting is that the green quadrant is almost entirely open source. DeepSeek, GLM, MiniMax, Kimi, Qwen all show up smart-enough and cheap, while the closed frontier models sit expensive on the right.

It's a real trend and the chart isn't wrong. But read the x-axis label: cost to run is a blended API price. That number answers "what does it cost to call this model through somebody's API," which is a different question from "what does it cost to run this yourself." For those of us who self-host, the second question is the whole point, and the chart quietly hides the answer.

So here's what it skips — measured on the two cards I own.

The catch: you can't self-host the green quadrant

The open models winning that value quadrant aren't small. Take GLM-5.2, the one everyone points to when they say the open frontier finally caught up. It's coding-first and currently the strongest open-weight model on the coding benchmarks: a ~744B-parameter MoE with about 40B active per token. The cheap API price (around $1.40 in and $4.40 out per million tokens, roughly a sixth of GPT-5.5) is the headline. But the thing that sets it apart from the closed frontier models is the other half: the weights are MIT-licensed, so you can run it yourself, with no per-token fee, on your own box.

Then you try to. 744B at Q4 is roughly 372GB of weights (about half a byte per parameter). But it's a MoE with ~40B active, not a dense model, and that changes the hardware story: you need 372GB of memory somewhere, yet since only ~40B fires per token you don't need all of it in fast GPU VRAM. Park the experts in system RAM, partially offload, and it'll run on a big-RAM box, slowly. The other value-quadrant models have the same shape: DeepSeek and Kimi range from the high hundreds of billions toward a trillion total parameters, with a few tens of billions active. So "datacenter-only" overstates it: you can get one of these going on a serious workstation. What you can't do is fit it in fast VRAM on a desktop card or two, and that's the part that decides whether it's usable or a slideshow.
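The 372GB figure is simple arithmetic, and a minimal sketch makes it easy to rerun for other models. It assumes exactly 4 bits per weight and counts weights only; real Q4 formats usually carry slightly more than 4 bits per weight, and KV cache and runtime buffers come on top. The function name is illustrative.

```python
def weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    Counts weights only: no KV cache, no runtime buffers.
    """
    return params_billion * bits_per_weight / 8


total_gb = weight_gb(744)   # every expert has to live somewhere
active_gb = weight_gb(40)   # weights actually read for each generated token

print(f"total weights:  ~{total_gb:.0f} GB")
print(f"active weights: ~{active_gb:.0f} GB per token")
```

The gap between the two numbers is the whole MoE story: the total decides how much memory you need, the active slice decides how much has to move per token.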

That RAM-offload regime, where the active params live in slow memory, is exactly what bites you on consumer hardware. I hit it on a much smaller model, and it's worth seeing the numbers before you bank on offloading a 744B.

So when you self-host, you don't get the green quadrant. You get whatever fits on the card in front of you, which is a tier below. The useful question is: how far below, and is it good enough? That part I can answer with numbers instead of a chart.

What actually runs on a consumer card

Two tiers, both single consumer GPUs, models running fully on the GPU through Ollama. These are my own measured runs from earlier write-ups, pulled into one place:

GPU (used price) | Best model that fits well | Gen tok/s | Prefill tok/s | Context headroom
--- | --- | --- | --- | ---
11GB GTX 1080 Ti (~$200) | Gemma 4 12B QAT | ~32 | ~315 | 12B at 16k with q8 KV
(same card) | Qwen3 8B | ~46 | ~1390 | comfortable
24GB RTX 3090 (~$800) | Qwen3.6 27B Q4 + MTP | ~75 | n/a¹ | dense 27B fits in VRAM

¹ Prefill doesn't reduce to one number on this card; it scales hard with context. At 64k the first token took about 59s. See "Long context is the real tax" below.

The 11GB card tops out comfortably at a 12B. A dense 27B doesn't fit on it at all. The 24GB card moves you up to a dense 27B at a fast ~75 tok/s once speculative decoding (the MTP in the table) is on, and that's the sweet spot: a 27B is a real step up in capability from a 12B, and it still lives entirely in VRAM.

On the intelligence chart, those are the mid-tier models, well below the green-quadrant frontier-open ones. So that's the real answer to "what does self-hosting get you": solid, useful, a tier under the cheap-API winners.
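For anyone who wants comparable gen and prefill figures on their own card, here is one way to get them from Ollama. This is a sketch, not necessarily how the numbers in this post were collected. It relies on the timing fields Ollama returns from a non-streaming `/api/generate` call (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds) and assumes the default local endpoint. The helper names and the model tag in the usage comment are placeholders.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def rates_from_response(resp: dict) -> dict:
    """Turn Ollama's nanosecond timing fields into tokens per second."""
    ns_per_s = 1e9
    return {
        "prefill_tok_s": resp["prompt_eval_count"]
        / (resp["prompt_eval_duration"] / ns_per_s),
        "gen_tok_s": resp["eval_count"] / (resp["eval_duration"] / ns_per_s),
    }


def measure(model: str, prompt: str) -> dict:
    """Run one non-streaming generation and return prefill and gen rates."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return rates_from_response(json.load(r))


# Usage (needs a running Ollama server and a pulled model; tag is a placeholder):
#   print(measure("qwen3:8b", "Explain mixture-of-experts in two sentences."))
```

Use a prompt long enough to make prefill measurable, and run it more than once: the first call includes model load time, which these two rates deliberately leave out.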

What the API number hides

Three costs that never show up as a dollar figure on that chart, and all three bit me at some point.

The VRAM ceiling is a wall, not a slope. A model either fits or it doesn't. The 27B that flies on a 3090 simply won't load on an 11GB card — no "a bit slower" middle ground at the boundary, it just fails, and your only move is a smaller model or a bigger card.
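A rough sketch of that fit-or-fail check, for models meant to run entirely on the GPU. It assumes 4 bits per weight and adds a hypothetical 2GB allowance for KV cache and runtime buffers; that allowance is a placeholder, not a measured figure, and it grows with context length.

```python
def weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight footprint in GB, weights only."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(
    params_billion: float,
    vram_gb: float,
    bits_per_weight: float = 4.0,
    overhead_gb: float = 2.0,  # hypothetical allowance for KV cache and buffers
) -> bool:
    """True if weights plus the assumed overhead fit entirely in VRAM."""
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


for label, size_b in [("8B", 8), ("12B", 12), ("27B", 27)]:
    for vram in (11, 24):
        verdict = "fits" if fits_in_vram(size_b, vram) else "does not fit"
        print(f"{label} on {vram}GB: {verdict}")
```

The result is a boolean on purpose. There is no partial credit at the boundary, which is the point of calling it a wall.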

Spilling a MoE to system RAM looks like the obvious escape hatch when a model is too big. It isn't. I tried it with a 35B-A3B (~3B active) across two 1080 Tis and got about 17 tok/s — once the experts get mmapped to system RAM the whole thing goes memory-bandwidth-bound on the active params, and a CPU nearly tied it. A 12B living entirely in VRAM often feels snappier than a 35B that spills, which isn't what the parameter count would tell you. This is the same regime GLM-5.2 lands in if you offload it at home, just at a much larger scale, so that's the speed to expect, not the API's.
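The bandwidth-bound regime has a simple upper limit: each generated token has to read the active weights once, so tokens per second cannot exceed memory bandwidth divided by active bytes. The sketch below assumes 4 bits per weight, and the 50 GB/s system-RAM bandwidth is a hypothetical placeholder, not a measurement from these machines. Substitute your own.

```python
def decode_ceiling_tok_s(
    active_params_billion: float,
    bandwidth_gb_s: float,
    bits_per_weight: float = 4.0,
) -> float:
    """Upper bound on decode speed when reading active weights is the bottleneck."""
    active_gb_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_gb_per_token


ASSUMED_RAM_BANDWIDTH_GB_S = 50.0  # hypothetical; use your own measured figure

for label, active_b in [("35B-A3B (~3B active)", 3), ("GLM-5.2 (~40B active)", 40)]:
    ceiling = decode_ceiling_tok_s(active_b, ASSUMED_RAM_BANDWIDTH_GB_S)
    print(f"{label}: at most ~{ceiling:.1f} tok/s")
```

This is a ceiling, not a prediction. Real runs land below it, and the bound scales inversely with active parameters, which is why a ~40B-active model offloaded to RAM ends up more than ten times slower than a ~3B-active one on the same memory.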

The 3090's catch shows up at long context. It generates fast, but prompt processing scales hard: at 64k tokens the first token took about 59 seconds before generation even started. That latency never appears in a tokens-per-dollar number, and for anything retrieval-heavy it's the thing you feel.
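To put the 59 seconds in throughput terms, here is a small sketch that converts time-to-first-token into an effective prefill rate and back. It treats "64k" as 64,000 tokens, which is an assumption, and the inverse function assumes a constant prefill rate, so at long context it is an optimistic estimate.

```python
def effective_prefill_tok_s(prompt_tokens: int, seconds_to_first_token: float) -> float:
    """Average prompt-processing rate implied by a measured time to first token."""
    return prompt_tokens / seconds_to_first_token


def time_to_first_token_s(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Estimated wait before generation starts, assuming a constant prefill rate."""
    return prompt_tokens / prefill_tok_s


# The measured 3090 data point: about 59 s before the first token at 64k.
rate = effective_prefill_tok_s(64_000, 59)
print(f"effective prefill at 64k: ~{rate:.0f} tok/s")  # about 1,085 tok/s

# Constant-rate estimate for a shorter prompt (an estimate, not a measurement).
print(f"estimated wait at 8k: ~{time_to_first_token_s(8_000, rate):.1f} s")
```

Because the rate falls as context grows, the shorter-prompt estimate is pessimistic and any extrapolation beyond 64k would be optimistic.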

So is it worth self-hosting?

If you're chasing the cheapest intelligence-per-token, the chart is right and the answer is often no. A cheap API call to something like GLM-5.2 will beat your 3090 on raw capability per dollar, because you're not paying to keep a card idle between prompts, and you're getting a 744B model instead of a 27B.
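A back-of-envelope break-even, using only figures already quoted in this post: the ~$800 used 3090, the ~$4.40 per million output tokens API price, and ~75 tok/s generation. It deliberately ignores electricity, input-token charges, and the fact that the API serves a much larger model, so treat it as a sketch that flatters the card.

```python
def breakeven_output_tokens(card_usd: float, api_usd_per_million_out: float) -> float:
    """Output tokens you could buy from the API for the price of the card."""
    return card_usd / api_usd_per_million_out * 1_000_000


def days_of_continuous_generation(tokens: float, gen_tok_s: float) -> float:
    """Days of non-stop local generation needed to produce that many tokens."""
    return tokens / gen_tok_s / 86_400


tokens = breakeven_output_tokens(card_usd=800, api_usd_per_million_out=4.40)
days = days_of_continuous_generation(tokens, gen_tok_s=75)

print(f"break-even: ~{tokens / 1e6:.0f}M output tokens")
print(f"which is ~{days:.0f} days of generating around the clock")
```

Under these assumptions the card pays for itself only after roughly four weeks of uninterrupted generation, and only against a price for a stronger model. Most home setups sit idle far more than they generate.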

Self-hosting is a bad way to win the cost game. What it buys you is the stuff that axis never measures: your data stays on the box, it runs offline, you can fine-tune and pin versions, and nobody deprecates a model out from under you. That last one is less abstract than it sounds. A weight already sitting on your disk under MIT is the one version nobody can reprice, retire, or region-lock on you later, which is part of why the open releases are starting to get talked about as insurance and not just a cheaper API. I run a local research assistant over my own papers for exactly that reason, and "a tier below the frontier" is completely fine for it. That's what you're paying for: privacy and control. The per-token math is a side issue.

So that's the bit the chart leaves out. On API the open models do win on price — no argument there. But once the weights are on your own card you drop a tier, you hit the VRAM wall, and long prompts crawl. Nobody self-hosting at home is doing it to shave a few dollars a month. They do it because the weights are theirs, sitting on a disk nobody can reprice or retire.

Caveats

These are two cards I actually own, an 11GB Pascal and a 24GB Ampere, single-GPU, Ollama, the specific quants from my earlier posts. I don't have a 4090, a 5090, or a multi-card rig, so I can't speak to those tiers and I'm not going to guess at them. The model sizes for the big MoEs are approximate; if you're quoting them, check the current model cards. Numbers are from my own runs; they were stable, but treat them as rounded figures, not precise to the decimal.
