
The Ollama num_ctx Trap: a Default You Never Set Can Halve Your Tokens/sec (Full Sweep on a 3090)

Ollama sizes the KV cache to your context length, and the default can quietly push a model that fits in VRAM into a CPU spill — cutting throughput. A full num_ctx sweep of Qwen3.6-27B on a single RTX 3090 shows exactly where the cliff is, and why a bigger context is not free.



TL;DR: Ollama sizes the KV cache to your num_ctx, and the default can quietly push a model whose weights fit in VRAM into a CPU spill that throttles it. On a 3090, capping num_ctx at 8192 ran 1.6× faster than the stock default (35.9 vs 22.7 tok/s), and raising the context to 128K or 256K dropped throughput to 6.9 and 4.2 tok/s. Set num_ctx to your real working size and check ollama ps.

I mentioned this in passing in my last writeup, but it bites anyone running a big-context model on a single card, so it deserves its own breakdown — with the actual sweep.

The setup

Single RTX 3090 (24 GB), qwen3.6:27b (27.8B parameters, Q4, ~17.4 GB of weights) on Ollama 0.24.0. The weights fit comfortably in 24 GB, so you'd expect the model to run fully on the GPU. By default, it doesn't.

The sweep

Same model, same prompt, ~160 tokens generated, only num_ctx changes:

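| num_ctx | gen tok/s | model loaded | on GPU | placement |
|---|---|---|---|---|
| 8192 | 35.9 | 22.1 GB | 22.1 GB | 100% GPU |
| default (≈32768) | 22.7 | 23.9 GB | 21.1 GB | 88% (≈2.7 GB on CPU) |
| 32768 | 22.8 | 23.9 GB | 21.1 GB | 88% |
| 131072 | 6.9 | 32.2 GB | 22.7 GB | 70% |
| 262144 (native 256K) | 4.2 | 42.4 GB | 23.5 GB | 55% |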

Two things jump out:

  1. The stock default already costs you ~37%. Ollama's default here landed at ~32K context (same footprint as an explicit 32768, and within 0.1 tok/s of it), which inflates the loaded footprint to ~23.9 GB, spills ~2.7 GB to CPU, and drops you to 22.7 tok/s. Cap it at 8192 and the whole thing fits: 35.9 tok/s, 100% on GPU.
  2. More context is not free. Push num_ctx toward the model's native 256K and the KV cache balloons the loaded size to 32–42 GB, with 30–45% of it offloaded to system RAM. You don't just lose a little, you fall off a cliff: 6.9 tok/s at 128K, 4.2 tok/s at 256K. The 256K run is 8.5× slower than the 8K case, on the same card and model.
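
To make the arithmetic behind those two points easy to re-check, here is a small sketch that recomputes the percentages and ratios from the table. The throughput figures are copied from the sweep; the helper names `slowdown` and `percent_lost` are just illustrative.

```python
# Generation throughput (tok/s) copied from the sweep table.
TOK_PER_S = {
    8192: 35.9,
    32768: 22.8,
    131072: 6.9,
    262144: 4.2,
}
DEFAULT_TOK_PER_S = 22.7  # stock default, which landed at roughly 32K


def slowdown(baseline: float, measured: float) -> float:
    """How many times slower `measured` is than `baseline`."""
    return baseline / measured


def percent_lost(baseline: float, measured: float) -> float:
    """Share of the baseline throughput that was lost, in percent."""
    return (baseline - measured) / baseline * 100.0


best = TOK_PER_S[8192]
print(f"default vs 8K: {percent_lost(best, DEFAULT_TOK_PER_S):.0f}% lost, "
      f"{slowdown(best, DEFAULT_TOK_PER_S):.1f}x slower")
for ctx in (131072, 262144):
    print(f"{ctx} vs 8K: {slowdown(best, TOK_PER_S[ctx]):.1f}x slower")
```

Swapping in your own measurements is the quickest way to see whether your default is costing you a similar share.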

Why this happens

A model's VRAM use is roughly weights + KV cache, and the KV cache grows linearly with the context length you allocate. Qwen3.6-27B ships a 256K native context; if Ollama sizes the cache to a large default, weights (17.4 GB) + KV can exceed 24 GB, and the runtime offloads the overflow to CPU/RAM. Once any layer or part of the cache lives on the CPU, generation throughput tanks, because the GPU keeps stalling while it waits on the slower CPU side.
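
As a rough illustration of that linear growth, the sketch below fits a straight line through the loaded sizes from the sweep and uses it to guess how much context fits in a given memory budget. This is a back-of-the-envelope estimate, not how Ollama plans memory: the loaded size includes more than the KV cache, four points are a thin sample, and the 22.1 GB budget is an assumption taken from the one run that stayed 100% on GPU. The function names are hypothetical.

```python
# (num_ctx, loaded size in GB) pairs from the sweep table; the default row is
# omitted because it duplicates the explicit 32768 measurement.
SWEEP = [(8192, 22.1), (32768, 23.9), (131072, 32.2), (262144, 42.4)]


def fit_line(points):
    """Ordinary least-squares fit y = intercept + slope * x, stdlib only."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def max_ctx_for_budget(budget_gb, intercept, slope):
    """Largest num_ctx whose predicted loaded size stays within budget_gb."""
    return max(0, int((budget_gb - intercept) / slope))


intercept_gb, gb_per_token = fit_line(SWEEP)
print(f"fixed cost ~{intercept_gb:.1f} GB, "
      f"~{gb_per_token * 1024:.3f} GB per 1K tokens of context")

# 22.1 GB is the largest footprint the sweep showed running 100% on GPU;
# treating it as the budget is an assumption, not a measured limit.
print("estimated max num_ctx:",
      max_ctx_for_budget(22.1, intercept_gb, gb_per_token))
```

Treat the result as a starting point for your own sweep rather than a limit; the real cutoff depends on what else is using VRAM.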

The trap is that nothing tells you this is happening. The model loads, answers correctly, and just runs slow.

How to check (10 seconds)

ollama ps

Look at the PROCESSOR column. 100% GPU = good. Anything showing a CPU/GPU split, such as 12%/88% CPU/GPU (or, via the API, a size_vram smaller than size in the /api/ps response), means you're spilling and paying for it in tok/s.
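
If you'd rather script the check, the same information is available from the /api/ps endpoint. The sketch below is one way to flag spilling models; it assumes a local server on the default port 11434, and the helper names and the 0.999 threshold are illustrative choices.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default local endpoint


def gpu_fraction(model: dict) -> float:
    """Fraction of a loaded model's bytes resident in VRAM (1.0 = all on GPU)."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def spilling_models(ps_response: dict, threshold: float = 0.999) -> list:
    """Return (name, gpu_fraction) for every loaded model below the threshold."""
    return [
        (m.get("name", "?"), gpu_fraction(m))
        for m in ps_response.get("models", [])
        if gpu_fraction(m) < threshold
    ]


def fetch_ps(base_url: str = OLLAMA_URL) -> dict:
    """GET /api/ps, the API counterpart of `ollama ps`."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)


# Usage against a running server:
#   for name, frac in spilling_models(fetch_ps()):
#       print(f"{name}: only {frac:.0%} on GPU")
```

An empty result means everything currently loaded is fully on the GPU.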

The fix

Set num_ctx to the context you actually use. Chat and RAG prompts rarely need more than 8–16K:

# per request (Ollama API):  "options": { "num_ctx": 8192 }
# or pin it into a model:
printf 'FROM qwen3.6:27b\nPARAMETER num_ctx 8192\n' > q27-8k.Modelfile
ollama create qwen3.6-27b-8k -f q27-8k.Modelfile
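
For the per-request route, here is a minimal sketch of a full call that sets num_ctx and reads back the generation speed. It assumes a local server on the default port and uses the eval_count and eval_duration fields (nanoseconds) from the /api/generate response; the function names and the sample prompt are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default local endpoint


def build_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Body for POST /api/generate with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object back instead of a stream
        "options": {"num_ctx": num_ctx},
    }


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(body: dict, base_url: str = OLLAMA_URL) -> dict:
    """Send the request and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)


# Usage against a running server:
#   reply = generate(build_request("qwen3.6:27b", "Explain KV caches.", 8192))
#   print(f"{tokens_per_second(reply):.1f} tok/s")
```

Looping the usage lines over a few num_ctx values gives you a small sweep of your own.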

If you genuinely need a huge context, that's a real tradeoff to make on purpose — but don't pay the tax by accident.

Honest note

In my earlier post I quoted ~17 tok/s for the "default" case; that was a heavier ad-hoc reading. This controlled sweep puts the stock default closer to 22.7. Either way the conclusion is the same — and the 8K vs 256K gap (35.9 vs 4.2) is the part worth remembering.

Your turn

Run ollama ps on whatever you're serving right now — is it actually 100% on GPU, or quietly spilling? And for the big-native-context models (Qwen3, etc.), what context size do you actually run them at?
