The Ollama num_ctx Trap: a Default You Never Set Can Halve Your Tokens/sec (Full Sweep on a 3090)
Ollama sizes the KV cache to your context length, and the default can quietly push a model that fits in VRAM into a CPU spill — cutting throughput. A full num_ctx sweep of Qwen3.6-27B on a single RTX 3090 shows exactly where the cliff is, and why a bigger context is not free.
TL;DR: Ollama sizes the KV cache to your num_ctx, and the default can quietly push a model that fits in VRAM into a CPU spill that throttles it. On a 3090, capping num_ctx to 8192 ran 1.6× faster than the stock default, and cranking the context up fell off a cliff. Set num_ctx to your real working size and check ollama ps.
I mentioned this in passing in my last writeup, but it bites anyone running a big-context model on a single card, so it deserves its own breakdown — with the actual sweep.
The setup
Single RTX 3090 (24 GB), qwen3.6:27b (27.8B, Q4, ~17.4 GB of weights) on Ollama 0.24.0. The weights fit comfortably in 24 GB, so you'd expect the model to run fully on the GPU. By default, it doesn't.
The sweep
Same model, same prompt, ~160 tokens generated, only num_ctx changes:
| num_ctx | gen tok/s | model loaded | on GPU | placement |
|---|---|---|---|---|
| 8192 | 35.9 | 22.1 GB | 22.1 GB | 100% GPU |
| default (≈32768) | 22.7 | 23.9 GB | 21.1 GB | 88% (≈2.7 GB on CPU) |
| 32768 | 22.8 | 23.9 GB | 21.1 GB | 88% |
| 131072 | 6.9 | 32.2 GB | 22.7 GB | 70% |
| 262144 (native 256K) | 4.2 | 42.4 GB | 23.5 GB | 55% |
Two things jump out:
- The stock default already costs you ~37%. Ollama's default here landed at ~32K context (identical numbers to an explicit 32768), which inflates the loaded footprint to ~23.9 GB, spills ~2.7 GB to CPU, and drops you to 22.7 tok/s. Cap it to 8192 and the whole thing fits: 35.9 tok/s, 100% on GPU.
- More context is not free. Push num_ctx toward the model's native 256K and the KV cache balloons the loaded size to 32–42 GB, with roughly 9.5 GB (at 128K) to 18.9 GB (at 256K) of it offloaded to system RAM. You don't just lose a little, you fall off a cliff: 6.9 tok/s at 128K, 4.2 tok/s at 256K. The 256K run is 8.5× slower than the 8K case, on the same card and model.
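The ratios quoted in the two points above can be re-derived from the table. The following is a minimal Python sketch that does only arithmetic on the table's rounded values; the names SWEEP and summarize are illustrative and not part of any Ollama tooling.

```python
# Rows copied from the sweep table: (label, gen tok/s, loaded GB, on-GPU GB).
SWEEP = [
    ("8192", 35.9, 22.1, 22.1),
    ("default", 22.7, 23.9, 21.1),
    ("32768", 22.8, 23.9, 21.1),
    ("131072", 6.9, 32.2, 22.7),
    ("262144", 4.2, 42.4, 23.5),
]


def summarize(rows, baseline_label="8192"):
    """Per row: slowdown vs. the baseline, throughput loss, spilled GB, GPU share."""
    baseline = next(tps for label, tps, _, _ in rows if label == baseline_label)
    out = {}
    for label, tps, loaded_gb, gpu_gb in rows:
        out[label] = {
            "slowdown_x": baseline / tps,
            "loss_pct": 100.0 * (1.0 - tps / baseline),
            "spilled_gb": loaded_gb - gpu_gb,
            "gpu_pct": 100.0 * gpu_gb / loaded_gb,
        }
    return out


summary = summarize(SWEEP)
for label, s in summary.items():
    print(f"{label:>8}: {s['slowdown_x']:.2f}x slower, {s['loss_pct']:.0f}% loss, "
          f"{s['spilled_gb']:.1f} GB spilled, {s['gpu_pct']:.0f}% on GPU")
```

Because the inputs are rounded to one decimal, the derived spill for the default row comes out slightly different from the ≈2.7 GB in the table; that gap is presumably rounding rather than a different measurement.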
Why this happens
A model's VRAM use is weights + KV cache, and the KV cache grows linearly with the context length you allocate. Qwen3.6-27B ships a 256K native context; if Ollama sizes the cache to a large default, weights (17.4 GB) + KV can exceed 24 GB, and the runtime offloads the overflow to CPU/RAM. Once any layer or part of the cache lives on the CPU, generation throughput tanks, because the GPU keeps stalling while it waits on the slower CPU side.
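The linear growth can be sanity-checked against the sweep's own numbers. Here is a rough Python sketch that fits a straight line through the four (num_ctx, loaded size) pairs from the table. It is a four-point fit on rounded values, so treat the slope and intercept as ballpark figures; the helper names are illustrative.

```python
# (num_ctx, loaded GB) pairs taken from the sweep table.
POINTS = [(8192, 22.1), (32768, 23.9), (131072, 32.2), (262144, 42.4)]


def fit_line(points):
    """Ordinary least-squares fit: y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predicted_footprint_gb(num_ctx, intercept, slope):
    """Extrapolated loaded size in GB for a given num_ctx (rough estimate only)."""
    return intercept + slope * num_ctx


intercept_gb, gb_per_token = fit_line(POINTS)
print(f"fixed cost ~{intercept_gb:.1f} GB, "
      f"growth ~{gb_per_token * 1024:.3f} GB per 1024 tokens of context")
print(f"estimated footprint at 16384: "
      f"{predicted_footprint_gb(16384, intercept_gb, gb_per_token):.1f} GB")
```

The fitted fixed cost lands above the 17.4 GB of weights, which suggests the loaded size includes more than weights plus KV cache (compute buffers would be one guess). The fit predicts the loaded footprint only, not how Ollama will split it between GPU and CPU.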
The trap is that nothing tells you this is happening. The model loads, answers correctly, and just runs slow.
How to check (10 seconds)
ollama ps
Look at the PROCESSOR column. 100% GPU = good. Anything like 88% GPU / 12% CPU (or a size_vram smaller than the loaded size via the API) means you're spilling — and paying for it in tok/s.
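The same check can be scripted against the API. This is a minimal Python sketch that assumes a local Ollama server at the default http://localhost:11434 and relies on the size and size_vram fields that /api/ps reports per loaded model; the function names and the tolerance value are illustrative choices.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address


def fetch_running_models(base_url=OLLAMA_URL):
    """Return the 'models' list from Ollama's /api/ps endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


def gpu_fraction(model):
    """Share of the loaded footprint resident in VRAM (1.0 means fully on GPU)."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def spilling(models, tolerance=0.999):
    """Names of loaded models whose VRAM share falls below the tolerance."""
    return [m.get("name", "?") for m in models if gpu_fraction(m) < tolerance]
```

With a server running, spilling(fetch_running_models()) returns the names of any loaded models that are partly on the CPU; an empty list means everything is fully in VRAM.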
The fix
Set num_ctx to the context you actually use. Chat and RAG prompts rarely need more than 8–16K:
# per request (Ollama API):
"options": { "num_ctx": 8192 }
# or pin it into a model:
printf 'FROM qwen3.6:27b\nPARAMETER num_ctx 8192\n' > q27-8k.Modelfile
ollama create qwen3.6-27b-8k -f q27-8k.Modelfile
If you genuinely need a huge context, that's a real tradeoff to make on purpose — but don't pay the tax by accident.
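To compare context sizes on your own card, the per-request option can be set from a script and the speed read back from the response. This is a hedged Python sketch, assuming a local server at the default address and a non-streaming call to /api/generate, whose response reports eval_count and eval_duration (in nanoseconds). The helper names are illustrative, and this is not necessarily how the sweep above was run.

```python
import json
import urllib.request


def build_request(model, prompt, num_ctx):
    """Request body for /api/generate with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # single JSON response instead of a stream
        "options": {"num_ctx": num_ctx},
    }


def tokens_per_second(response):
    """Generation speed: eval_count tokens over eval_duration nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(payload, base_url="http://localhost:11434"):
    """POST the payload to a local Ollama server (assumed default address)."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)
```

For example, tokens_per_second(generate(build_request("qwen3.6:27b", "Explain KV caches.", 8192))) gives one data point. Changing num_ctx between calls makes Ollama reload the model, so the first request at each size includes load time in its total duration but not in eval_duration.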
Honest note
In my earlier post I quoted ~17 tok/s for the "default" case; that was a heavier ad-hoc reading. This controlled sweep puts the stock default closer to 22.7. Either way the conclusion is the same — and the 8K vs 256K gap (35.9 vs 4.2) is the part worth remembering.
Your turn
Run ollama ps on whatever you're serving right now — is it actually 100% on GPU, or quietly spilling? And for the big-native-context models (Qwen3, etc.), what context size do you actually run them at?