The Ollama num_ctx Trap: a Default You Never Set Can Halve Your Tokens/sec (Full Sweep on a 3090)

Tuning local LLM inference on a single GPU

TL;DR: Ollama sizes the KV cache to your num_ctx, and the default can quietly push a model whose weights fit in VRAM into a CPU spill that throttles it. On a 3090, capping num_ctx at 8192 ran 1.6× faster than the stock default (35.9 vs 22.7 tok/s), and cranking the context up to 128K or 256K fell off a cliff. Set num_ctx to your real working size and check ollama ps.

I mentioned this in passing in my last writeup, but it bites anyone running a big-context model on a single card, so it deserves its own breakdown — with the actual sweep.

The setup

Single RTX 3090 (24 GB), qwen3.6:27b (27.8B, Q4, ~17.4 GB of weights) on Ollama 0.24.0. The weights fit comfortably in 24 GB, so you'd expect the model to run fully on the GPU. By default, it doesn't.

The sweep

Same model, same prompt, ~160 tokens generated, only num_ctx changes:

num_ctx	gen tok/s	loaded size	of which on GPU	GPU placement
8192	35.9	22.1 GB	22.1 GB	100% GPU
default (≈32768)	22.7	23.9 GB	21.1 GB	88% (≈2.7 GB on CPU)
32768	22.8	23.9 GB	21.1 GB	88%
131072	6.9	32.2 GB	22.7 GB	70%
262144 (native 256K)	4.2	42.4 GB	23.5 GB	55%

Two things jump out:

The stock default already costs you ~37%. Ollama's default here landed at ~32K context (identical numbers to an explicit 32768), which inflates the loaded footprint to ~23.9 GB, spills ~2.7 GB to CPU, and drops you to 22.7 tok/s. Cap it to 8192 and the whole thing fits — 35.9 tok/s, 100% on GPU.
More context is not free. Push num_ctx toward the model's native 256K and the KV cache balloons the loaded size to 32–42 GB, with roughly 9.5–19 GB of that offloaded to system RAM. You don't just lose a little, you fall off a cliff: 6.9 tok/s at 128K, 4.2 tok/s at 256K. The 256K case is 8.5× slower than the 8K case, on the same card and model.
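The ratios quoted above can be re-derived from the table. Here is a minimal Python sketch: the numbers are copied from the sweep, while the names (SWEEP, summarize) are my own and purely illustrative.

```python
# Sweep rows copied from the table: (label, gen tok/s, loaded GB, GB on GPU).
SWEEP = [
    ("8192", 35.9, 22.1, 22.1),
    ("default", 22.7, 23.9, 21.1),
    ("32768", 22.8, 23.9, 21.1),
    ("131072", 6.9, 32.2, 22.7),
    ("262144", 4.2, 42.4, 23.5),
]


def summarize(rows, baseline_label="8192"):
    """Return per-row slowdown vs. the baseline, CPU spill in GB, and GPU share in %."""
    baseline = next(tps for label, tps, _, _ in rows if label == baseline_label)
    summary = {}
    for label, tps, loaded_gb, gpu_gb in rows:
        summary[label] = {
            "slowdown": round(baseline / tps, 2),          # how many times slower than 8K
            "spill_gb": round(loaded_gb - gpu_gb, 1),      # part that did not fit in VRAM
            "gpu_pct": round(100 * gpu_gb / loaded_gb),    # share resident on the GPU
        }
    return summary


if __name__ == "__main__":
    for label, stats in summarize(SWEEP).items():
        print(f"{label:>8}: {stats['slowdown']}x slower, "
              f"{stats['spill_gb']} GB on CPU, {stats['gpu_pct']}% GPU")
```

On these inputs the default row comes out 1.58× slower than 8K (the ~37% loss) and the 256K row 8.55× slower. The spill computed from the rounded table values is 2.8 GB for the default, close to the ≈2.7 GB reported; the small gap is presumably rounding.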

Why this happens

A model's VRAM use is weights + KV cache (plus some runtime buffers), and the KV cache grows linearly with the context length you allocate, not the context you actually fill. Qwen3.6-27B ships a 256K native context; if Ollama sizes the cache to a large default, weights (17.4 GB) + KV can exceed 24 GB, and the runtime offloads the overflow to CPU/RAM. Once any layer or the cache lives on the CPU, generation throughput tanks, because every token now waits on the slower CPU-side work while the GPU sits idle.
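The linear growth is visible in the sweep itself: divide the change in loaded size by the change in num_ctx between neighbouring rows. The sketch below makes two assumptions that were not measured, namely that the table's GB are 10^9 bytes and that only the KV cache changes as num_ctx changes. The function names are illustrative.

```python
# (num_ctx, loaded size in GB) pairs copied from the sweep table.
LOADED = [(8192, 22.1), (32768, 23.9), (131072, 32.2), (262144, 42.4)]
WEIGHTS_GB = 17.4  # weight size quoted in the setup


def marginal_kb_per_token(points):
    """Approximate memory cost per context token (KB) between consecutive rows.

    Assumes 1 GB = 1e9 bytes and that only the KV cache scales with num_ctx.
    """
    costs = []
    for (ctx_a, gb_a), (ctx_b, gb_b) in zip(points, points[1:]):
        bytes_per_token = (gb_b - gb_a) * 1e9 / (ctx_b - ctx_a)
        costs.append(round(bytes_per_token / 1e3))
    return costs


def non_weight_gb(loaded_gb, weights_gb=WEIGHTS_GB):
    """Memory beyond the weights (KV cache plus runtime buffers), in GB."""
    return round(loaded_gb - weights_gb, 1)
```

With the table's numbers, marginal_kb_per_token(LOADED) gives about 73, 84 and 78 KB per token for the three intervals. That is roughly constant, which is what linear growth looks like. non_weight_gb(22.1) gives 4.7 GB even at 8K, so a fair amount of the footprint is there before the context gets large.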

The trap is that nothing tells you this is happening. The model loads, answers correctly, and just runs slow.

How to check (10 seconds)

ollama ps

Look at the PROCESSOR column. 100% GPU = good. A split such as 12%/88% CPU/GPU (or, via the API, a size_vram smaller than size in the /api/ps response) means you're spilling, and paying for it in tok/s.
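The same check can be scripted against the API. This is a sketch, not a definitive tool: it assumes a local Ollama server on the default port 11434 and relies on the size and size_vram fields that GET /api/ps returns for each loaded model. The helper names and the 0.999 tolerance are my own choices.

```python
import json
from urllib.request import urlopen

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434"


def fetch_running(base_url=OLLAMA_URL):
    """Return the list of loaded models from GET /api/ps."""
    with urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


def gpu_share(model):
    """Fraction of the loaded model resident in VRAM (1.0 means fully on GPU)."""
    size = model["size"]
    return model["size_vram"] / size if size else 0.0


def spilling(models, tolerance=0.999):
    """Names of models whose VRAM share falls below the tolerance."""
    return [m["name"] for m in models if gpu_share(m) < tolerance]
```

Calling spilling(fetch_running()) against a live server returns the names of any models that are partly on the CPU; an empty list means everything loaded is fully on the GPU.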

The fix

Set num_ctx to the context you actually use. Chat and RAG prompts rarely need more than 8–16K:

# per request (Ollama API):  "options": { "num_ctx": 8192 }
# or pin it into a model:
printf 'FROM qwen3.6:27b\nPARAMETER num_ctx 8192\n' > q27-8k.Modelfile
ollama create qwen3.6-27b-8k -f q27-8k.Modelfile

If you genuinely need a huge context, that's a real tradeoff to make on purpose — but don't pay the tax by accident.
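For the per-request route, here is a hedged Python sketch that sets num_ctx in the options of a non-streaming POST /api/generate call and derives tok/s from the eval_count and eval_duration (nanoseconds) fields of the reply. It assumes a local server on the default port. The helper names are illustrative, and this is not necessarily how the sweep above was measured.

```python
import json
from urllib.request import Request, urlopen

OLLAMA_URL = "http://localhost:11434"  # assumption: default local server


def build_payload(model, prompt, num_ctx):
    """Body for POST /api/generate with an explicit context size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON reply instead of a token stream
        "options": {"num_ctx": num_ctx},
    }


def tokens_per_second(response):
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] * 1e9 / response["eval_duration"]


def generate(model, prompt, num_ctx, base_url=OLLAMA_URL):
    """Send one non-streaming request and return the parsed JSON reply."""
    body = json.dumps(build_payload(model, prompt, num_ctx)).encode("utf-8")
    req = Request(f"{base_url}/api/generate", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=600) as resp:
        return json.load(resp)
```

For example, tokens_per_second(generate("qwen3.6:27b", "Explain KV caches briefly.", 8192)) would report the generation rate at an 8K context; the prompt is a placeholder. Looping the same call over several num_ctx values gives a rough version of the sweep.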

Honest note

In my earlier post I quoted ~17 tok/s for the "default" case; that came from a heavier, ad-hoc measurement rather than a controlled run. This controlled sweep puts the stock default closer to 22.7. Either way the conclusion is the same, and the 8K vs 256K gap (35.9 vs 4.2) is the part worth remembering.

Your turn

Run ollama ps on whatever you're serving right now — is it actually 100% on GPU, or quietly spilling? And for the big-native-context models (Qwen3, etc.), what context size do you actually run them at?

