Doubling Qwen3.6-27B on One RTX 3090: Ollama → llama.cpp + MTP, Lever by Lever (35.7 → ~75 tok/s)

Tuning local LLM inference on a single GPU

A reader on my last post said Ollama was leaving a lot on the table: a tuned backend with multi-token prediction (MTP) could roughly double my 3090's throughput. So I went and measured it, one lever at a time. The short version: they were right. A tuned backend with MTP roughly doubles throughput, and below is the exact path that got me there on my box.

Update (2026-06-10) — corrected after community feedback. Two things in the first version were off, and r/LocalLLaMA was right to flag them. (1) ik_llama does support MTP — I'd used the deprecated -mtp flag; the canonical form is --spec-type mtp:n_max=3,p_min=0.0. (2) My headline 80.2 was a lucky 3-run draw — re-running both engines at n=12 gives ik_llama 75.2 and mainline llama.cpp 74.6: a tie at ~75 tok/s (≈2.1× over Ollama). So the honest headline is ~75 tok/s, both engines support MTP, and they're statistically identical. I've updated the numbers below and kept the story. Thanks to the folks who caught it.

TL;DR

On a single RTX 3090, Qwen3.6-27B generation went from 35.7 tok/s (Ollama) to ~75 tok/s (llama.cpp + MTP) — a measured ≈2.1× — by stacking three independent levers: a leaner engine, a smaller quant, and speculative decoding. The interesting part isn't the headline; it's which lever bought how much, and a couple of things that tripped me up on the way. (To be precise up front: MTP on its own is ~1.6× at the same quant — the ≈2.1× is what you get when all three levers stack. ik_llama and mainline llama.cpp both do MTP and land within noise of each other at ~75.)

The lever table

All on one RTX 3090, Qwen3.6-27B, 200 tokens generated, flash-attention on:

step	what changed	backend	quant	MTP	gen tok/s	vs Ollama	VRAM
baseline	—	Ollama	Q4_K_M	—	35.7	1.00×	23.2 GB
1	engine	ik_llama.cpp	Q4_K_M	—	41.9	1.17×	17.3 GB
2	+ quant	ik_llama.cpp	IQ4_XS	—	47.5	1.33×	15.1 GB
3	+ MTP	llama.cpp / ik_llama	IQ4_XS	on	~75	≈2.1×	~15 GB
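
The "vs Ollama" column, and the way the levers compound, can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the throughput figures from the table; the variable names are mine, and "~75" is treated as exactly 75.0.

```python
# Throughput (tok/s) at each step, copied from the lever table.
steps = [
    ("baseline (Ollama, Q4_K_M)", 35.7),
    ("engine (ik_llama.cpp, Q4_K_M)", 41.9),
    ("quant (IQ4_XS)", 47.5),
    ("MTP on", 75.0),  # "~75" in the table, treated as 75.0 here
]

baseline = steps[0][1]
previous = baseline
total = 1.0
for name, tok_s in steps:
    step_gain = tok_s / previous   # gain contributed by this lever alone
    vs_ollama = tok_s / baseline   # cumulative gain over the Ollama baseline
    total *= step_gain             # product of the per-lever gains
    print(f"{name:32s} {tok_s:5.1f} tok/s  step x{step_gain:.2f}  total x{vs_ollama:.2f}")
    previous = tok_s
```

The per-lever gains multiply rather than add: roughly 1.17 × 1.13 × 1.58 ≈ 2.1.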

A note on fairness (and sample size): the baseline and steps 1–2 use each engine's own native bench path, and step 3 uses llama-server. For a clean apples-to-apples read of MTP alone, I re-ran both engines at n=12: mainline llama.cpp 45.1 (off) → 74.6 (on) = 1.65×, and ik_llama 47.2 (off) → 75.2 (on) = 1.59×. That is statistically a tie at ~75 tok/s (MTP-on has a CV of ~5–7%; that variance is inherent to speculative decoding, since draft acceptance fluctuates run to run). My very first run reported 80.2, but that was a lucky high draw from a 3-run sample; the 12-run mean is ~75, so that's the honest number. (Both the Ollama baseline and the llama.cpp runs fit fully in VRAM; the baseline ran at num_ctx 8192 and the llama.cpp runs at -c 4096. Generation throughput is largely insensitive to that as long as nothing spills to CPU, though it accounts for part of the VRAM difference in the table.)
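
For anyone repeating the n=12 comparison, here is a minimal sketch of how the mean and coefficient of variation (CV) can be computed from per-run throughputs. The per-run values are hypothetical placeholders, not my measured runs; only the last few lines use the n=12 means quoted above.

```python
import statistics

def summarize(runs):
    """Return (mean, CV) for a list of per-run tok/s values.

    CV = sample standard deviation / mean.
    """
    mean = statistics.mean(runs)
    cv = statistics.stdev(runs) / mean
    return mean, cv

# Hypothetical per-run values, for illustration only (NOT the measured runs).
example_runs = [71.0, 79.5, 74.2, 76.8, 70.3, 78.1]
mean, cv = summarize(example_runs)
print(f"mean = {mean:.1f} tok/s, CV = {cv:.1%}")

# MTP-only speedup, from the n=12 means reported in the text.
mainline_speedup = 74.6 / 45.1
ik_speedup = 75.2 / 47.2
print(f"mainline: {mainline_speedup:.2f}x, ik_llama: {ik_speedup:.2f}x")
```

With a CV in the 5–7% range, a 3-run sample can land a few tok/s away from the long-run mean, which is why the sample size matters more than the extra decimal.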

Levers 1 and 2: engine and quant

Moving the same Q4_K_M model from Ollama to a bare-metal ik_llama.cpp build (CUDA, flash-attention, compiled for the 3090's sm86) took me from 35.7 → 41.9 tok/s, and dropped VRAM from 23.2 → 17.3 GB. Ollama is convenience-first — it sizes things generously and doesn't expose the lower-level knobs — so a hand-built engine is faster out of the gate. Swapping the quant from Q4_K_M to IQ4_XS added a bit more and shrank VRAM further: 47.5 tok/s, 15.1 GB. Roughly a third faster, and nothing exotic yet. (Does IQ4_XS cost quality? I checked perplexity on wikitext-2 after a reader asked: Q4_K_M = 6.996, IQ4_XS = 6.997 — a +0.01% difference, comfortably inside the error bars (±0.046). IQ4_XS can regress more on other architectures, but for Qwen3.6-27B the quant swap was effectively free.)
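
The perplexity comparison comes down to a one-line check: is the gap between the two quants smaller than the error bar? A small sketch using the figures above (the variable names are mine):

```python
ppl_q4_k_m = 6.996
ppl_iq4_xs = 6.997
error_bar = 0.046  # the +/- reported for the perplexity estimates

diff = ppl_iq4_xs - ppl_q4_k_m
rel_diff_pct = diff / ppl_q4_k_m * 100   # relative difference in percent
within_noise = abs(diff) < error_bar     # True if the gap is inside the error bar

print(f"difference = {diff:.3f} ({rel_diff_pct:+.2f}%), within error bar: {within_noise}")
```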

Lever 3: MTP (where the real jump is)

Multi-token prediction is a form of speculative decoding, and it's the big one. The idea: a small, fast draft predicts several tokens ahead (with MTP, the draft comes from extra prediction layers inside the model itself, the "nextn" part of the GGUF, rather than from a separate draft model), and the main model verifies them in one pass. When the drafts are accepted, you get multiple tokens for roughly the cost of one pass. Because the main model verifies every drafted token before it's emitted, the output is preserved: this is a throughput win, not a quality tradeoff.
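
To make the draft-and-verify idea concrete, here is a toy sketch of greedy speculative decoding. It is not llama.cpp's implementation: main_next and draft_next are hypothetical stand-ins for the main model and the draft, and the "single verification pass" is simulated by calling main_next at each drafted position while counting it as one pass.

```python
def main_next(ctx):
    """Toy 'main model': a deterministic next token for a context (hypothetical)."""
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_next(ctx):
    """Toy 'draft': agrees with the main model unless len(ctx) is a multiple of 5."""
    tok = main_next(ctx)
    return tok if len(ctx) % 5 else (tok + 1) % 50

def plain_decode(prompt, n_tokens):
    """Ordinary decoding: one main-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(main_next(out))
    return out[len(prompt):]

def speculative_decode(prompt, n_tokens, n_max=3):
    """Draft n_max tokens, verify them, keep the agreeing prefix plus one main-model token."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft up to n_max tokens ahead.
        drafts = []
        for _ in range(n_max):
            drafts.append(draft_next(out + drafts))
        # 2) One verification pass (simulated): check each drafted position in order.
        passes += 1
        accepted = []
        for d in drafts:
            target = main_next(out + accepted)
            if d == target:
                accepted.append(d)
            else:
                accepted.append(target)  # main model's token replaces the first mismatch
                break
        else:
            accepted.append(main_next(out + accepted))  # bonus token when every draft passes
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_tokens], passes

prompt = [1, 2, 3]
spec_out, passes = speculative_decode(prompt, 20)
print("same output:", spec_out == plain_decode(prompt, 20), "| main-model passes:", passes)
```

Every emitted token is one the main model itself would have chosen, so the two outputs match; the only thing that changes is how many main-model passes it takes to get there.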

Two things were worth knowing for my setup:

Both ik_llama and mainline llama.cpp do MTP — but the flag matters. I first tried ik_llama's -mtp, which it rejected as legacy, and wrongly concluded ik_llama couldn't do MTP. A reader set me straight: the canonical form is --spec-type mtp:n_max=3,p_min=0.0, and with it ik_llama runs MTP fine (~75 tok/s, matching mainline). Mainline llama.cpp added MTP recently (PR #22673, merged 2026-05-16) and uses --spec-type draft-mtp. Either engine gets you there.
Ollama's GGUF couldn't be reused. Qwen3.6 changed rope.dimension_sections from 3 to 4 elements; Ollama's stored blob still has the older 3-element layout, so llama.cpp refused it (expected 4, got 3). I grabbed a properly-converted GGUF instead (bartowski / a nextn-equipped MTP build) — a small heads-up if you're tempted to point llama.cpp at your existing Ollama blob.

With an MTP-equipped IQ4_XS GGUF and n-max 3, generation lands around ~75 tok/s — whether via mainline llama.cpp's --spec-type draft-mtp or ik_llama's --spec-type mtp:n_max=3,p_min=0.0.

Tuning MTP: more accepted drafts isn't more speed

The one knob that mattered for me was the draft depth (n-max, how many tokens to draft ahead):

config	gen tok/s	draft acceptance
n-max 2	77.5	78.1%
n-max 3	80.2	70.3%
n-max 4	70.7	53.4%
n-max 3 + p-min 0.6	54.1	80.0%
n-max 3 + KV q8_0	74.6	64.5%

The counterintuitive bit: higher acceptance ≠ faster. Pushing p-min to 0.6 raised acceptance to 80% but dropped throughput to 54. A higher p-min presumably makes the drafter stop as soon as it isn't confident in a token, so the drafts that remain are accepted more often, but fewer tokens get drafted per pass and each verification pass yields less. Plain f16 KV beat q8 KV too. n-max 3 with f16 KV was the sweet spot. (These sweep rows are single runs, so read the pattern, not the absolute decimals; the stable 12-run figure for n-max 3 is ~75. I also went looking for a "prefill-off" trick I'd heard about and couldn't find it as a flag in current llama.cpp. Draft depth was the lever that actually moved the number for me.)
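
One way to read the draft-depth rows is to convert acceptance into tokens emitted per verification pass. This is a rough sketch under two assumptions of mine: that "draft acceptance" means accepted drafts divided by drafted tokens, and that with p-min 0.0 every pass drafts exactly n-max tokens. The inputs are the single-run sweep values from the table.

```python
# (n_max, measured tok/s, draft acceptance) from the sweep table, p-min 0.0 rows only.
sweep = [(2, 77.5, 0.781), (3, 80.2, 0.703), (4, 70.7, 0.534)]

results = {}
for n_max, tok_s, acceptance in sweep:
    # Assumption: acceptance = accepted / drafted, and each pass drafts n_max tokens.
    accepted_per_pass = n_max * acceptance
    tokens_per_pass = 1 + accepted_per_pass   # +1 for the main model's own token
    passes_per_s = tok_s / tokens_per_pass    # implied verification passes per second
    results[n_max] = (tokens_per_pass, passes_per_s)
    print(f"n-max {n_max}: {tokens_per_pass:.2f} tok/pass, {passes_per_s:.1f} passes/s")
```

Under those assumptions, n-max 4 emits about 3.14 tokens per pass against 3.11 at n-max 3, so the extra drafted token buys almost nothing while each pass gets slower. That is consistent with the throughput drop, though with single-run inputs it is a pattern, not a measurement.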

Honest caveats

Keeping these front and center, because they're the difference between a benchmark and a benchmark you can trust:

~75 tok/s is this box's number (RTX 3090, WSL2), as a 12-run mean. My first writeup said 80.2 from a 3-run sample — that was a lucky high draw, and re-running at n=12 corrected it to ~75. Generation under MTP has real run-to-run variance (CV ~5–7%) because draft acceptance fluctuates.
Prefill numbers are noisy: my test prompt was short (~56 tokens), so I'm not headlining prefill. (A reader rightly asked about prompt processing at >64k context, where prefill can dominate latency. MTP only speeds up generation, not prefill. I measured that in the follow-up on the prefill wall, where a 64k prompt is a ~59-second wait before the first token and MTP's 2× on generation shrinks to a ~3% improvement in total latency.)
The bartowski Q4_K_M and Ollama's Q4_K_M are the same quantization family but different conversions (the rope change above), so they're not bit-identical weights. The model and quant family are matched; the conversion isn't.
Single GPU, single request. No batching or concurrency tested — that's a different question.
One benchmarking trap that cost me time: llama-cli -n <N> is ignored under -no-cnv, so the model just generates until timeout (mine produced a 2 GB output file and looked like a 39-minute hang — it was runaway generation). Use llama-bench for token-exact non-MTP runs, and llama-server with n_predict for MTP.
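
The prefill caveat above is easy to sanity-check with arithmetic. A back-of-the-envelope sketch: the 59 s prefill figure is from the follow-up post, the generation rates are the mainline n=12 means from this post, and the 200-token response length is my assumption (it matches the benchmarks here, not necessarily the follow-up's setup).

```python
def total_latency_s(prefill_s, n_tokens, gen_tok_s):
    """Time to finish a response: prefill wait plus generation time."""
    return prefill_s + n_tokens / gen_tok_s

prefill_s = 59.0   # ~59 s wait for a 64k prompt, from the follow-up post
n_tokens = 200     # assumption: same 200-token generation as the benchmarks here

without_mtp = total_latency_s(prefill_s, n_tokens, 45.1)  # mainline, MTP off (n=12 mean)
with_mtp = total_latency_s(prefill_s, n_tokens, 74.6)     # mainline, MTP on (n=12 mean)
saving_pct = (without_mtp - with_mtp) / without_mtp * 100

print(f"{without_mtp:.1f} s -> {with_mtp:.1f} s ({saving_pct:.1f}% less)")
```

With these inputs the saving comes out close to 3% of total latency: generation is only a few seconds of a roughly one-minute wait, so speeding it up barely moves the total.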

Reproduce it

Hardware: RTX 3090 24 GB (Ampere, sm86), WSL2 Ubuntu 24.04, driver 591.74, nvcc 12.0.
ik_llama.cpp (commit bbe1a51): cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_NATIVE=ON
llama.cpp / mainline (commit e3471b3): cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DBUILD_SHARED_LIBS=OFF
Models: bartowski/Qwen_Qwen3.6-27B-GGUF (Q4_K_M, IQ4_XS); a nextn/MTP-equipped Qwen3.6-27B-MTP-IQ4_XS GGUF for the speculative step.
Non-MTP bench: llama-bench -m <gguf> -p 56 -n 200 -ngl 99 -fa 1 -r 3
MTP run, mainline: llama-server -m Qwen3.6-27B-MTP-IQ4_XS.gguf -ngl 99 -fa on -c 4096 --spec-type draft-mtp --spec-draft-n-max 3
MTP run, ik_llama: same model/flags, but --spec-type mtp:n_max=3,p_min=0.0.
Measuring either MTP run: POST /completion with n_predict: 200; draft acceptance ≈ 70%.
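
The POST step can be scripted with the standard library alone. This is a sketch, not the exact harness I used: it assumes the server listens on llama-server's default port 8080 and that the JSON response carries a "timings" object with predicted_n and predicted_ms (true of mainline llama-server as far as I know; I'm assuming ik_llama's server matches). The helper names are hypothetical.

```python
import json
import urllib.request

def build_request(prompt, n_predict=200, url="http://127.0.0.1:8080/completion"):
    """Build a POST request for the /completion endpoint.

    Assumption: the server is on llama-server's default port 8080.
    """
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

def gen_tok_per_s(response):
    """Generation speed from a parsed /completion response.

    Assumption: response["timings"] holds predicted_n (tokens) and predicted_ms.
    """
    timings = response["timings"]
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)

def run_once(prompt, n_predict=200):
    """Send one request to a running server and return generation tok/s."""
    with urllib.request.urlopen(build_request(prompt, n_predict), timeout=600) as resp:
        return gen_tok_per_s(json.load(resp))

# Example (needs a running server), e.g. 12 runs for a stable mean:
# runs = [run_once("Explain speculative decoding in one paragraph.") for _ in range(12)]
```

Capping generation with n_predict in the request body is what keeps each run token-exact, which sidesteps the runaway-generation trap described in the caveats.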

Wrap-up

So the reader's nudge was a good one — Ollama really was leaving a clean ~2× on the table for this model on this card, and most of it is the MTP step. Ollama stays my default for everyday use (it's simple and it's what my tooling talks to); this build is the "I want every token/sec" setup. And honestly the best part of posting it was the correction: the thread caught both my legacy-flag mistake (ik_llama does MTP) and my lucky 80.2 draw (the honest 12-run mean is ~75) — so the version you're reading is the one the community helped get right.

Related posts

The Prefill Wall: Why MTP's 2× Barely Moves Long-Context Latency (Qwen3.6-27B, RTX 3090)

MTP Isn't Always a Win: 1.95× on My 3090, but Speculative Decoding Is Hardware-Dependent

The Ollama num_ctx Trap: a Default You Never Set Can Halve Your Tokens/sec (Full Sweep on a 3090)

Building a Fully-Local Research RAG on 2× GTX 1080 Ti + an RTX 3090: 3 Gotchas (CPU Embeddings, the Context Trap, and Not Merging GPUs)