Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)
A commenter pointed me at a faster backend and multi-token prediction to roughly double my 3090's throughput. I measured it one lever at a time: 35.7 tok/s on Ollama → 80.2 on llama.cpp with MTP, a real 2.25×. Here's the exact path that got me there, with the numbers and the gotchas.
A reader on my last post said Ollama was leaving a lot on the table — that a tuned backend with multi-token prediction (MTP) could roughly double my 3090's throughput. So I went and measured it, one lever at a time. The short version: they were right, the 2.25× is real, and below is the exact path that got me there on my box.
TL;DR
On a single RTX 3090, Qwen3.6-27B generation went from 35.7 tok/s (Ollama) to 80.2 tok/s (llama.cpp + MTP) — a measured 2.25× — by stacking three independent levers: a leaner engine, a smaller quant, and speculative decoding. The interesting part isn't the headline; it's which lever bought how much, and a couple of things that tripped me up on the way. (To be precise up front: MTP on its own is 1.78× at the same quant — the 2.25× is what you get when all three levers stack.)
The lever table
All on one RTX 3090, Qwen3.6-27B, 200 tokens generated, flash-attention on:
| step | what changed | backend | quant | MTP | gen tok/s | vs Ollama | VRAM |
|---|---|---|---|---|---|---|---|
| baseline | — | Ollama | Q4_K_M | — | 35.7 | 1.00× | 23.2 GB |
| 1 | engine | ik_llama.cpp | Q4_K_M | — | 41.9 | 1.17× | 17.3 GB |
| 2 | + quant | ik_llama.cpp | IQ4_XS | — | 47.5 | 1.33× | 15.1 GB |
| 3 | + MTP | llama.cpp | IQ4_XS | on | 80.2 | 2.25× | ~15 GB |
A note on fairness: the baseline and steps 1 and 2 each use their engine's own native bench path, while step 3 is measured through llama-server and also switches engine from ik_llama.cpp to mainline llama.cpp. For a clean apples-to-apples read of MTP alone, the same llama-server went 45.1 (MTP off) → 80.2 (MTP on) = 1.78×. So MTP by itself is ~1.78× on identical engine/model/tool; the 2.25× is the full stack vs Ollama. (Both the Ollama baseline and the llama.cpp runs fit fully in VRAM; the baseline ran at num_ctx 8192 and the llama.cpp runs at -c 4096. Generation throughput is largely insensitive to that as long as nothing spills to CPU, though it accounts for part of the VRAM difference in the table.)
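The ratios quoted above can be rechecked from the raw tok/s figures. This is a minimal sketch that uses only the numbers from the table and the fairness note; the dictionary labels and the `speedup` helper are names made up for illustration.

```python
# Generation throughput (tok/s) copied from the lever table, in order.
GEN_TOK_S = {
    "Ollama Q4_K_M": 35.7,
    "ik_llama.cpp Q4_K_M": 41.9,
    "ik_llama.cpp IQ4_XS": 47.5,
    "llama.cpp IQ4_XS + MTP": 80.2,
}
# Same llama-server, same model, MTP switched off (from the fairness note).
MTP_OFF_SAME_SERVER = 45.1


def speedup(new_tok_s: float, old_tok_s: float) -> float:
    """Throughput ratio of a new configuration over an old one."""
    return new_tok_s / old_tok_s


baseline = GEN_TOK_S["Ollama Q4_K_M"]
previous = baseline
for name, tok_s in GEN_TOK_S.items():
    print(
        f"{name:24s} {tok_s:5.1f} tok/s  "
        f"vs Ollama {speedup(tok_s, baseline):.2f}x  "
        f"vs previous row {speedup(tok_s, previous):.2f}x"
    )
    previous = tok_s

# MTP in isolation: identical engine, model, and tool, only MTP toggled.
mtp_only = speedup(GEN_TOK_S["llama.cpp IQ4_XS + MTP"], MTP_OFF_SAME_SERVER)
print(f"MTP alone on llama-server: {mtp_only:.2f}x")
```

The "vs previous row" column is the per-lever gain. For the last row it mixes the engine switch with MTP, which is why the isolated 45.1 → 80.2 comparison is computed separately.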
Levers 1 and 2: engine and quant
Moving the same Q4_K_M model from Ollama to a self-built ik_llama.cpp (CUDA, flash-attention, compiled for the 3090's sm86) took me from 35.7 → 41.9 tok/s, and dropped VRAM from 23.2 → 17.3 GB. Ollama is convenience-first: it sizes things generously and doesn't expose the lower-level knobs, so a hand-built engine is faster out of the gate. Swapping the quant from Q4_K_M to IQ4_XS added a bit more and shrank VRAM further: 47.5 tok/s, 15.1 GB. Together the two levers are roughly a third faster than Ollama (1.33×), and nothing exotic yet.
Lever 3: MTP (where the real jump is)
Multi-token prediction (MTP), a form of speculative decoding, is the big one. The idea: a small, fast drafter (with MTP, the model's own `nextn` prediction layers rather than a separate draft model) predicts several tokens ahead, and the main model verifies them in one pass. When the drafts are accepted, you get multiple tokens for roughly the cost of one. Because the main model verifies every drafted token before it's emitted, the output is what the main model would have produced on its own: this is a throughput win, not a quality tradeoff.
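To make the draft-then-verify idea concrete, here is a toy sketch of greedy speculative decoding. It is not how llama.cpp implements MTP: `target_next` and `draft_next` are made-up stand-in functions over integer tokens, the drafter is a separate function rather than a model's built-in prediction layers, and only greedy decoding is covered (sampling needs a probabilistic accept/reject rule). It shows only why verified output matches plain decoding.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical 'main model': a deterministic next-token rule."""
    return (31 * sum(ctx) + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except on every 4th position."""
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n: int, n_max: int
) -> Tuple[List[int], float]:
    """Draft up to n_max tokens, keep the prefix the target agrees with."""
    out = list(prompt)
    drafted = accepted = 0
    while len(out) - len(prompt) < n:
        # 1) Draft n_max tokens ahead with the cheap model.
        proposal: List[int] = []
        for _ in range(n_max):
            proposal.append(draft(out + proposal))
        drafted += len(proposal)
        # 2) Verify. A real engine scores all drafted positions in one batched
        #    pass; this toy calls the target per token, so it demonstrates
        #    correctness, not speed.
        for tok in proposal:
            expected = target(out)
            if tok == expected:
                out.append(tok)
                accepted += 1
            else:
                out.append(expected)  # first mismatch: take the target's token
                break
        else:
            out.append(target(out))  # all drafts accepted: one bonus token
    rate = accepted / drafted if drafted else 0.0
    return out[len(prompt):len(prompt) + n], rate


prompt = [1, 2, 3]
plain = greedy_decode(target_next, prompt, 40)
spec, acceptance = speculative_decode(target_next, draft_next, prompt, 40, n_max=3)
print("identical output:", spec == plain)
print(f"draft acceptance: {acceptance:.1%}")
```

Every token appended to `out` is the one the target would have chosen at that position, so the two outputs are equal whatever the drafter proposes. A bad drafter only lowers the acceptance rate.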
Two things were worth knowing for my setup:
- In my build, MTP came from mainline `llama.cpp`, not ik_llama. ik_llama got me to ~47 (engine + quant), but I couldn't get MTP running there: my build rejected the `-mtp` flags and ignored the model's `nextn` tensors. Mainline `llama.cpp` added MTP fairly recently (PR #22673, merged 2026-05-16), and that's where it worked for me. (There may well be an ik_llama path I missed; this is just what got it going on my box.)
- Ollama's GGUF couldn't be reused. Qwen3.6 changed `rope.dimension_sections` from 3 to 4 elements; Ollama's stored blob still has the older 3-element layout, so `llama.cpp` refused it (expected 4, got 3). I grabbed a properly-converted GGUF instead (bartowski / a `nextn`-equipped MTP build). A small heads-up if you're tempted to point `llama.cpp` at your existing Ollama blob.
With mainline llama.cpp, an MTP-equipped IQ4_XS GGUF, and --spec-draft-n-max 3, generation hit 80.2 tok/s.
Tuning MTP: more accepted drafts isn't more speed
The one knob that mattered for me was --spec-draft-n-max (how many tokens to draft ahead):
| config | gen tok/s | draft acceptance |
|---|---|---|
| n-max 2 | 77.5 | 78.1% |
| n-max 3 | 80.2 | 70.3% |
| n-max 4 | 70.7 | 53.4% |
| n-max 3 + p-min 0.6 | 54.1 | 80.0% |
| n-max 3 + KV q8_0 | 74.6 | 64.5% |
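The counterintuitive bit: higher acceptance ≠ faster. Pushing p-min to 0.6 raised acceptance to 80% but dropped throughput to 54. My read is that the stricter confidence cutoff ends drafts early, so fewer tokens get drafted and each verification pass yields less, even though more of what is drafted gets accepted. In the other direction, n-max 4 drafts more but acceptance falls to 53%, and the extra rejected drafts cost more than they save. Plain f16 KV beat q8 KV too. n-max 3 with f16 KV was the sweet spot. (I also went looking for a "prefill-off" trick I'd heard about and couldn't find it as a flag in current llama.cpp; --spec-draft-n-max was the lever that actually moved the number for me.)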
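A rough way to see why acceptance and speed come apart is to estimate how many tokens each verification pass yields. The sketch below uses the three plain n-max rows from the table. It assumes every pass drafts exactly n-max tokens and that the acceptance figure is accepted/drafted; neither is confirmed for this build, so treat the output as a back-of-envelope model, not a measurement.

```python
# (n_max, measured gen tok/s, draft acceptance) from the plain n-max rows.
SWEEP = [(2, 77.5, 0.781), (3, 80.2, 0.703), (4, 70.7, 0.534)]


def tokens_per_pass(n_max: int, acceptance: float) -> float:
    """Expected tokens emitted per verification pass.

    Assumption: each pass drafts exactly n_max tokens, `acceptance` is
    accepted / drafted, and the main model always adds one token itself.
    """
    return 1.0 + n_max * acceptance


def wasted_drafts_per_pass(n_max: int, acceptance: float) -> float:
    """Expected drafted tokens thrown away per pass, same assumptions."""
    return n_max * (1.0 - acceptance)


for n_max, tok_s, acc in SWEEP:
    print(
        f"n-max {n_max}: {tok_s:.1f} tok/s measured, "
        f"~{tokens_per_pass(n_max, acc):.2f} tokens/pass, "
        f"~{wasted_drafts_per_pass(n_max, acc):.2f} drafts wasted/pass"
    )
```

Under those assumptions, n-max 3 and n-max 4 both land at about 3.1 tokens per pass, but n-max 4 discards roughly twice as many drafted tokens to get there, which is consistent with it being slower in the table.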
Honest caveats
Keeping these front and center, because they're the difference between a benchmark and a benchmark you can trust:
- 80.2 tok/s is this box's number (RTX 3090, WSL2). The originally-cited "~80" was a different setup; I reproduced ~80 honestly here.
- Prefill numbers are noisy — my test prompt was short (~56 tokens), so I'm not headlining prefill. Generation tok/s is solid (±0.1).
- The bartowski `Q4_K_M` and Ollama's `Q4_K_M` are the same quantization family but different conversions (the rope change above), so they're not bit-identical weights. The model and quant family are matched; the conversion isn't.
- Single GPU, single request. No batching or concurrency tested; that's a different question.
- One benchmarking trap that cost me time: `llama-cli -n <N>` is ignored under `-no-cnv`, so the model just generates until timeout (mine produced a 2 GB output file and looked like a 39-minute hang; it was runaway generation). Use `llama-bench` for token-exact non-MTP runs, and `llama-server` with `n_predict` for MTP.
Reproduce it
- Hardware: RTX 3090 24 GB (Ampere, sm86), WSL2 Ubuntu 24.04, driver 591.74, nvcc 12.0.
- ik_llama.cpp (commit `bbe1a51`): `cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_NATIVE=ON`
- llama.cpp / mainline, has MTP (commit `e3471b3`): `cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DBUILD_SHARED_LIBS=OFF`
- Models: `bartowski/Qwen_Qwen3.6-27B-GGUF` (`Q4_K_M`, `IQ4_XS`); a `nextn`/MTP-equipped `Qwen3.6-27B-MTP-IQ4_XS` GGUF for the speculative step.
- Non-MTP bench: `llama-bench -m <gguf> -p 56 -n 200 -ngl 99 -fa 1 -r 3`
- MTP run (the winner): `llama-server -m Qwen3.6-27B-MTP-IQ4_XS.gguf -ngl 99 -fa on -c 4096 --spec-type draft-mtp --spec-draft-n-max 3`, then POST `/completion` with `n_predict: 200`. Draft acceptance ≈ 70%.
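For the MTP run, the POST to `/completion` can be scripted. This is a minimal sketch using only the Python standard library. It assumes llama-server is listening on `127.0.0.1:8080` and that the response carries a `timings` object with `predicted_n` and `predicted_ms`; check both against your build's actual response. The function names are hypothetical helpers, and the prompt is a placeholder, not the one used for the numbers above.

```python
import json
import urllib.request

# Assumption: llama-server's default host and port; adjust to your setup.
SERVER_URL = "http://127.0.0.1:8080/completion"


def build_request(
    prompt: str, n_predict: int = 200, url: str = SERVER_URL
) -> urllib.request.Request:
    """Build the POST request with a fixed generation length."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def gen_tok_per_s(response: dict) -> float:
    """Generation tok/s from the response's timings block.

    Assumption: `timings.predicted_n` is the number of generated tokens and
    `timings.predicted_ms` the generation time in milliseconds.
    """
    timings = response["timings"]
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)


def run_once(prompt: str, n_predict: int = 200) -> float:
    """Send one request to a running server and return generation tok/s."""
    with urllib.request.urlopen(build_request(prompt, n_predict), timeout=600) as resp:
        return gen_tok_per_s(json.load(resp))
```

With the server from the MTP bullet running, `run_once("your test prompt")` returns one generation tok/s sample; calling it a few times and averaging mirrors the `-r 3` repetition used for the non-MTP bench.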
Wrap-up
So the reader's nudge was a good one — Ollama really was leaving a clean ~2× on the table for this model on this card, and most of it is the MTP step. Ollama stays my default for everyday use (it's simple and it's what my tooling talks to); this build is the "I want every token/sec" setup. If you've gotten MTP working under ik_llama, or found the prefill trick, I'd genuinely like to hear how — that's the part I couldn't crack.