MTP Isn't Always a Win: 1.95× on My 3090, but Speculative Decoding Is Hardware-Dependent
MTP gave Gemma 4 12B QAT a 1.95x generation speedup on my 3090. But the same model with the same MTP draft runs 0.87x — slower — on an M1 Max. Speculative decoding is a hardware-dependent lever, not a free switch. Here are the measured numbers and why the draft-to-verify ratio decides it.
In my MTP post, speculative decoding roughly doubled Qwen3.6-27B generation on a 3090. It's tempting to read that as "turn on MTP, go faster." So I measured it on a different model — Gemma 4 12B QAT — and it's a big win on my 3090. But the same model with the same MTP draft runs slower on an M1 Max. MTP isn't a free switch; it's a hardware-dependent lever.
My 3090 numbers
Gemma 4 12B QAT (UD-Q4_K_XL) + an MTP draft head (Q8_0-MTP, a 0.47 GB nextn head, not a full second model), single RTX 3090, decode tok/s, 3 runs each:
| config | mean tok/s | speedup | draft acceptance |
|---|---|---|---|
| baseline (no MTP) | 85.9 | 1.00× | — |
MTP n-max 2 | 159.4 | 1.86× | 0.77 |
MTP n-max 3 | 167.4 | 1.95× | 0.69 |
A clean ~1.9× on the 3090, and unusually stable — run-to-run CV was under 0.5% (my earlier Qwen3.6-27B MTP runs were far noisier at 5–7%, needing a dozen runs; Gemma here settled in three). Same n-max 3 sweet spot as Qwen, same counterintuitive shape: deeper speculation has lower per-token acceptance (0.69 vs 0.77) but higher throughput, because more tokens land per verify step. Per-category, the win ranged 1.8×–2.2× (RAG and coding best at ~2.2×). The whole thing fit in ~8 GB of VRAM.
The same model, slower on an M1 Max
Here's the part worth the post. A cross-hardware benchmark (another tester's runs, same speed_bench.py harness and --jinja settings, so it's comparable) put Gemma 4 12B QAT+MTP at:
| hardware | MTP speedup (n-max 2) |
|---|---|
| RTX 3090 (mine) | 1.86× |
| RTX 5070 Ti laptop | 1.74× |
| M1 Max (16", 64 GB) | 0.87× — slower |
Same model, same MTP draft, and the M1 Max actually loses ~13% by turning MTP on.
Why MTP can make you slower
Speculative decoding wins when verifying a batch of drafted tokens is cheap relative to generating them one at a time. The draft head proposes several tokens, the main model checks them in one parallel pass, and accepted drafts give you multiple tokens for about the cost of one verify. That only pays off when you have spare compute and the verify pass is cheap:
- Capable CUDA GPU (3090, 5070 Ti): lots of compute headroom, the parallel verify is cheap, the drafts land → 1.7–1.9×.
- Apple Silicon (M1 Max), unified memory: running the draft adds compute the architecture doesn't have to spare relative to its memory bandwidth, and that overhead outweighs the parallel-verify gain. Net result: slower than just decoding normally.
So MTP's speedup is a function of the draft-cost-to-verify-cost ratio on your specific hardware, not a property of the technique. Capable CUDA GPU: yes. Apple Silicon: measure first — it can backfire.
Honest caveats
- Only the 3090 numbers are mine; the M1 Max / 5070 Ti figures are another tester's runs. Same harness and settings, so the comparison is fair, but it isn't a single controlled rig — read the cross-machine table as directional.
- Gemma 4 12B
itruns as a thinking model under--jinja(output routes to a reasoning channel); this doesn't affect the decode-tok/s measurement, which comes from the server's own token timings, and all machines used the same--jinja. - MTP throughput can vary run to run; the 3090 numbers were stable (CV < 0.5%), but the M1 Max's 0.87× is close enough to 1.0 that it's worth re-running before treating it as exact — the direction (a net loss) is the point.
Reproduce it (3090)
- llama.cpp mainline, commit
e3471b3(already accepts the Gemma MTP draft — no special build), CUDA sm86. - Models from
unsloth/gemma-4-12B-it-qat-GGUF: maingemma-4-12B-it-qat-UD-Q4_K_XL.gguf, draftgemma-4-12B-it-Q8_0-MTP.gguf. - Server:
llama-server -m <main> --model-draft <draft> --spec-type draft-mtp --spec-draft-n-max 3 -np 1 -ub 512 -c 16384 -ngl 99 -fa on --jinja - Client:
speed_bench.py --bench qualitative --category all --osl 1024 --concurrency 1, compared withspeed_bench_compare.py.
Takeaway
MTP is one of the best generation-speed levers on a capable CUDA GPU — ~1.9× here on a 3090, for free quality-wise (the verify keeps the output exact). But "speculative decoding makes you faster" is a hardware claim, not a universal one. On Apple Silicon it can be pure overhead. Measure it on your own box before you commit to it.
관련 글
The Prefill Wall: Why MTP's 2× Barely Moves Long-Context Latency (Qwen3.6-27B, RTX 3090)
6월 10일 · 6 min read
Local LLMDoubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → ~75 tok/s)
6월 9일 · 8 min read
Local LLMGemma 4 QAT on a 1080 Ti: What 'Quantization-Aware' Actually Buys — and Fitting the 12B on 8 GB at 16k
6월 10일 · 6 min read
Local LLMThe Ollama num_ctx Trap: a Default You Never Set Can Halve Your Tokens/sec (Full Sweep on a 3090)
6월 7일 · 4 min read