Gemma 4 QAT on a 1080 Ti: What 'Quantization-Aware' Actually Buys — and Fitting the 12B on 8 GB at 16k
QAT is the buzz around Gemma 4, so I ran it on actual old hardware. The quality claim holds up (vs naive Q4), the speed win is modest (~9%), and yes — you can run the 12B on an 8 GB card at 16k context. Here are the measured numbers and the exact recipe.
Quantization-Aware Training (QAT) is the headline feature of the Gemma 4 release: models trained to survive 4-bit quantization, so the Q4 version stays close to full quality instead of degrading the way a naive post-training quant does. The pitch is great. I wanted to know what it actually buys on hardware most people would call obsolete — a GTX 1080 Ti — and whether it makes the 12B usable on an 8 GB card. So I measured three things: quality, speed, and footprint.
Short version: the quality claim is real (against naive Q4), the speed win is modest (~9% over a regular Q4), and the 12B fits an 8 GB GPU at 16k context if you quantize the KV cache. Details below.
1. Quality: the part QAT is actually about
QAT's whole point is quality retention at Q4. Unsloth publishes a clean way to see it: top-1 token agreement with the full-precision model, comparing their dynamic UD-Q4_K_XL against a naive Q4_0:
| model | UD-Q4_K_XL | naive Q4_0 |
|---|---|---|
| Gemma 4 E2B | 98.16% | 89.29% |
| Gemma 4 12B | 88.76% | 74.08% |
| Gemma 4 31B | 96.67% | 87.91% |
For the 12B that's a ~15-point gap — naive 4-bit drops a lot of the model's token choices, QAT + dynamic quant keeps most of them, at ~72% less memory than BF16 (6.72 GB vs 23.8 GB). That's the real optimization.
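To make the metric concrete, here is a minimal sketch of what "top-1 token agreement" means as I read it: the fraction of positions where the quantized model picks the same most-likely token as the full model. The function names and the toy token ids are mine, and I'm not claiming this is how Unsloth's harness computes it; the last two lines just redo the arithmetic on the 12B figures quoted above.

```python
def top1_agreement(reference_ids, candidate_ids):
    """Fraction of positions where both models pick the same top-1 token."""
    if len(reference_ids) != len(candidate_ids):
        raise ValueError("sequences must be the same length")
    if not reference_ids:
        raise ValueError("need at least one position")
    matches = sum(r == c for r, c in zip(reference_ids, candidate_ids))
    return matches / len(reference_ids)


def memory_saving(quant_gb, full_gb):
    """Relative memory reduction of the quantized build vs the full model."""
    return 1.0 - quant_gb / full_gb


# Toy token ids (made up): the candidate disagrees at one of eight positions.
ref = [5, 9, 2, 7, 7, 1, 3, 8]
cand = [5, 9, 2, 7, 4, 1, 3, 8]
print(f"top-1 agreement (toy data): {top1_agreement(ref, cand):.1%}")

# Figures quoted above for the 12B row.
print(f"gap vs naive Q4_0: {88.76 - 74.08:.2f} points")
print(f"memory saving vs BF16: {memory_saving(6.72, 23.8):.1%}")
```

In a real run the two id lists would come from an argmax over each model's logits on the same evaluation text.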
Honest caveat on this: that big gap is against naive Q4_0. Against a good modern quant like Q4_K_M, the difference is much smaller — which matched my own experience: on my coarse hands-on probes (token-glitch counts, a niche domain question) I couldn't reliably separate the QAT build from a solid Q4_K_M. So I trust the benchmark numbers for the quality story, not my eyeballs — and the practical read is "QAT is the best quality-per-byte option, but if you're already on a good Q4_K_M the day-to-day difference is subtle."
One useful, slightly counterintuitive tip from Unsloth: stick to UD-Q4_K_XL — going to higher precision (Q5/Q6/Q8) of these QAT models actually degrades accuracy, because the QAT was tuned for the 4-bit target.
2. Speed and size on a 1080 Ti
I ran Gemma 4 12B three ways on a single 8-year-old GTX 1080 Ti (num_ctx 8192, 100% GPU):
| build | gen tok/s | VRAM |
|---|---|---|
| regular Q4 | 28.3 | 7.6 GB |
| Google QAT | 31.0 | 7.5 GB |
| Unsloth QAT (UD-Q4_K_XL) | 30.8 | 7.2 GB |
So the QAT builds are ~9% faster and slightly smaller than a regular Q4 — a real but modest win, and all three run fully on one old card. Don't expect the quality numbers above to also make it dramatically faster; the speed/size gain is incremental. The headline is "a 12B runs comfortably at ~30 tok/s on a 1080 Ti," which is itself a nice statement about QAT-sized models on old hardware.
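For anyone who wants to check the "~9%" figure, the arithmetic is just a relative change against the regular Q4 row. A small sketch using the numbers from the table (the dictionary layout and helper names are my own):

```python
# Measured values copied from the table above.
results = {
    "regular Q4": {"tok_s": 28.3, "vram_gb": 7.6},
    "Google QAT": {"tok_s": 31.0, "vram_gb": 7.5},
    "Unsloth QAT (UD-Q4_K_XL)": {"tok_s": 30.8, "vram_gb": 7.2},
}


def relative_change(value, baseline):
    """Signed relative change of value against baseline (0.09 means +9%)."""
    return (value - baseline) / baseline


def compare(results, baseline_name):
    """Return {build: (speed change, VRAM change)} for every non-baseline build."""
    base = results[baseline_name]
    return {
        name: (
            relative_change(row["tok_s"], base["tok_s"]),
            relative_change(row["vram_gb"], base["vram_gb"]),
        )
        for name, row in results.items()
        if name != baseline_name
    }


for name, (speed, vram) in compare(results, "regular Q4").items():
    print(f"{name}: speed {speed:+.1%}, VRAM {vram:+.1%}")
```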
3. The useful part: fitting the 12B on 8 GB at 16k context
This is the question I actually get asked: can you run the 12B on an 8 GB card with a 16k context and keep it fast? The model weights are ~7 GB, so on an 8 GB card you have ~1 GB for the KV cache — and 16k of KV is the squeeze. I measured the footprint at 16k with each KV-cache type (single GPU, flash-attention on):
| KV cache | VRAM @ 16k | fits 8 GB? |
|---|---|---|
| f16 (default) | 7.7 GB | ❌ no (driver reserve pushes it over) |
| q8_0 | 7.4 GB | ✅ yes (tight, ~0.5 GB headroom) |
| q4_0 | 7.2 GB | ✅ yes (more margin) |
All three stayed 100% GPU at 16k. The default f16 KV (7.7 GB) won't reliably fit an 8 GB card once you count the driver/display reserve — which is why a naive attempt spills to CPU and crawls. Quantize the KV to q8_0 and you're at 7.4 GB with negligible quality cost; that's the sweet spot. Drop to q4_0 if you've got a display attached and want margin.
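The decision above can be sketched as a small lookup: take the measured footprint per KV type, subtract a reserve from the card's capacity, and pick the highest-precision cache that still fits. The reserve values here (0.4 GB headless, 0.7 GB with a display) are illustrative guesses of mine, not measurements, so treat the output as a way to reason about margin and not as a guarantee.

```python
# Measured VRAM at 16k context, from the table above (GB).
MEASURED_GB = {"f16": 7.7, "q8_0": 7.4, "q4_0": 7.2}

# Highest precision first; we take the first one that fits.
PREFERENCE = ["f16", "q8_0", "q4_0"]


def pick_kv_type(measured_gb, card_gb, reserve_gb):
    """Return the highest-precision KV type that fits, or None if none do.

    reserve_gb is an assumed allowance for driver/display overhead.
    """
    usable = card_gb - reserve_gb
    for kv_type in PREFERENCE:
        if measured_gb[kv_type] <= usable:
            return kv_type
    return None


# Assumed reserves: ~0.4 GB headless, ~0.7 GB with a monitor attached.
print("8 GB headless:    ", pick_kv_type(MEASURED_GB, 8.0, 0.4))
print("8 GB with display:", pick_kv_type(MEASURED_GB, 8.0, 0.7))
print("11 GB (1080 Ti):  ", pick_kv_type(MEASURED_GB, 11.0, 0.4))
```

With those assumed reserves the sketch lands on the same advice as the text: q8_0 headless, q4_0 with a display.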
A neat detail: the KV cache at 16k is small here (q8 and q4 differ by only ~0.2 GB) because Gemma interleaves sliding-window (local) and global attention, so most layers cap their KV at the window size regardless of context length. The footprint is dominated by the ~7 GB weights, and KV quantization just buys the last ~0.3–0.5 GB you need to slip under 8 GB (7.7 GB at f16 vs 7.4 GB at q8_0 or 7.2 GB at q4_0). (This is the flip side of the prefill wall: cheap KV doesn't make the prompt process faster, it just makes it fit.)
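Here is a rough sketch of why interleaving keeps the KV cache small. Every architecture number in it (48 layers, one global layer in six, a 1024-token window, 8 KV heads, head dimension 256) is a placeholder I picked for illustration, not Gemma 4's actual configuration, so it will not reproduce the measured table. The bytes-per-element figures follow the GGML block layouts: q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes.

```python
# Bytes per cached element for each KV cache type.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(ctx, n_layers, global_every, window,
                   n_kv_heads, head_dim, cache_type):
    """Estimate KV cache size when 1 in `global_every` layers is global.

    Global layers cache every token in the context; local layers cache
    at most `window` tokens. global_every=1 means all layers are global.
    """
    per_token = 2 * n_kv_heads * head_dim  # one K and one V vector per layer
    n_global = n_layers // global_every
    n_local = n_layers - n_global
    cached_tokens = n_global * ctx + n_local * min(ctx, window)
    return cached_tokens * per_token * BYTES_PER_ELEMENT[cache_type]


# Placeholder architecture (hypothetical, NOT the real Gemma 4 config).
arch = dict(n_layers=48, window=1024, n_kv_heads=8, head_dim=256)
GIB = 1024 ** 3

for cache_type in ("f16", "q8_0", "q4_0"):
    mixed = kv_cache_bytes(16384, global_every=6, cache_type=cache_type, **arch)
    full = kv_cache_bytes(16384, global_every=1, cache_type=cache_type, **arch)
    print(f"{cache_type}: interleaved {mixed / GIB:.2f} GiB, "
          f"all-global {full / GIB:.2f} GiB")
```

The useful part is the ratio between the two columns: with most layers capped at the window, only the few global layers grow with context length.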
The recipe that works:
```shell
# ollama
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# then run gemma-4-12b-qat with num_ctx 16384, all layers on GPU

# llama.cpp equivalent
llama-server -m gemma-4-12b-qat-UD-Q4_K_XL.gguf -c 16384 -fa on -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0
```
Check that ollama ps reports 100% GPU and that nvidia-smi shows the VRAM use you expect. If any layer offloads to CPU, throughput tanks, and that's your signal to drop the KV to q4_0 or trim the context.
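If you'd rather script the VRAM check than eyeball it, a minimal sketch is to parse nvidia-smi's CSV query output. The helper names are mine and the sample values are made up; the query flags are standard nvidia-smi options that report used and total memory in MiB, one line per GPU. This only tells you about headroom. The CPU/GPU split itself still comes from ollama ps.

```python
import subprocess

# Standard nvidia-smi query: used and total memory in MiB, one line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_text):
    """Parse 'used, total' lines into a list of (used_mib, total_mib)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        gpus.append((used, total))
    return gpus


def headroom_gib(used_mib, total_mib):
    """Free VRAM in GiB for one GPU."""
    return (total_mib - used_mib) / 1024


def read_live():
    """Run nvidia-smi and return its parsed output (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(out.stdout)


# Sample text in the shape nvidia-smi prints for this query (values made up).
sample = "7578, 11264\n"
for index, (used, total) in enumerate(parse_memory(sample)):
    print(f"GPU {index}: {used} / {total} MiB used, "
          f"{headroom_gib(used, total):.2f} GiB free")
```

On a machine with the driver installed, swap `parse_memory(sample)` for `read_live()` while the model is loaded.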
Honest caveats
- Speed numbers are at num_ctx 8192; the 8 GB/16k footprint numbers are at 16k — different tests, both on the same model/quant.
- The 8 GB fit was measured on a 1080 Ti (11 GB) constrained to one GPU; I'm reporting the actual VRAM used, from which the 8 GB fit is clear, but a real 8 GB card with a display attached has slightly less usable VRAM, so q8_0 is "fits headless," and q4_0 is the safer bet with a monitor plugged in.
- The quality numbers are Unsloth's published top-1 agreement, not my own benchmark run; my hands-on probes were too coarse to add to them.
Wrap-up
Is Gemma 4 QAT a good model? Yes — it's the best quality-per-byte way to run Gemma 4 locally, the quality retention vs naive Q4 is real and measured, and on practical hardware it's genuinely useful: a 12B at ~30 tok/s on a 1080 Ti, and a 12B at 16k on an 8 GB card if you quantize the KV. Just don't expect the "near-BF16 quality" story to also mean a big speedup — the speed/size win over a good Q4_K_M is modest. The real story is accessibility: QAT is what lets a 12B feel comfortable on a card most people wrote off years ago.