
Gemma 4 QAT on a 1080 Ti: What 'Quantization-Aware' Actually Buys — and Fitting the 12B on 8 GB at 16k

QAT is the buzz around Gemma 4, so I ran it on actual old hardware. The quality claim holds up (vs naive Q4), the speed win is modest (~9%), and yes — you can run the 12B on an 8 GB card at 16k context. Here are the measured numbers and the exact recipe.


Running a quantization-aware model on an old GPU

Quantization-Aware Training (QAT) is the headline feature of the Gemma 4 release: models trained to survive 4-bit quantization, so the Q4 version stays close to full quality instead of degrading the way a naive post-training quant does. The pitch is great. I wanted to know what it actually buys on hardware most people would call obsolete — a GTX 1080 Ti — and whether it makes the 12B usable on an 8 GB card. So I measured three things: quality, speed, and footprint.

Short version: the quality claim is real (against naive Q4), the speed win is modest (~9% over a regular Q4), and the 12B fits an 8 GB GPU at 16k context if you quantize the KV cache. Details below.

1. Quality: the part QAT is actually about

QAT's whole point is quality retention at Q4. Unsloth publishes a clean way to see it: top-1 token agreement with the full-precision model, comparing their dynamic UD-Q4_K_XL quant against a naive Q4_0:

| model | UD-Q4_K_XL | naive Q4_0 |
| --- | --- | --- |
| Gemma 4 E2B | 98.16% | 89.29% |
| Gemma 4 12B | 88.76% | 74.08% |
| Gemma 4 31B | 96.67% | 87.91% |

For the 12B that's a ~15-point gap (88.76% vs 74.08%): naive 4-bit changes a lot of the model's top token choices, while QAT + dynamic quant keeps most of them, at ~72% less memory than BF16 (6.72 GB vs 23.8 GB). That's the real optimization.
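The gap and the memory saving are simple arithmetic on the published figures. A minimal Python sketch to reproduce them, using only the numbers quoted above (the helper names are made up for illustration):

```python
# Figures copied from the table and paragraph above (Unsloth's published numbers).
agreement = {
    "Gemma 4 E2B": (98.16, 89.29),
    "Gemma 4 12B": (88.76, 74.08),
    "Gemma 4 31B": (96.67, 87.91),
}

def gap_points(ud_q4_k_xl: float, naive_q4_0: float) -> float:
    """Top-1 agreement gap in percentage points."""
    return round(ud_q4_k_xl - naive_q4_0, 2)

def memory_saving_pct(quant_gb: float, bf16_gb: float) -> float:
    """Memory saved by the quantized build relative to BF16, in percent."""
    return round((1 - quant_gb / bf16_gb) * 100, 1)

gaps = {name: gap_points(*pair) for name, pair in agreement.items()}
saving = memory_saving_pct(6.72, 23.8)

for name, gap in gaps.items():
    print(f"{name}: {gap} points")
print(f"memory saving vs BF16: {saving}%")
```

Note that the 12B shows the widest gap of the three; the E2B and 31B gaps come out closer to 9 points.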

Honest caveat on this: that big gap is against naive Q4_0. Against a good modern quant like Q4_K_M the difference is much smaller, which matched my own experience: on my coarse hands-on probes (token-glitch counts, a niche domain question) I couldn't reliably separate the QAT build from a solid Q4_K_M. So I trust the benchmark numbers for the quality story, not my eyeballs. The practical read is "QAT is the best quality-per-byte option, but if you're already on a good Q4_K_M the day-to-day difference is subtle."

One useful, slightly counterintuitive tip from Unsloth: stick to UD-Q4_K_XL — going to higher precision (Q5/Q6/Q8) of these QAT models actually degrades accuracy, because the QAT was tuned for the 4-bit target.

2. Speed and size on a 1080 Ti

I ran Gemma 4 12B three ways on a single 8-year-old GTX 1080 Ti (num_ctx 8192, 100% GPU):

| build | gen tok/s | VRAM |
| --- | --- | --- |
| regular Q4 | 28.3 | 7.6 GB |
| Google QAT | 31.0 | 7.5 GB |
| Unsloth QAT (UD-Q4_K_XL) | 30.8 | 7.2 GB |

So the QAT builds are ~9% faster (31.0 and 30.8 tok/s vs 28.3) and slightly smaller (7.5 and 7.2 GB vs 7.6 GB) than a regular Q4: a real but modest win, and all three run fully on one old card. Don't expect the quality numbers above to translate into a dramatic speedup; the speed/size gain is incremental. The headline is "a 12B runs comfortably at ~30 tok/s on a 1080 Ti," which is itself a nice statement about QAT-sized models on old hardware.
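For anyone who wants to plug in their own measurements, the "~9%" is just the ratio of generation speeds. A small Python sketch using the three tok/s figures from the table (the function name is illustrative):

```python
# Generation speeds from the table above, in tokens per second.
baseline_tok_s = 28.3  # regular Q4
qat_builds = {
    "Google QAT": 31.0,
    "Unsloth QAT (UD-Q4_K_XL)": 30.8,
}

def speedup_pct(tok_s: float, baseline: float) -> float:
    """Relative speedup over the baseline, in percent."""
    return round((tok_s / baseline - 1) * 100, 1)

speedups = {name: speedup_pct(v, baseline_tok_s) for name, v in qat_builds.items()}

for name, pct in speedups.items():
    print(f"{name}: +{pct}% vs regular Q4")
```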

3. The useful part: fitting the 12B on 8 GB at 16k context

This is the question I actually get asked: can you run the 12B on an 8 GB card with a 16k context and keep it fast? The model weights are ~7 GB, so on an 8 GB card you have ~1 GB for the KV cache — and 16k of KV is the squeeze. I measured the footprint at 16k with each KV-cache type (single GPU, flash-attention on):

| KV cache | VRAM @ 16k | fits 8 GB? |
| --- | --- | --- |
| f16 (default) | 7.7 GB | ❌ no (driver reserve pushes it over) |
| q8_0 | 7.4 GB | ✅ yes (tight, ~0.5 GB headroom) |
| q4_0 | 7.2 GB | ✅ yes (more margin) |

On the 11 GB test card, all three stayed 100% GPU at 16k. On a real 8 GB card, though, the default f16 KV (7.7 GB) won't reliably fit once you count the driver/display reserve, which is why a naive attempt spills to CPU and crawls. Quantize the KV to q8_0 and you're at 7.4 GB with negligible quality cost; that's the sweet spot. Drop to q4_0 if you've got a display attached and want margin.
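The fit decision boils down to "measured footprint vs card size minus whatever the driver and display hold back". Here is a rough Python sketch of that check. The two reserve values (0.4 GB headless, 0.7 GB with a display) are assumed placeholders chosen for illustration, not measurements; read the real figure off your own card before relying on it.

```python
# VRAM measured at 16k context, copied from the table above (GB).
measured_gb = {"f16": 7.7, "q8_0": 7.4, "q4_0": 7.2}

def fits(used_gb: float, card_gb: float = 8.0, reserve_gb: float = 0.4) -> bool:
    """True if the footprint fits once the reserve is subtracted.

    reserve_gb is an ASSUMED placeholder for driver/display overhead,
    not a measured value.
    """
    return used_gb <= card_gb - reserve_gb

# Hypothetical reserves: 0.4 GB headless, 0.7 GB with a monitor attached.
headless = {kv: fits(gb, reserve_gb=0.4) for kv, gb in measured_gb.items()}
with_display = {kv: fits(gb, reserve_gb=0.7) for kv, gb in measured_gb.items()}

print("headless:    ", headless)
print("with display:", with_display)
```

Under those assumed reserves, q8_0 fits only in the headless case and q4_0 fits in both, which is the same ordering the measurements suggest.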

A neat detail: the KV cache at 16k is small here (q8 and q4 differ by only ~0.2 GB) because Gemma interleaves sliding-window (local) and global attention, so most layers cap their KV at the window size regardless of context length. The footprint is dominated by the ~7 GB weights, and KV quantization just buys the last ~0.3–0.5 GB you need to slip under 8 GB. (This is the flip side of the prefill wall: cheap KV doesn't make the prompt process faster, it just makes it fit.)
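To see why interleaving keeps the cache small, here is a back-of-the-envelope Python estimate. The architecture numbers (48 layers, one global layer in every 6, a 1024-token window, 8 KV heads, head dimension 256) are assumptions picked for illustration, not Gemma 4's published configuration, so the output will not reproduce the measured VRAM figures above. The bytes-per-element values follow GGML's block layouts for q8_0 and q4_0.

```python
# Bytes per cached element. f16 is 2 bytes; GGML's q8_0 and q4_0 store
# 32 elements per block in 34 and 18 bytes respectively.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

# ASSUMED, illustrative architecture numbers (not Gemma 4's published config).
EXAMPLE_ARCH = dict(n_layers=48, global_every=6, window=1024, n_kv_heads=8, head_dim=256)

def kv_cache_bytes(ctx, cache_type, n_layers, global_every, window, n_kv_heads, head_dim):
    """KV bytes when local layers only keep the last `window` tokens."""
    n_global = n_layers // global_every
    n_local = n_layers - n_global
    elements_per_token = 2 * n_kv_heads * head_dim  # one K and one V vector
    cached_tokens = n_global * ctx + n_local * min(ctx, window)
    return cached_tokens * elements_per_token * BYTES_PER_ELEMENT[cache_type]

def full_attention_bytes(ctx, cache_type, n_layers, n_kv_heads, head_dim, **_unused):
    """KV bytes if every layer were global (kept the whole context)."""
    return n_layers * ctx * 2 * n_kv_heads * head_dim * BYTES_PER_ELEMENT[cache_type]

for cache_type in BYTES_PER_ELEMENT:
    interleaved = kv_cache_bytes(16384, cache_type, **EXAMPLE_ARCH)
    all_global = full_attention_bytes(16384, cache_type, **EXAMPLE_ARCH)
    print(f"{cache_type}: interleaved {interleaved / 2**30:.2f} GiB, "
          f"all-global {all_global / 2**30:.2f} GiB")
```

With these placeholder numbers the interleaved cache is under a quarter of what all-global layers would need at 16k, which is the effect described above: context length only scales the cache for the minority of global layers.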

The recipe that works:

# ollama
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
# then run gemma-4-12b-qat with num_ctx 16384, all layers on GPU

# llama.cpp equivalent
llama-server -m gemma-4-12b-qat-UD-Q4_K_XL.gguf -c 16384 -fa on -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0
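One way to set num_ctx per request is through Ollama's local HTTP API, which accepts it under "options". A minimal Python sketch, assuming the server from the recipe is running on the default port and that the model tag on your machine really is gemma-4-12b-qat (it is written here as in the recipe; your local tag may differ):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt: str, model: str = "gemma-4-12b-qat",
                  num_ctx: int = 16384) -> urllib.request.Request:
    """Build a non-streaming generate request with a 16k context window."""
    payload = {
        "model": model,  # tag as written in the recipe; adjust to your local name
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt: str) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]

# Usage (needs `ollama serve` running with the model pulled):
# print(generate("Summarize what QAT is in one sentence."))
```

The KV cache type is not part of the request: it comes from the OLLAMA_KV_CACHE_TYPE environment variable set when the server starts.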

Check that ollama ps reports 100% GPU, and cross-check VRAM use with nvidia-smi. If any layer offloads to CPU, throughput tanks, and that's your signal to drop the KV to q4_0 or trim the context.
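That check is easy to script. The Python sketch below shells out to ollama ps and looks for the "100% GPU" marker on the model's row. It assumes the row begins with the model name; that layout is an assumption about the CLI's output, so adjust the matching if your version prints it differently.

```python
import subprocess

def fully_on_gpu(ps_output: str, model: str) -> bool:
    """True if the row for `model` in `ollama ps` output reports 100% GPU.

    ASSUMPTION: each loaded model is one line that starts with its name.
    """
    for line in ps_output.splitlines():
        if line.strip().startswith(model):
            return "100% GPU" in line
    return False  # model not listed, so it is not loaded

def check_loaded_model(model: str = "gemma-4-12b-qat") -> bool:
    """Run `ollama ps` and report whether the model is fully GPU-resident."""
    result = subprocess.run(["ollama", "ps"], capture_output=True, text=True, check=True)
    return fully_on_gpu(result.stdout, model)

# Usage (needs the ollama CLI on PATH and the model loaded):
# if not check_loaded_model():
#     print("Not fully on GPU: try q4_0 KV cache or a smaller num_ctx")
```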

Honest caveats

  • Speed numbers are at num_ctx 8192; the 8 GB/16k footprint numbers are at 16k — different tests, both on the same model/quant.
  • The 8 GB fit was measured on a 1080 Ti (11 GB) constrained to one GPU, not on an actual 8 GB card. I'm reporting the VRAM actually used, which is under 8 GB for q8_0 and q4_0, but a real 8 GB card with a display attached has slightly less usable VRAM. So q8_0 is "fits headless," and q4_0 is the safer bet with a monitor plugged in.
  • The quality numbers are Unsloth's published top-1 agreement, not my own benchmark run; my hands-on probes were too coarse to add to them.

Wrap-up

Is Gemma 4 QAT a good model? Yes — it's the best quality-per-byte way to run Gemma 4 locally, the quality retention vs naive Q4 is real and measured, and on practical hardware it's genuinely useful: a 12B at ~30 tok/s on a 1080 Ti, and a 12B at 16k on an 8 GB card if you quantize the KV. Just don't expect the "near-BF16 quality" story to also mean a big speedup — the speed/size win over a good Q4_K_M is modest. The real story is accessibility: QAT is what lets a 12B feel comfortable on a card most people wrote off years ago.
