Running Brand-New Gemma 4 12B on an 8-Year-Old GTX 1080 Ti: Speed, 3 Gotchas, and Why Q8 Beat Q4 on My Own Field

Running Gemma 4 12B on an old GTX 1080 Ti

TL;DR (Quick Answer)

Gemma 4 12B just dropped, so I ran it on a GTX 1080 Ti (Pascal, 2017) to see what an 8-year-old card does with a 2026 model. Real numbers, and a few honest surprises:

Speed: ~28 tok/s at Q4_K_M on a single 1080 Ti (~8 GB VRAM). The 12B fits one card, so the second GPU sits idle.
Three things broke before it worked: the GGUF is multimodal and its vision projector crashes Ollama; it's a reasoning model that hides its answer in a thinking channel; and Q4 produces visible token glitches.
The interesting part — Q4 vs Q8. I asked it real bioinformatics questions. At Q4 it answered concepts and code well but got a niche method (the HEIDI test) confidently backwards, with garbled characters sprinkled in. Going to Q8_0 (12.7 GB, split across both 1080 Tis, ~30% slower at ~19.5 tok/s) removed the glitches and fixed the wrong answer.

Bottom line: for chat and drafting, Q4 on one old card is genuinely usable. For work where details matter, the higher quant across two cards is worth the speed hit — and it's the one case where the second 1080 Ti finally earns its keep.

Setup

Hardware: 2× NVIDIA GTX 1080 Ti (11 GB each), Pascal cc 6.1, driver 581.57, via WSL2.
Runtime: Ollama 0.30.2. Gemma 4 isn't in Ollama's library yet, so I pulled the unsloth GGUF: ollama pull hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M.

The 3 things that broke first

1. It's multimodal — and the vision projector crashes Ollama. First generation returned nothing. The logs:

error: Failed to load CLIP model from .../blobs/sha256-7d10888...
llama-server process has terminated: exit status 1

Gemma 4 12B-it ships with a vision (CLIP) projector, and Ollama 0.30.2 fails to load it — taking down the whole model server. If you only want text, you have to strip the projector. Pull the model, then rebuild it text-only from the same blobs (no re-download):

ollama show --modelfile hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M > Gemma4.Modelfile
# delete the second `FROM ...` line — the mmproj/CLIP blob — keep only the text GGUF
ollama create gemma4-12b-text -f Gemma4.Modelfile

2. It's a reasoning model — your answer hides in thinking. With the text model, generation worked but content came back empty while eval_count was 200. The output was all going into the reasoning channel and getting cut off mid-thought at the token cap. Fix: disable thinking.

{ "model": "gemma4-12b-text", "think": false, "messages": [ ... ] }

With think: false, clean answers in ~10 seconds.

3. Q4 has visible token glitches. At Q4_K_M, prose came out with occasional garbled characters — literally self-さattention, ściindicates, stray Korean/Japanese codepoints injected mid-word. Code blocks were clean; only prose was affected. (Spoiler: Q8 fixes this.)

Speed (Q4, single 1080 Ti)

think: false, num_predict=256, measured via Ollama's API:

Generation: ~27.6 tok/s (27.5 / 27.6 / 27.7 — rock stable)
VRAM: ~8 GB on GPU0; GPU1 completely idle (0 MiB) — a 12B at Q4 fits one card, so the second GPU does nothing.

Quality: I asked it about my actual field

Speed is easy; is it useful for real work? I gave it four bioinformatics questions and checked the answers honestly:

Question	Verdict
RNA-seq normalization (raw vs TPM vs FPKM; DESeq2 input)	✅ Correct and precise
Pandas function to filter a DESeq2 results table	✅ Correct, clean, usable
Troubleshoot an implausibly high DEG count	✅ Good — batch effects, PCA, outliers, covariates
What a small HEIDI p-value means (SMR/colocalization)	❌ Confidently backwards

That last one is the lesson. HEIDI is a niche test: a small p-value means the locus fails (heterogeneity/linkage — you filter it out). Q4 Gemma told me a small p-value means a single causal gene — the exact opposite. It was fluent and sure of itself. If you don't already know the answer, that's the dangerous kind of wrong.

The payoff: Q4 vs Q8

So I pulled Q8_0 (12.7 GB) and rebuilt it text-only the same way. At 12.7 GB it no longer fits one card — Ollama splits it across both 1080 Tis (~7 GB each). Same questions:

	Q4_K_M	Q8_0
Size / GPUs	7 GB / 1 card (GPU1 idle)	12.7 GB / 2 cards (~7 GB each)
Speed	~28 tok/s	~19.5 tok/s (−30%)
Token glitches	`self-さattention` etc.	gone — clean ✅
HEIDI answer	backwards ❌	correct ✅ ("small p = fails, filter it out")

Less quantization bought three things: the glitches disappeared, it got the niche domain detail right, and — because the bigger file overflows one card — the otherwise-idle second 1080 Ti finally did work. The cost was ~30% throughput.

(Honesty note: I asked Q8 the HEIDI question with a more pointed framing than Q4, so that single comparison isn't perfectly controlled. The token-glitch difference, on identical prompts, is unambiguous.)

When does the second 1080 Ti actually help?

Combining this with an earlier 35B-MoE run, a clear rule emerges:

Model fits one card (12B Q4): second GPU is idle — useless.
Model overflows one card (12B Q8, or a 35B): it spills to the second card, which now helps.

The second 1080 Ti isn't about speed; it's about fitting a bigger or higher-precision model.

Honest Limitations

One model, two quants, one box; your tok/s will vary with CPU, RAM, and context length.
Q8 HEIDI test used a more direct prompt — suggestive, not a controlled A/B.
Quality judged on a handful of prompts, not a benchmark suite.
Ollama 0.30.2's Gemma 4 support is early (the CLIP crash, the reasoning-channel behavior); later versions may change this.

Reproduce

ollama pull hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M       # or :Q8_0 for the 2-card run
ollama show --modelfile hf.co/unsloth/gemma-4-12b-it-GGUF:Q4_K_M > m.Modelfile
# remove the mmproj/CLIP `FROM` line, keep the text GGUF
ollama create gemma4-12b-text -f m.Modelfile
# then call /api/chat with "think": false

FAQ

Q: Can a GTX 1080 Ti run Gemma 4 12B?

Yes — ~28 tok/s at Q4 on a single card, ~19.5 tok/s at Q8 across two. Just strip the vision projector (it crashes Ollama 0.30.2) and disable the reasoning channel with think: false.

Q: Q4 or Q8?

Q4 for speed and casual use (one card). Q8 when correctness matters: on my field's questions it removed the token glitches and fixed an answer Q4 got backwards — at ~30% lower speed, and it needs both cards.

Q: Why did the second GPU sit idle at Q4?

A 12B at Q4 is ~7 GB and fits one 11 GB card, so Ollama uses one GPU. Only when the model overflows one card (Q8, or a larger model) does the second card get used.

Resources

Model: unsloth/gemma-4-12b-it-GGUF
Related: 35B MoE on 2× 1080 Ti · Ollama

Running Brand-New Gemma 4 12B on an 8-Year-Old GTX 1080 Ti: Speed, 3 Gotchas, and Why Q8 Beat Q4 on My Own Field

TL;DR (Quick Answer)

Setup

The 3 things that broke first

Speed (Q4, single 1080 Ti)

Quality: I asked it about my actual field

The payoff: Q4 vs Q8

When does the second 1080 Ti actually help?

Honest Limitations

Reproduce

FAQ

Resources

관련 글

Gemma 4 QAT on a 1080 Ti: What 'Quantization-Aware' Actually Buys — and Fitting the 12B on 8 GB at 16k

Building a Fully-Local Research RAG on 2× GTX 1080 Ti + an RTX 3090: 3 Gotchas (CPU Embeddings, the Context Trap, and Not Merging GPUs)

Running a 35B MoE (Qwen3.6-35B-A3B) on 2× GTX 1080 Ti in 2026 — Real Benchmarks, and Does the Second GPU Actually Help?

GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M (2026 Real Quality + Speed)