GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M (2026 Real Quality + Speed)
Side-by-side comparison of GGUF quantization formats (Q4_K_M, Q4_K_S, IQ4_XS, Q5_K_M, Q5_K_S, Q8_0), measured on Llama 3.1 8B and Qwen 3 14B with actual perplexity, MMLU-Pro accuracy, VRAM footprint, and tokens/sec on RTX 3090 and GTX 1080 Ti. Practical recommendations for picking the right quant for your hardware.
Quick Answer (TL;DR)
Which GGUF quantization should I use for local LLM inference in 2026?
- Default for 95% of cases: Q4_K_M. English perplexity within 1.4-2.1% of Q8_0, fits most models on consumer 24 GB GPUs, and nearly as fast as Q4_K_S on both Ampere and Pascal
- Best quality if VRAM allows: Q5_K_M. English perplexity within 0.5-0.8% of Q8_0 (about 0.9-1.3% better than Q4_K_M), needs roughly 13-14% more VRAM
- Tightest VRAM (Ampere only): IQ4_XS. 5-10% smaller than Q4_K_M, but 15-25% slower on Pascal (1080 Ti, P40), where its lookup-table decode is expensive
- Avoid in 2026: Q4_0 (legacy, dominated by K-quants), Q3 and below (noticeable quality drop)
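The Pascal-specific caveat: on GTX 1080 Ti or P40, prefer Q4_K_M over IQ-quants even when IQ is smaller. IQ-quant decoding does extra lookup-table work per weight, and Pascal absorbs that cost far less gracefully than Ampere (19 vs 25 tok/s on the 1080 Ti, against 88 vs 96 on the RTX 3090 for Llama 3.1 8B).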
Definition
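GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and Ollama for storing quantized LLM weights. Quantization reduces each weight from 16-bit floating point to roughly 2-8 bits, cutting VRAM use sharply at a small quality cost. The "K-quants" (Q2_K through Q6_K) group weights into super-blocks with per-sub-block scales, and their _S/_M/_L variants keep selected tensors at higher precision (introduced in llama.cpp PR #1684). "IQ-quants" (IQ1 through IQ4) add codebook-style encoding and are typically built with calibration data to preserve quality at lower bit rates (introduced in llama.cpp PR #4773). Q8_0 is not a K-quant: it is a simpler format that stores 8-bit integers with one scale per block of 32 weights.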
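To make the idea of block-wise quantization concrete, here is a minimal sketch in the style of Q8_0: a block of 32 weights shares one scale, and each weight is stored as a signed 8-bit integer. It is an illustration under those assumptions, not llama.cpp's actual code; the function names are hypothetical.

```python
# Sketch of block-wise symmetric quantization in the style of Q8_0.
# Assumption: 32 weights per block, one 16-bit scale per block.
BLOCK_SIZE = 32


def quantize_block(weights):
    """Return (scale, codes): one float scale and one int in [-127, 127] per weight."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate float weights from the stored scale and codes."""
    return [scale * c for c in codes]


def bits_per_weight(block_size=BLOCK_SIZE, code_bits=8, scale_bits=16):
    """Average storage cost per weight, including the shared scale."""
    return (block_size * code_bits + scale_bits) / block_size


if __name__ == "__main__":
    block = [((i * 37) % 64 - 32) / 100.0 for i in range(BLOCK_SIZE)]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"bits per weight: {bits_per_weight()}")
    print(f"worst-case rounding error: {worst:.6f} (scale {scale:.6f})")
```

The shared scale is why a Q8_0 file comes out slightly above 8 bits per weight. Lower-bit formats follow the same pattern with fewer bits per code and more elaborate scale layouts.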
The Question Every LocalLLaMA User Asks Once
You picked your model. Now Ollama / llama.cpp gives you 12 different .gguf files to choose from. Each shaves a fraction of a percent off quality for a chunk of VRAM saved. Which one matters?
The popular tutorials say "Q4_K_M is the sweet spot." That's defensible but incomplete. The right answer depends on:
- Whether your GPU has tensor cores (RTX-class Turing, Ampere, and newer) or not (Pascal, GTX 16-series Turing)
- Whether your model is dense or MoE
- Whether you care about long-form generation quality or just throughput
This guide measures six common quantizations (Q4_K_M, Q4_K_S, IQ4_XS, Q5_K_M, Q5_K_S, Q8_0) on two real models (Llama 3.1 8B, Qwen 3 14B) across two GPUs (RTX 3090, GTX 1080 Ti), reporting perplexity, MMLU-Pro accuracy, VRAM, and tokens/sec. Q3_K_M is included as a low-end reference, and Q5_K_S was measured on Llama 3.1 8B only.
For the broader context on running these on old hardware, see Running Modern LLMs on GTX 1080 Ti in 2026.
Quick Background — What These Quants Actually Are
| Quant | Bits/weight | Tier | Notes |
|---|---|---|---|
| Q8_0 | 8.5 | High | Nearly lossless; baseline for quality comparison |
| Q5_K_M | ~5.7 avg | Medium-High | k-quant with mixed precision; "M" = medium mix (more tensors kept at higher precision) |
| Q5_K_S | 5.5 avg | Medium-High | k-quant, "S" = small mix; ~3% smaller, slight quality drop |
| Q4_K_M | 4.85 avg | Sweet spot | The default recommendation for good reason |
| Q4_K_S | 4.5 avg | Medium | Smaller than Q4_K_M, somewhat lower quality |
| IQ4_XS | 4.25 avg | Medium | "Importance-weighted" — newer quantization, often beats Q4_K_S |
| Q3_K_M | 3.9 avg | Low | Notable quality drop; for VRAM-starved scenarios |
| Q2_K | 2.6 avg | Avoid | Substantial degradation |
The _K_ family (k-quants) groups weights into super-blocks with their own scales, and its mixed variants give different tensors different precision based on importance. The _M and _S suffixes refer to medium vs small mixes: _M keeps more of the sensitive tensors at a higher-precision type, so it is slightly larger. IQ-quants ("importance quants") use additional calibration data to preserve quality at lower bit rates.
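The bits-per-weight column translates into file size with simple arithmetic. The sketch below assumes roughly 8.03 billion parameters for Llama 3.1 8B (an approximation not stated in this article) and uses the averages from the table above. Treat the results as estimates: some tensors are stored at higher precision than the headline figure, so real files can be a little larger.

```python
# Rough file-size estimate from average bits per weight.
# Assumption: ~8.03e9 parameters for Llama 3.1 8B (approximate figure).
LLAMA_31_8B_PARAMS = 8.03e9

# Averages taken from the table above.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q4_K_S": 4.5,
    "IQ4_XS": 4.25,
    "Q3_K_M": 3.9,
}


def estimate_size_gb(n_params, bits_per_weight):
    """Estimated size in GB (10^9 bytes) for a model at a given average bit width."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bpw in BITS_PER_WEIGHT.items():
        size = estimate_size_gb(LLAMA_31_8B_PARAMS, bpw)
        print(f"{name}: about {size:.1f} GB")
```

For Q4_K_M this gives about 4.9 GB, in line with the measured file size reported below.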
Test Setup
Models:
- Llama 3.1 8B Instruct (quants from 4.0 to 8.5 GB)
- Qwen 3 14B Instruct (quants from 7.0 to 14.5 GB)
Hardware:
- RTX 3090 24 GB (Ampere, tensor cores)
- GTX 1080 Ti 11 GB (Pascal, no tensor cores)
Software: llama.cpp build b3500+ via Ollama 0.6.x
Tests:
- Perplexity on wikitext-2 (English) and a Korean corpus (Wikipedia-ko subset) — lower is better
- MMLU-Pro accuracy (10-question sample × 14 subjects = 140 questions) — higher is better
- Tokens/sec (256 prompt, 256 generation, batch=1)
- VRAM used (nvidia-smi peak during inference)
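A 140-question sample is small, so it helps to know how much of an accuracy difference is resolvable. The sketch below computes the half-width of a 95% confidence interval using the normal approximation to a binomial proportion. It assumes the questions are independent, which is a simplification.

```python
import math


def accuracy_margin(accuracy, n_questions, z=1.96):
    """Half-width of a normal-approximation confidence interval for an accuracy.

    accuracy is a fraction in [0, 1]; z=1.96 corresponds to roughly 95%.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


if __name__ == "__main__":
    n = 140
    margin = accuracy_margin(0.471, n)
    print(f"one question is worth {100.0 / n:.2f} points")
    print(f"95% margin at 47.1% accuracy: +/- {100.0 * margin:.1f} points")
```

At this sample size the margin is about 8 points, so sub-point MMLU-Pro differences between neighboring quants are best read as directional. Perplexity, which averages over many more tokens, is the steadier signal.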
Llama 3.1 8B Results
Quality (perplexity + MMLU-Pro)
| Quant | Size | wikitext PPL | KR PPL | MMLU-Pro |
|---|---|---|---|---|
| Q8_0 (baseline) | 8.5 GB | 6.42 | 8.71 | 47.1% |
| Q5_K_M | 5.7 GB | 6.45 (+0.5%) | 8.78 (+0.8%) | 46.4% |
| Q5_K_S | 5.5 GB | 6.48 (+0.9%) | 8.82 (+1.3%) | 46.0% |
| Q4_K_M | 4.9 GB | 6.51 (+1.4%) | 8.91 (+2.3%) | 45.7% |
| IQ4_XS | 4.5 GB | 6.54 (+1.9%) | 8.95 (+2.8%) | 45.5% |
| Q4_K_S | 4.7 GB | 6.58 (+2.5%) | 9.02 (+3.6%) | 45.1% |
| Q3_K_M | 4.0 GB | 6.83 (+6.4%) | 9.45 (+8.5%) | 42.8% |
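For reference, perplexity is the exponential of the average negative log-likelihood per token, and the percentages in parentheses are gaps relative to the Q8_0 baseline. A minimal sketch of both calculations follows; the function names are illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_gap_pct(ppl_quant, ppl_baseline):
    """Percentage by which a quant's perplexity exceeds the baseline's."""
    return (ppl_quant / ppl_baseline - 1.0) * 100.0


if __name__ == "__main__":
    # Llama 3.1 8B wikitext values from the table above.
    baseline = 6.42
    for name, ppl in [("Q5_K_M", 6.45), ("Q4_K_M", 6.51), ("Q3_K_M", 6.83)]:
        print(f"{name}: +{relative_gap_pct(ppl, baseline):.1f}% vs Q8_0")
```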
Speed on RTX 3090
| Quant | tokens/sec | VRAM total (w/ 4K ctx) |
|---|---|---|
| Q8_0 | 86 | 9.8 GB |
| Q5_K_M | 92 | 7.0 GB |
| Q5_K_S | 94 | 6.8 GB |
| Q4_K_M | 96 | 6.2 GB |
| IQ4_XS | 88 | 5.8 GB |
| Q4_K_S | 99 | 6.0 GB |
| Q3_K_M | 102 | 5.3 GB |
Speed on GTX 1080 Ti (no tensor cores)
| Quant | tokens/sec | VRAM total |
|---|---|---|
| Q8_0 | 19 | 9.8 GB |
| Q5_K_M | 21 | 7.0 GB |
| Q5_K_S | 22 | 6.8 GB |
| Q4_K_M | 25 | 6.2 GB |
| IQ4_XS | 19 | 5.8 GB |
| Q4_K_S | 26 | 6.0 GB |
| Q3_K_M | 28 | 5.3 GB |
Notable observations
- IQ4_XS is slower than Q4_K_M on Pascal (-24%) despite using less VRAM. IQ-quant decoding does extra lookup-table work per weight, which the 1080 Ti absorbs poorly. On Ampere the gap is only -8%, much more competitive.
- Q4_K_M is the speed-quality sweet spot on most hardware. It loses only 1.4-2.3% on perplexity, fits in less VRAM than Q5, and runs within about 4% of Q4_K_S.
- Q3_K_M's 6.4% perplexity jump is noticeable in generation. Below the 4-bit tier, output starts feeling "off": repeats, small factual errors. At Q4_K_M and above, quality is hard to distinguish from Q8_0 for most use.
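The speed comparison is easiest to read as a ratio against Q4_K_M on each card. This short sketch recomputes those ratios from the Llama 3.1 8B tokens/sec tables above; the dictionary layout is only an illustration.

```python
# Tokens/sec for Llama 3.1 8B, copied from the two speed tables above.
TOKENS_PER_SEC = {
    "RTX 3090": {"Q4_K_M": 96, "IQ4_XS": 88, "Q4_K_S": 99},
    "GTX 1080 Ti": {"Q4_K_M": 25, "IQ4_XS": 19, "Q4_K_S": 26},
}


def relative_speed_pct(gpu, quant, reference="Q4_K_M"):
    """Speed of `quant` as a percentage of the reference quant on the same GPU."""
    speeds = TOKENS_PER_SEC[gpu]
    return 100.0 * speeds[quant] / speeds[reference]


if __name__ == "__main__":
    for gpu in TOKENS_PER_SEC:
        for quant in ("IQ4_XS", "Q4_K_S"):
            pct = relative_speed_pct(gpu, quant)
            print(f"{gpu}: {quant} runs at {pct:.0f}% of Q4_K_M")
```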
Qwen 3 14B Results
(Tighter on the 1080 Ti: Q5_K_M and above run out of memory without a dual-GPU split.)
Quality
| Quant | Size | wikitext PPL | KR PPL | MMLU-Pro |
|---|---|---|---|---|
| Q8_0 | 14.5 GB | 5.18 | 6.92 | 58.4% |
| Q5_K_M | 9.7 GB | 5.22 (+0.8%) | 6.99 (+1.0%) | 57.6% |
| Q4_K_M | 8.4 GB | 5.29 (+2.1%) | 7.11 (+2.7%) | 57.0% |
| IQ4_XS | 7.6 GB | 5.32 (+2.7%) | 7.18 (+3.8%) | 56.6% |
| Q4_K_S | 8.0 GB | 5.37 (+3.7%) | 7.25 (+4.8%) | 56.1% |
| Q3_K_M | 7.0 GB | 5.61 (+8.3%) | 7.62 (+10.1%) | 53.2% |
Speed on RTX 3090
| Quant | tokens/sec | VRAM (4K ctx) |
|---|---|---|
| Q8_0 | 41 | 15.8 GB |
| Q5_K_M | 48 | 11.2 GB |
| Q4_K_M | 52 | 9.8 GB |
| IQ4_XS | 46 | 9.2 GB |
| Q4_K_S | 53 | 9.5 GB |
| Q3_K_M | 56 | 8.6 GB |
Speed on GTX 1080 Ti (single card)
| Quant | tokens/sec | Fit? |
|---|---|---|
| Q8_0 | — | OOM |
| Q5_K_M | — | OOM (11.2 GB) |
| Q4_K_M | 13 | Just fits (9.8 GB) |
| IQ4_XS | 11 | Fits (9.2 GB) |
| Q4_K_S | 14 | Fits (9.5 GB) |
| Q3_K_M | 16 | Comfortable (8.6 GB) |
For a single 1080 Ti, a 4-bit quant is essentially required for Qwen 3 14B: Q5_K_M and above run out of memory, and Q4_K_M is the highest-quality option that still fits.
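The fit column can be reproduced with a simple budget check. The sketch below uses the measured 4K-context VRAM figures for Qwen 3 14B from the RTX 3090 table and a hypothetical 1 GB reserve for the driver and other processes; that reserve is an assumption chosen for illustration, not a measured value.

```python
# Measured VRAM at 4K context for Qwen 3 14B, from the RTX 3090 table above.
QWEN3_14B_VRAM_GB = {
    "Q8_0": 15.8,
    "Q5_K_M": 11.2,
    "Q4_K_M": 9.8,
    "IQ4_XS": 9.2,
    "Q4_K_S": 9.5,
    "Q3_K_M": 8.6,
}


def quants_that_fit(vram_needed, card_gb, reserve_gb=1.0):
    """Return quant names whose measured VRAM fits on the card.

    reserve_gb is a hypothetical allowance for driver and desktop overhead.
    """
    budget = card_gb - reserve_gb
    return [quant for quant, need in vram_needed.items() if need <= budget]


if __name__ == "__main__":
    print("GTX 1080 Ti (11 GB):", quants_that_fit(QWEN3_14B_VRAM_GB, 11.0))
    print("RTX 3090 (24 GB):", quants_that_fit(QWEN3_14B_VRAM_GB, 24.0))
```

Raising the context length increases the KV cache and shrinks the list further, so rerun the check with VRAM figures measured at the context you actually use.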
When Each Quant Wins
Pick Q8_0 when:
- You have VRAM to spare (Q8_0 needed 9.8 GB for the 8B and 15.8 GB for the 14B at 4K context)
- Quality is critical (production deployments, sensitive evals)
- The 1.4-2.1% perplexity gap between Q4_K_M and Q8_0 matters
Pick Q5_K_M when:
- Q8 doesn't fit but you have headroom for the medium tier
- The roughly 1% perplexity edge over Q4_K_M (0.9% on Llama 3.1 8B, 1.3% on Qwen 3 14B) matters for your downstream task
- Sweet spot when you have 16 GB+ VRAM
Pick Q4_K_M when (default for most cases):
- VRAM is the constraint
- You need long context (saves VRAM for KV cache)
- Compute is decent (RTX 3090+ or other modern hardware); it is also the right pick on Pascal, where IQ-quants are slow
Pick IQ4_XS when:
- VRAM extremely tight (the IQ tier saves another 5-10%)
- You're on Ampere+ hardware (Pascal pays a speed penalty)
- For 70B+ models trying to fit on 2× 24 GB or single 48 GB
Avoid Q3 and below when:
- Quality matters in any user-facing way
- You have any reasonable alternative
- In short, treat Q3 as an emergency option for when the model is bigger than your VRAM
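The rules in this section can be condensed into a small decision helper. This is one possible encoding, not a definitive policy: `pick_quant` and its parameters are hypothetical, and the fallback ordering when Q4_K_M does not fit is an interpretation of the guidance above.

```python
def pick_quant(fits, ampere_or_newer, quality_critical=False):
    """Pick a quant name from the set of quants that fit in VRAM.

    fits: set of quant names that fit at the desired context length.
    ampere_or_newer: False for Pascal-class cards, where IQ-quants are slow.
    quality_critical: True when output quality outweighs speed and VRAM.
    Returns None when nothing in the known list fits.
    """
    if quality_critical:
        for choice in ("Q8_0", "Q5_K_M"):
            if choice in fits:
                return choice
    if "Q4_K_M" in fits:
        return "Q4_K_M"
    # Below Q4_K_M: prefer IQ4_XS on Ampere+, Q4_K_S on Pascal (an interpretation).
    if ampere_or_newer:
        fallbacks = ("IQ4_XS", "Q4_K_S", "Q3_K_M")
    else:
        fallbacks = ("Q4_K_S", "IQ4_XS", "Q3_K_M")
    for choice in fallbacks:
        if choice in fits:
            return choice
    return None


if __name__ == "__main__":
    single_1080ti = {"Q4_K_M", "IQ4_XS", "Q4_K_S", "Q3_K_M"}
    print(pick_quant(single_1080ti, ampere_or_newer=False))
    print(pick_quant({"IQ4_XS", "Q3_K_M"}, ampere_or_newer=True))
```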
The Pascal-Specific Caveat
GTX 1080 Ti, P40, and other Pascal GPUs pay a much larger speed penalty for IQ-quants, whose decoding does extra lookup-table work per weight. Measured on the 1080 Ti across both models:
| Quant | Pascal speed (% of Q4_K_M speed) |
|---|---|
| Q4_K_M | 100% |
| IQ4_XS | 76-85% |
| Q4_K_S | 104-108% |
Translation: on a 1080 Ti or P40, prefer Q4_K_M over IQ-quants even when IQ is smaller. Save the IQ-quants for Ampere+, where the speed penalty shrinks to roughly 8-12%.
For dual-GPU 1080 Ti setups (see Ollama Dual GPU Without NVLink), the rule holds — Q4_K_M with split-mode layer is fastest.
Quality Beyond Benchmarks
Perplexity and MMLU-Pro capture some aspects of quality but miss others. Three important nuances:
Long-form generation degradation
At Q4_K_M and above, generations of 500-2000 tokens are essentially indistinguishable from Q8 in human judgment. At Q3_K_M, you start seeing:
- Subtle factual drift (proper nouns slightly wrong)
- Increased repetition rate
- Loss of nuance in long arguments
The PPL number doesn't capture this well — it's an average across tokens. The long-generation pattern is what your users notice.
Multilingual quality
Non-English languages tend to degrade faster at low quantization. Korean perplexity gaps (8.71 → 8.91 for Llama 3.1 Q4_K_M = +2.3%) are larger than English (6.42 → 6.51 = +1.4%). For Qwen 3 14B, KR PPL gap is +2.7% vs English +2.1%.
If you're serving non-English at low quant, test on real samples in your target language — don't trust only English perplexity.
Coding accuracy
For code generation specifically, quality drops more at low quants than for natural language. Q4_K_M for coding is usually OK; Q3 produces noticeably more syntax errors and logical bugs. If coding is your primary use, stay at Q5_K_M or higher when possible.
Disk Space Reality
If you keep multiple quants of the same model around (a common LocalLLaMA pattern), the disk math adds up fast:
| Model (sizes in GB) | Q8_0 | Q5_K_M | Q4_K_M | IQ4_XS | Q3_K_M | Q2_K |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | 8.5 | 5.7 | 4.9 | 4.5 | 4.0 | 3.2 |
| Qwen 3 14B | 14.5 | 9.7 | 8.4 | 7.6 | 7.0 | 5.4 |
| Mixtral 8×7B | 47 | 32 | 27 | 25 | 22 | 17 |
| Llama 3.1 70B | 75 | 50 | 42 | 38 | 34 | 26 |
Keeping one Q8_0 for high-quality work plus one Q4_K_M for routine use is a sensible disk policy. An intermediate Q5_K_M is hard to justify keeping on top of those two.
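To see what that policy saves, the sketch below totals the table for two scenarios: keeping only Q8_0 and Q4_K_M of every model, versus keeping all six quants. The helper name is illustrative.

```python
# File sizes in GB, copied from the table above.
SIZES_GB = {
    "Llama 3.1 8B": {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.9,
                     "IQ4_XS": 4.5, "Q3_K_M": 4.0, "Q2_K": 3.2},
    "Qwen 3 14B": {"Q8_0": 14.5, "Q5_K_M": 9.7, "Q4_K_M": 8.4,
                   "IQ4_XS": 7.6, "Q3_K_M": 7.0, "Q2_K": 5.4},
    "Mixtral 8x7B": {"Q8_0": 47, "Q5_K_M": 32, "Q4_K_M": 27,
                     "IQ4_XS": 25, "Q3_K_M": 22, "Q2_K": 17},
    "Llama 3.1 70B": {"Q8_0": 75, "Q5_K_M": 50, "Q4_K_M": 42,
                      "IQ4_XS": 38, "Q3_K_M": 34, "Q2_K": 26},
}


def disk_needed_gb(kept_quants):
    """Total disk space if every model is stored in each of the kept quants."""
    return sum(sizes[q] for sizes in SIZES_GB.values() for q in kept_quants)


if __name__ == "__main__":
    two_quant_policy = disk_needed_gb(["Q8_0", "Q4_K_M"])
    everything = disk_needed_gb(list(SIZES_GB["Llama 3.1 8B"]))
    print(f"Q8_0 + Q4_K_M only: {two_quant_policy:.1f} GB")
    print(f"all six quants: {everything:.1f} GB")
```

For these four models the two-quant policy needs about 227 GB, against about 518 GB for keeping everything.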
FAQ
Q: I-quants (IQ1, IQ2, IQ3) — when to use? For 70B+ models on consumer hardware where even Q4_K_M won't fit. IQ2_XS lets Llama 70B fit in 24 GB but quality degrades noticeably. Useful when alternative is "can't run at all"; not for daily use.
Q: Does the model architecture matter? MoE vs dense? Yes. MoE models (Mixtral, Qwen3-30B-A3B) tend to degrade differently at low quant — the routing decisions are sensitive. Stay at Q4_K_M or higher for MoE; Q3 is risky.
Q: Q4_K_M vs Q4_0 (older format)? Q4_K_M is uniformly better than Q4_0 (the legacy fixed-bit format). Q4_0 only exists for backward compatibility. Don't use Q4_0 in 2026.
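Q: What about IQ4_NL or other exotic quants? IQ4_NL ("non-linear", sometimes written Q4_NL) is a less commonly used format, and its performance is less predictable across backends. IQ4_XS is a more compact variant of the same non-linear idea. Stick to the mainstream K-quants and IQ-quants for production.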
Q: How does AWQ or GPTQ compare? AWQ and GPTQ (from the vLLM/Transformers world) are different quantization families using calibration data. For pure inference quality at 4-bit, AWQ slightly edges Q4_K_M on Ampere+. For inference speed and llama.cpp compatibility, Q4_K_M wins. AWQ is typically served through vLLM, which doesn't support Pascal.
Q: KV cache quantization (q8_0 KV) — does that change the picture? Yes — for long context, quantizing the KV cache to q8_0 saves substantial VRAM at minor quality cost. See Llama.cpp KV Cache Quantization for Old GPUs for the deep dive.
Q: Why does Q4_K_M sometimes outperform Q5_K_S in benchmarks? Mixed precision within K-quants. Q4_K_M keeps some of the most sensitive tensors at a higher-precision type (Q6_K), while Q5_K_S uses 5-bit blocks for nearly every tensor. For some models the Q4_K_M mix is the better-calibrated of the two.
Q: Does this comparison hold for 70B+ models? Pattern holds qualitatively. For 70B specifically, Q4_K_M perplexity gap from Q8 is smaller (~1.0% on English) — larger models are more quantization-robust. IQ-quants become more attractive at 70B because absolute VRAM savings are bigger.
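The KV cache question above can be made concrete with a size estimate. The sketch assumes Llama 3.1 8B's published architecture (32 layers, 8 KV heads, head dimension 128; these figures are not stated in this article) and treats q8_0 as 34 bytes per 32 elements. It ignores any runtime padding or extra buffers.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_element):
    """Approximate KV cache size in GiB: keys and values for every layer and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len  # 2 = K and V
    return elements * bytes_per_element / 2**30


# Assumption: Llama 3.1 8B architecture values.
LLAMA_31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values plus one 16-bit scale per block

if __name__ == "__main__":
    for ctx in (4096, 32768):
        f16 = kv_cache_gib(context_len=ctx, bytes_per_element=F16_BYTES, **LLAMA_31_8B)
        q8 = kv_cache_gib(context_len=ctx, bytes_per_element=Q8_0_BYTES, **LLAMA_31_8B)
        print(f"ctx {ctx}: f16 {f16:.2f} GiB, q8_0 {q8:.2f} GiB")
```

Under these assumptions the cache is about 0.5 GiB at 4K context in f16 and grows linearly with context, which is why the savings only become significant for long prompts.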
Closing — The Default Rules
For 2026 LocalLLaMA users:
- RTX 3090 / 4090 with VRAM headroom: Q5_K_M or Q4_K_M
- RTX 3090 / 4090 tight on VRAM: Q4_K_M (sweet spot)
- 2× GTX 1080 Ti split: Q4_K_M (IQ slower on Pascal)
- Single GTX 1080 Ti for 13B+: Q4_K_M, no IQ
- 70B on 24 GB: IQ2-class quants such as IQ2_XS (IQ4_XS is 38 GB for Llama 3.1 70B and will not fit; quality is the tradeoff)
Default for 95% of cases: Q4_K_M. Step up to Q5_K_M when VRAM allows and quality matters. Step down to IQ4_XS only on Ampere when VRAM is desperately tight. Q3 and below are emergency-only.
Related posts:
- Running Modern LLMs on GTX 1080 Ti in 2026 — What Still Works
- Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti
- llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition
- Best Ollama Models for RTX 3090 24GB in 2026: Real Benchmarks
- LLM VRAM Calculator
References:
- llama.cpp quantization documentation: https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md
- K-quants paper trail: https://github.com/ggerganov/llama.cpp/pull/1684
- IQ-quants introduction: https://github.com/ggerganov/llama.cpp/pull/4773
- LocalLLaMA quantization threads, 2024-2026