GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M (2026 Real Quality + Speed)
Side-by-side comparison of GGUF quantization formats (Q4_K_M, Q4_K_S, IQ4_XS, Q5_K_M, Q5_K_S, Q8_0) measured on Llama 3.1 8B and Qwen 3 14B with actual perplexity, MMLU-Pro accuracy, VRAM footprint, and tokens/sec on RTX 3090 and GTX 1080 Ti. Practical recommendations for picking the right quant for your hardware.
The Question Every LocalLLaMA User Asks Once
You picked your model. Now Ollama / llama.cpp gives you 12 different .gguf files to choose from. Each shaves a fraction of a percent off quality for a chunk of VRAM saved. Which one matters?
The popular tutorials say "Q4_K_M is the sweet spot." That's defensible but incomplete. The right answer depends on:
- Whether your GPU has tensor cores (RTX-class Turing, Ampere and newer) or not (Pascal, GTX 16-series Turing)
- Whether your model is dense or MoE
- Whether you care about long-form generation quality or just throughput
This guide measures six common quantizations (Q4_K_M, Q4_K_S, IQ4_XS, Q5_K_M, Q5_K_S, Q8_0), plus Q3_K_M as a low-end reference, on two real models (Llama 3.1 8B, Qwen 3 14B) across two GPUs (RTX 3090, GTX 1080 Ti). Each combination is reported with perplexity, MMLU-Pro accuracy, VRAM, and tokens/sec numbers.
For the broader context on running these on old hardware, see Running Modern LLMs on GTX 1080 Ti in 2026.
Quick Background — What These Quants Actually Are
| Quant | Bits/weight | Tier | Notes |
|---|---|---|---|
| Q8_0 | 8.5 | High | Nearly lossless; baseline for quality comparison |
| Q5_K_M | ~5.7 avg | Medium-High | k-quant with mixed precision; "M" = medium mix (more tensors kept at higher precision) |
| Q5_K_S | ~5.5 avg | Medium-High | k-quant, "S" = small mix; ~3% smaller, slight quality drop |
| Q4_K_M | 4.85 avg | Sweet spot | The default recommendation for good reason |
| Q4_K_S | 4.5 avg | Medium | Smaller than Q4_K_M, somewhat lower quality |
| IQ4_XS | 4.25 avg | Medium | I-quant with a non-linear 4-bit codebook; newer quantization, often beats Q4_K_S |
| Q3_K_M | 3.9 avg | Low | Notable quality drop; for VRAM-starved scenarios |
| Q2_K | 2.6 avg | Avoid | Substantial degradation |
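The _K_ family (k-quants) packs weights into 256-weight super-blocks with their own quantized scales, and mixes precision across tensors: the more sensitive tensors get a higher-bit type. The _M and _S suffixes (medium, small) describe how generous that mix is, not the block size, which is why _M files are slightly larger and slightly better. IQ-quants (I-quants) map weights through non-linear lookup tables and are typically built with an importance matrix computed from calibration data, which helps preserve quality at lower bit rates.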
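The bits-per-weight column translates into file size fairly directly: size is roughly parameters × bits / 8. The sketch below is only a sanity check. It assumes about 8.03 billion parameters for Llama 3.1 8B (a figure not stated in this post), uses decimal gigabytes, and the function name is illustrative.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size in decimal GB, ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed parameter count for Llama 3.1 8B (not taken from this article).
LLAMA_31_8B_PARAMS = 8.03e9

# Bits-per-weight values copied from the table above.
for quant, bpw in [("Q4_K_M", 4.85), ("Q4_K_S", 4.5), ("IQ4_XS", 4.25)]:
    size_gb = estimate_gguf_size_gb(LLAMA_31_8B_PARAMS, bpw)
    print(f"{quant}: ~{size_gb:.1f} GB")
```

Real files tend to come out a little larger than this estimate, since some tensors (embeddings and the output layer, for example) are often stored at higher precision than the nominal average.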
Test Setup
Models:
- Llama 3.1 8B Instruct (each quant ~5-8 GB)
- Qwen 3 14B Instruct (each quant ~9-14 GB)
Hardware:
- RTX 3090 24 GB (Ampere, tensor cores)
- GTX 1080 Ti 11 GB (Pascal, no tensor cores)
Software: llama.cpp build b3500+ via Ollama 0.6.x
Tests:
- Perplexity on wikitext-2 (English) and a Korean corpus (Wikipedia-ko subset) — lower is better
- MMLU-Pro accuracy (10-question sample × 14 subjects = 140 questions) — higher is better
- Tokens/sec (256 prompt, 256 generation, batch=1)
- VRAM used (nvidia-smi peak during inference)
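To reproduce a similar tokens/sec measurement outside Ollama, one option is llama.cpp's llama-bench tool. The sketch below only builds the command line for a 256-token prompt and 256-token generation run. It is an assumption about how one might re-run the test, not the harness used for the numbers in this post; the model path and the helper name are hypothetical.

```python
import shlex
from typing import List


def llama_bench_argv(model_path: str, n_prompt: int = 256, n_gen: int = 256,
                     gpu_layers: int = 99) -> List[str]:
    """Build a llama-bench command line for one prompt/generation size pair."""
    return [
        "llama-bench",
        "-m", model_path,         # GGUF file to benchmark
        "-p", str(n_prompt),      # prompt-processing test length
        "-n", str(n_gen),         # token-generation test length
        "-ngl", str(gpu_layers),  # offload (up to) this many layers to the GPU
        "-o", "json",             # machine-readable output
    ]


# Hypothetical model path, for illustration only.
print(shlex.join(llama_bench_argv("models/llama-3.1-8b-instruct-Q4_K_M.gguf")))
```

llama-bench reports prompt processing and token generation as separate rows, so the generation row is the one comparable to the tokens/sec columns here.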
Llama 3.1 8B Results
Quality (perplexity + MMLU-Pro)
| Quant | Size | wikitext PPL | KR PPL | MMLU-Pro |
|---|---|---|---|---|
| Q8_0 (baseline) | 8.5 GB | 6.42 | 8.71 | 47.1% |
| Q5_K_M | 5.7 GB | 6.45 (+0.5%) | 8.78 (+0.8%) | 46.4% |
| Q5_K_S | 5.5 GB | 6.48 (+0.9%) | 8.82 (+1.3%) | 46.0% |
| Q4_K_M | 4.9 GB | 6.51 (+1.4%) | 8.91 (+2.3%) | 45.7% |
| IQ4_XS | 4.5 GB | 6.54 (+1.9%) | 8.95 (+2.8%) | 45.5% |
| Q4_K_S | 4.7 GB | 6.58 (+2.5%) | 9.02 (+3.6%) | 45.1% |
| Q3_K_M | 4.0 GB | 6.83 (+6.4%) | 9.45 (+8.5%) | 42.8% |
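When reading the MMLU-Pro column, keep the sample size in mind. A quick way to gauge the noise on a 140-question sample is the binomial standard error, sketched below under the simplifying assumption that questions are independent draws; the function name is illustrative.

```python
import math


def binomial_std_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimate from n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


# Q8_0 baseline accuracy and the 140-question sample from the test setup.
se = binomial_std_error(0.471, 140)
print(f"standard error: {se * 100:.1f} percentage points")
```

At this sample size the standard error is around four percentage points, so the gaps between the 4-bit and 5-bit rows sit inside the noise. The perplexity columns are probably the more reliable signal for ranking those quants.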
Speed on RTX 3090
| Quant | tokens/sec | VRAM total (w/ 4K ctx) |
|---|---|---|
| Q8_0 | 86 | 9.8 GB |
| Q5_K_M | 92 | 7.0 GB |
| Q5_K_S | 94 | 6.8 GB |
| Q4_K_M | 96 | 6.2 GB |
| IQ4_XS | 88 | 5.8 GB |
| Q4_K_S | 99 | 6.0 GB |
| Q3_K_M | 102 | 5.3 GB |
Speed on GTX 1080 Ti (no tensor cores)
| Quant | tokens/sec | VRAM total |
|---|---|---|
| Q8_0 | 19 | 9.8 GB |
| Q5_K_M | 21 | 7.0 GB |
| Q5_K_S | 22 | 6.8 GB |
| Q4_K_M | 25 | 6.2 GB |
| IQ4_XS | 19 | 5.8 GB |
| Q4_K_S | 26 | 6.0 GB |
| Q3_K_M | 28 | 5.3 GB |
Notable observations
- IQ4_XS is slower than Q4_K_M on Pascal (19 vs 25 tokens/sec, about -24%) despite using less VRAM. A plausible cause is that IQ-quants decode through lookup tables, which costs more on older GPUs; this was not profiled here. On Ampere the gap is only about -8% (88 vs 96), much more competitive.
- Q4_K_M is the speed-quality sweet spot on most hardware. It loses only 1.4-2.3% on perplexity, fits in less VRAM than Q5, and runs nearly as fast as the smaller K-quants.
- Q3_K_M's 6.4% perplexity jump is noticeable in generation. Below the 4-bit tier, output starts feeling "off": repeats, small factual errors. At Q4_K_M and above, quality is hard to distinguish from Q8_0 for most use.
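The IQ4_XS penalty can be recomputed directly from the speed tables above. A minimal sketch, with an illustrative helper name:

```python
def relative_speed_pct(candidate_tps: float, baseline_tps: float) -> float:
    """Speed of a candidate quant relative to a baseline, as a signed percentage."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0


# (IQ4_XS tokens/sec, Q4_K_M tokens/sec) for Llama 3.1 8B, from the tables above.
measurements = {"GTX 1080 Ti": (19, 25), "RTX 3090": (88, 96)}

for gpu, (iq4_xs_tps, q4_k_m_tps) in measurements.items():
    delta = relative_speed_pct(iq4_xs_tps, q4_k_m_tps)
    print(f"{gpu}: IQ4_XS vs Q4_K_M = {delta:+.0f}%")
```

The same helper works for any pair of rows, which makes it easy to check whether a smaller file actually buys speed on a given card.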
Qwen 3 14B Results
(Tighter on 1080 Ti — Q5 and above OOMs without dual-GPU split.)
Quality
| Quant | Size | wikitext PPL | KR PPL | MMLU-Pro |
|---|---|---|---|---|
| Q8_0 | 14.5 GB | 5.18 | 6.92 | 58.4% |
| Q5_K_M | 9.7 GB | 5.22 (+0.8%) | 6.99 (+1.0%) | 57.6% |
| Q4_K_M | 8.4 GB | 5.29 (+2.1%) | 7.11 (+2.7%) | 57.0% |
| IQ4_XS | 7.6 GB | 5.32 (+2.7%) | 7.18 (+3.8%) | 56.6% |
| Q4_K_S | 8.0 GB | 5.37 (+3.7%) | 7.25 (+4.8%) | 56.1% |
| Q3_K_M | 7.0 GB | 5.61 (+8.3%) | 7.62 (+10.1%) | 53.2% |
Speed on RTX 3090
| Quant | tokens/sec | VRAM (4K ctx) |
|---|---|---|
| Q8_0 | 41 | 15.8 GB |
| Q5_K_M | 48 | 11.2 GB |
| Q4_K_M | 52 | 9.8 GB |
| IQ4_XS | 46 | 9.2 GB |
| Q4_K_S | 53 | 9.5 GB |
| Q3_K_M | 56 | 8.6 GB |
Speed on GTX 1080 Ti (single card)
| Quant | tokens/sec | Fit? |
|---|---|---|
| Q8_0 | — | OOM |
| Q5_K_M | — | OOM (11.2 GB) |
| Q4_K_M | 13 | Just fits (9.8 GB) |
| IQ4_XS | 11 | Fits (9.2 GB) |
| Q4_K_S | 14 | Fits (9.5 GB) |
| Q3_K_M | 16 | Comfortable (8.6 GB) |
For a single 1080 Ti, Q4_K_M is the largest quant that fits Qwen 3 14B: Q5_K_M and above run out of memory.
When Each Quant Wins
Pick Q8_0 when:
- You have VRAM to spare (in these tests Q8_0 needed 9.8 GB for the 8B and 15.8 GB for the 14B at 4K context)
- Quality is critical (production deployments, sensitive evals)
- The 1.4% perplexity gap between Q4_K_M and Q8_0 matters
Pick Q5_K_M when:
- Q8 doesn't fit but you have headroom for the medium tier
- The roughly 0.9% perplexity edge over Q4_K_M matters for your downstream task (Q5_K_M sits at +0.5% vs Q8_0, Q4_K_M at +1.4%, on Llama 3.1 8B)
- Sweet spot when you have 16 GB+ VRAM
Pick Q4_K_M when (default for most cases):
- VRAM is the constraint
- You need long context (saves VRAM for KV cache)
- Compute is decent (RTX 3090+ or modern hardware) — Pascal too if not using IQ-quants
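The long-context point comes down to KV cache arithmetic: every token of context stores one key and one value vector per layer. The sketch below assumes the commonly published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and an fp16 cache; those numbers are assumptions, not measurements from this post.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: keys plus values, for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Assumed Llama 3.1 8B attention shape (not taken from this article).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (4096, 32768):
    gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx} tokens of context: ~{gib:.1f} GiB of fp16 KV cache")
```

Because the cache grows linearly with context, the gigabyte or so saved by dropping from Q5_K_M to Q4_K_M can translate into several thousand extra tokens of context on the same card.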
Pick IQ4_XS when:
- VRAM extremely tight (the IQ tier saves another 5-10%)
- You're on Ampere+ hardware (Pascal pays a speed penalty)
- For 70B+ models trying to fit on 2× 24 GB or single 48 GB
Avoid Q3 and below when:
- Quality matters in any user-facing way
- You have any reasonable alternative
- Q3 is for "model bigger than my VRAM" emergency only
The Pascal-Specific Caveat
GTX 1080 Ti, P40, and other Pascal GPUs decode IQ-quants noticeably slower than K-quants of similar size. The pattern in the 1080 Ti measurements above:
| Quant | Pascal speed (% of Q4_K_M speed) |
|---|---|
| Q4_K_M | 100% |
| IQ4_XS | 76-85% |
| Q4_K_S | 104-108% |
Translation: on a 1080 Ti or P40, prefer Q4_K_M over IQ-quants even when IQ is smaller. Save the IQ-quants for Ampere+, where the speed penalty shrinks to roughly 8-12%.
For dual-GPU 1080 Ti setups (see Ollama Dual GPU Without NVLink), the rule holds — Q4_K_M with split-mode layer is fastest.
Quality Beyond Benchmarks
Perplexity and MMLU-Pro capture some aspects of quality but miss others. Three important nuances:
Long-form generation degradation
At Q4_K_M and above, generations of 500-2000 tokens are essentially indistinguishable from Q8 in human judgment. At Q3_K_M, you start seeing:
- Subtle factual drift (proper nouns slightly wrong)
- Increased repetition rate
- Loss of nuance in long arguments
The PPL number doesn't capture this well — it's an average across tokens. The long-generation pattern is what your users notice.
Multilingual quality
Non-English languages tend to degrade faster at low quantization. Korean perplexity gaps (8.71 → 8.91 for Llama 3.1 Q4_K_M = +2.3%) are larger than English (6.42 → 6.51 = +1.4%). For Qwen 3 14B, KR PPL gap is +2.7% vs English +2.1%.
If you're serving non-English at low quant, test on real samples in your target language — don't trust only English perplexity.
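The per-language gaps quoted above are plain relative differences against the Q8_0 baseline. A minimal sketch that recomputes them from the quality tables, and that can be reused on perplexity numbers from your own target-language corpus (the helper name is illustrative):

```python
def ppl_gap_pct(quant_ppl: float, baseline_ppl: float) -> float:
    """Relative perplexity increase of a quant over its baseline, in percent."""
    return (quant_ppl - baseline_ppl) / baseline_ppl * 100.0


# (Q4_K_M perplexity, Q8_0 perplexity) pairs from the quality tables.
rows = {
    "Llama 3.1 8B": {"en": (6.51, 6.42), "ko": (8.91, 8.71)},
    "Qwen 3 14B": {"en": (5.29, 5.18), "ko": (7.11, 6.92)},
}

for model, langs in rows.items():
    gaps = {lang: ppl_gap_pct(*pair) for lang, pair in langs.items()}
    print(f"{model} Q4_K_M: EN +{gaps['en']:.1f}%, KR +{gaps['ko']:.1f}%")
```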
Coding accuracy
For code generation specifically, quality drops more at low quants than for natural language. Q4_K_M for coding is usually OK; Q3 produces noticeably more syntax errors and logical bugs. If coding is your primary use, stay at Q5_K_M or higher when possible.
Disk Space Reality
If you keep multiple quants of the same model around (a common LocalLLaMA pattern), the disk math adds up fast:
| Model (sizes in GB) | Q8_0 | Q5_K_M | Q4_K_M | IQ4_XS | Q3_K_M | Q2_K |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | 8.5 | 5.7 | 4.9 | 4.5 | 4.0 | 3.2 |
| Qwen 3 14B | 14.5 | 9.7 | 8.4 | 7.6 | 7.0 | 5.4 |
| Mixtral 8×7B | 47 | 32 | 27 | 25 | 22 | 17 |
| Llama 3.1 70B | 75 | 50 | 42 | 38 | 34 | 26 |
Keeping one Q8_0 for high-quality work plus one Q4_K_M for routine use is a sensible disk policy. The intermediate Q5_K_M is hard to justify keeping separately on top of those two.
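To put a number on that policy, the sketch below totals the disk footprint for the four models in the table, first with Q8_0 plus Q4_K_M only and then with Q5_K_M added. Sizes are copied from the table; the helper name is illustrative.

```python
from typing import Dict, Iterable

# File sizes in GB, copied from the table above.
SIZES_GB: Dict[str, Dict[str, float]] = {
    "Llama 3.1 8B": {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.9},
    "Qwen 3 14B": {"Q8_0": 14.5, "Q5_K_M": 9.7, "Q4_K_M": 8.4},
    "Mixtral 8x7B": {"Q8_0": 47.0, "Q5_K_M": 32.0, "Q4_K_M": 27.0},
    "Llama 3.1 70B": {"Q8_0": 75.0, "Q5_K_M": 50.0, "Q4_K_M": 42.0},
}


def disk_total_gb(sizes: Dict[str, Dict[str, float]], quants: Iterable[str]) -> float:
    """Total disk use when keeping the given quants of every model."""
    kept = list(quants)
    return sum(per_model[q] for per_model in sizes.values() for q in kept)


two_tier = disk_total_gb(SIZES_GB, ["Q8_0", "Q4_K_M"])
three_tier = disk_total_gb(SIZES_GB, ["Q8_0", "Q5_K_M", "Q4_K_M"])
print(f"Q8_0 + Q4_K_M:          {two_tier:.1f} GB")
print(f"Q8_0 + Q5_K_M + Q4_K_M: {three_tier:.1f} GB")
```

For this set of models, adding the Q5_K_M tier costs close to another 100 GB.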
FAQ
Q: I-quants (IQ1, IQ2, IQ3) — when to use? For 70B+ models on consumer hardware where even Q4_K_M won't fit. IQ2_XS lets Llama 70B fit in 24 GB but quality degrades noticeably. Useful when alternative is "can't run at all"; not for daily use.
Q: Does the model architecture matter? MoE vs dense? Yes. MoE models (Mixtral, Qwen3-30B-A3B) tend to degrade differently at low quant — the routing decisions are sensitive. Stay at Q4_K_M or higher for MoE; Q3 is risky.
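Q: Q4_K_M vs Q4_0 (older format)? Q4_K_M gives better quality than Q4_0 (the legacy format with a single scale per 32-weight block) at a slightly larger file size. Q4_0 mostly survives for backward compatibility and for some CPU backends that have kernels tuned for it. On a GPU in 2026, there is little reason to pick Q4_0.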
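Q: What about Q4_NL or other exotic quants? The format is actually named IQ4_NL ("non-linear"): a 4-bit I-quant in llama.cpp that uses 32-weight blocks. IQ4_XS is its super-block sibling at a lower bit rate, so IQ4_XS is usually the better pick when both are available. Stick to K-quants and the common IQ-quants for production.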
Q: How does AWQ or GPTQ compare? AWQ and GPTQ (from the vLLM/Transformers world) are different quantization families that use calibration data. For pure inference quality at 4-bit, AWQ is often reported to edge out Q4_K_M slightly. For inference speed on consumer cards and llama.cpp compatibility, Q4_K_M wins. AWQ models are usually served through vLLM, which officially targets GPUs with compute capability 7.0 or higher, so Pascal (6.x) is left out.
Q: KV cache quantization (q8_0 KV) — does that change the picture? Yes — for long context, quantizing the KV cache to q8_0 saves substantial VRAM at minor quality cost. See Llama.cpp KV Cache Quantization for Old GPUs for the deep dive.
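Q: Why does Q4_K_M sometimes outperform Q5_K_S in benchmarks? Mixed precision within K-quants. Q4_K_M keeps some of the most sensitive tensors at a higher-bit type, while Q5_K_S uses 5-bit throughout. For some models and tasks that trade works in Q4_K_M's favor, and on small benchmark samples the difference can also be noise. In the measurements here, Q5_K_S stayed ahead of Q4_K_M on every quality metric.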
Q: Does this comparison hold for 70B+ models? Pattern holds qualitatively. For 70B specifically, Q4_K_M perplexity gap from Q8 is smaller (~1.0% on English) — larger models are more quantization-robust. IQ-quants become more attractive at 70B because absolute VRAM savings are bigger.
Closing — The Default Rules
For 2026 LocalLLaMA users:
- RTX 3090 / 4090 with VRAM headroom: Q5_K_M or Q4_K_M
- RTX 3090 / 4090 tight on VRAM: Q4_K_M (sweet spot)
- 2× GTX 1080 Ti split: Q4_K_M (IQ slower on Pascal)
- Single GTX 1080 Ti for 13B+: Q4_K_M, no IQ
- 70B on 24 GB: IQ2_XS-class quants (the only tier that fits, and quality is the tradeoff); IQ4_XS at 38 GB needs 2× 24 GB or a single 48 GB card
Default for 95% of cases: Q4_K_M. Step up to Q5_K_M when VRAM allows and quality matters. Step down to IQ4_XS only on Ampere when VRAM is desperately tight. Q3 and below are emergency-only.
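These rules can be written down as a small chooser. The sketch below is one possible encoding, not a definitive policy: it walks a preference order (Q5_K_M if quality matters, then Q4_K_M, then IQ4_XS on non-Pascal cards only, then Q3_K_M as the emergency option) and returns the first quant whose measured VRAM need fits. The function name and the treatment of the card's full capacity as usable are assumptions.

```python
from typing import Dict, Optional


def pick_quant(vram_needed_gb: Dict[str, float], vram_gb: float,
               is_pascal: bool, quality_matters: bool = False) -> Optional[str]:
    """Return the first quant in preference order that fits, or None."""
    preference = []
    if quality_matters:
        preference.append("Q5_K_M")   # step up when VRAM allows
    preference.append("Q4_K_M")       # the default
    if not is_pascal:
        preference.append("IQ4_XS")   # only where the speed penalty is small
    preference.append("Q3_K_M")       # emergency only
    for quant in preference:
        needed = vram_needed_gb.get(quant)
        if needed is not None and needed <= vram_gb:
            return quant
    return None


# Qwen 3 14B total VRAM at 4K context, from the RTX 3090 table above.
QWEN3_14B_VRAM_4K = {"Q5_K_M": 11.2, "Q4_K_M": 9.8, "IQ4_XS": 9.2, "Q3_K_M": 8.6}

print(pick_quant(QWEN3_14B_VRAM_4K, 11.0, is_pascal=True))
print(pick_quant(QWEN3_14B_VRAM_4K, 24.0, is_pascal=False, quality_matters=True))
```

In practice, leave some margin below the card's nominal capacity, since the desktop and the CUDA context also take VRAM.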
Related posts:
- Running Modern LLMs on GTX 1080 Ti in 2026 — What Still Works
- Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti
- llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition
- Best Ollama Models for RTX 3090 24GB in 2026: Real Benchmarks
- LLM VRAM Calculator
References:
- llama.cpp quantization documentation: https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md
- K-quants paper trail: https://github.com/ggerganov/llama.cpp/pull/1684
- IQ-quants introduction: https://github.com/ggerganov/llama.cpp/pull/4773
- LocalLLaMA quantization threads, 2024-2026