GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M (2026 Real Quality + Speed)

Quick Answer (TL;DR)

Which GGUF quantization should I use for local LLM inference in 2026?

Default for 95% of cases: Q4_K_M. Perplexity is within 1.4-2.1% of Q8_0 on English text, it fits most models on consumer 24 GB GPUs, and it is among the fastest options on both Ampere and Pascal
Best quality if VRAM allows: Q5_K_M. Roughly 1% lower perplexity than Q4_K_M (within 0.5-0.8% of Q8_0), for ~13-15% more VRAM
Tightest VRAM (Ampere only): IQ4_XS. 5-10% smaller than Q4_K_M, but 15-25% slower on Pascal (1080 Ti, P40) due to missing SIMD support
Avoid in 2026: Q4_0 (legacy, dominated by K-quants), Q3 and below (noticeable quality drop)

The Pascal-specific caveat: on GTX 1080 Ti or P40, prefer Q4_K_M over IQ-quants even when IQ is smaller, because IQ-quant decode uses SIMD instructions Pascal lacks.

Definition

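GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and Ollama for storing quantized LLM weights. Quantization reduces each weight from 16-bit floating point to roughly 2-8 bits, stored as small integers plus per-block scale factors, which cuts VRAM sharply at a small quality cost. The "K-quants" (Q2_K through Q6_K) group weights into super-blocks with per-sub-block scales, and their _S/_M/_L variants keep selected tensors at higher precision (see the K-quants PR under References); "IQ-quants" (IQ1 through IQ4) use non-uniform value grids and are usually built with calibration data (an importance matrix) to preserve quality at lower bit rates (see the IQ-quants PR under References). Q8_0 is a simpler non-K block format that serves here as the near-lossless baseline.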
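
To make the idea concrete, here is a minimal Python sketch of block-wise quantization in the style of Q8_0: 32 weights share one scale, and each weight is stored as an 8-bit integer code. The function names are illustrative and are not llama.cpp's API. The real format stores the scale as a 16-bit float, which this sketch does not emulate.

```python
import random

BLOCK_SIZE = 32  # weights sharing one scale, as in a Q8_0-style layout


def quantize_block(weights):
    """Map one block of floats to int8 codes plus a single float scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate float weights from the stored codes."""
    return [scale * c for c in codes]


def bits_per_weight(block_size=BLOCK_SIZE, code_bits=8, scale_bits=16):
    """Effective storage cost, counting the per-block scale."""
    return (block_size * code_bits + scale_bits) / block_size


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"effective bits/weight: {bits_per_weight()}")
print(f"worst-case error: {worst:.6f} (scale/2 = {scale / 2:.6f})")
```

With a 16-bit scale per 32 weights, the effective cost is 8.5 bits per weight rather than 8, which is why a Q8_0 file comes out slightly larger than a naive "one byte per weight" estimate.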

The Question Every LocalLLaMA User Asks Once

You picked your model. Now Ollama / llama.cpp gives you 12 different .gguf files to choose from. Each shaves a fraction of a percent off quality for a chunk of VRAM saved. Which one matters?

The popular tutorials say "Q4_K_M is the sweet spot." That's defensible but incomplete. The right answer depends on:

Whether your GPU has tensor cores (Volta, RTX-class Turing, Ampere and newer) or not (Pascal, GTX 16-series Turing)
Whether your model is dense or MoE
Whether you care about long-form generation quality or just throughput

This guide measures seven common quantizations (Q8_0, Q5_K_M, Q5_K_S, Q4_K_M, Q4_K_S, IQ4_XS, Q3_K_M) on two real models (Llama 3.1 8B, Qwen 3 14B) across two GPUs (RTX 3090, GTX 1080 Ti), reporting perplexity, MMLU-Pro accuracy, VRAM, and tokens/sec for each. Q5_K_S was measured on Llama 3.1 8B only.

For the broader context on running these on old hardware, see Running Modern LLMs on GTX 1080 Ti in 2026.

Quick Background — What These Quants Actually Are

Quant	Bits/weight	Tier	Notes
Q8_0	8.5	High	Nearly lossless; baseline for quality comparison
Q5_K_M	~5.7 avg	Medium-High	k-quant with mixed precision; "M" = medium mix, with selected tensors kept at a higher-bit type
Q5_K_S	~5.5 avg	Medium-High	k-quant "small" mix; ~3% smaller, slight quality drop
Q4_K_M	4.85 avg	Sweet spot	The default recommendation for good reason
Q4_K_S	4.5 avg	Medium	Smaller than Q4_K_M, somewhat lower quality
IQ4_XS	4.25 avg	Medium	i-quant with a non-linear 4-bit grid; newer, often beats Q4_K_S
Q3_K_M	3.9 avg	Low	Notable quality drop; for VRAM-starved scenarios
Q2_K	2.6 avg	Avoid	Substantial degradation

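The _K_ family (k-quants) stores weights in super-blocks with per-sub-block scales. The _M and _S suffixes refer to the medium and small mixes within that family, not to block sizes: the _M variants keep some of the more sensitive tensors at a higher-bit type, while the _S variants use the base type for nearly everything. IQ-quants (often glossed as "importance quants") use non-uniform value grids and are usually built with calibration data (an importance matrix) to preserve quality at lower bit rates.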
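
The bits-per-weight figures translate into file size with simple arithmetic: parameters × bits / 8. The sketch below assumes about 8.03 billion parameters for Llama 3.1 8B (an assumption, not a number from the tests) and uses average bits/weight values; `estimated_size_gb` is a hypothetical helper name.

```python
def estimated_size_gb(n_params, bits_per_weight):
    """Rough GGUF size in GB (10^9 bytes) from parameter count and average bits."""
    return n_params * bits_per_weight / 8 / 1e9


LLAMA_8B_PARAMS = 8.03e9  # assumption: approximate parameter count

# Q8_0 costs 8.5 bits: 8-bit codes plus one 16-bit scale per 32 weights.
for name, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.85), ("IQ4_XS", 4.25)]:
    print(f"{name}: ~{estimated_size_gb(LLAMA_8B_PARAMS, bpw):.1f} GB")
```

Treat the result as a ballpark. Real files can come out somewhat larger than the nominal average suggests, because some tensors are stored at a different precision than the bulk of the weights.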

Test Setup

Models:

Llama 3.1 8B Instruct (each quant ~4-8.5 GB)
Qwen 3 14B Instruct (each quant ~7-14.5 GB)

Hardware:

RTX 3090 24 GB (Ampere, tensor cores)
GTX 1080 Ti 11 GB (Pascal, no tensor cores)

Software: llama.cpp build b3500+ via Ollama 0.6.x

Tests:

Perplexity on wikitext-2 (English) and a Korean corpus (Wikipedia-ko subset) — lower is better
MMLU-Pro accuracy (10-question sample × 14 subjects = 140 questions) — higher is better
Tokens/sec (256-token prompt, 256-token generation, batch=1)
VRAM used (nvidia-smi peak during inference)
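
For readers who want to reproduce the quality columns, perplexity is the exponential of the mean negative log-likelihood per token, and the percentages in the tables are the relative gap to the Q8_0 baseline. A minimal sketch, assuming you already have per-token natural-log probabilities from your runtime (both function names are illustrative):

```python
import math


def perplexity(token_logprobs):
    """exp(-mean log p) over a sequence of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_gap_pct(ppl, baseline_ppl):
    """Percent increase in perplexity relative to a baseline quant."""
    return (ppl / baseline_ppl - 1.0) * 100.0


# Sanity check: a model that gives every token probability 1/4 has PPL 4.
print(perplexity([math.log(0.25)] * 8))

# Llama 3.1 8B, wikitext: Q4_K_M (6.51) against Q8_0 (6.42).
print(f"{relative_gap_pct(6.51, 6.42):+.1f}%")
```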

Llama 3.1 8B Results

Quality (perplexity + MMLU-Pro)

Quant	Size	wikitext PPL	KR PPL	MMLU-Pro
Q8_0 (baseline)	8.5 GB	6.42	8.71	47.1%
Q5_K_M	5.7 GB	6.45 (+0.5%)	8.78 (+0.8%)	46.4%
Q5_K_S	5.5 GB	6.48 (+0.9%)	8.82 (+1.3%)	46.0%
Q4_K_M	4.9 GB	6.51 (+1.4%)	8.91 (+2.3%)	45.7%
IQ4_XS	4.5 GB	6.54 (+1.9%)	8.95 (+2.8%)	45.5%
Q4_K_S	4.7 GB	6.58 (+2.5%)	9.02 (+3.6%)	45.1%
Q3_K_M	4.0 GB	6.83 (+6.4%)	9.45 (+8.5%)	42.8%
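
One caution when reading the MMLU-Pro column: with only 140 questions, sampling noise is large relative to the gaps between quants. A quick sketch using the normal approximation to the binomial (helper names are illustrative):

```python
import math


def accuracy_standard_error(accuracy, n_questions):
    """Standard error of an accuracy estimate from n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def ci95_halfwidth_pct(accuracy, n_questions):
    """Half-width of an approximate 95% interval, in percentage points."""
    return 1.96 * accuracy_standard_error(accuracy, n_questions) * 100.0


# Q8_0 baseline from the table: 47.1% on 140 questions.
print(f"+/- {ci95_halfwidth_pct(0.471, 140):.1f} points")
```

At about 47% accuracy the interval is roughly ±8 points, so sub-1-point MMLU-Pro differences between adjacent quants should be read as "no measurable difference" at this sample size. Because every quant answers the same questions, paired comparisons are somewhat tighter than this unpaired estimate, but perplexity remains the more sensitive signal here.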

Speed on RTX 3090

Quant	tokens/sec	VRAM total (w/ 4K ctx)
Q8_0	86	9.8 GB
Q5_K_M	92	7.0 GB
Q5_K_S	94	6.8 GB
Q4_K_M	96	6.2 GB
IQ4_XS	88	5.8 GB
Q4_K_S	99	6.0 GB
Q3_K_M	102	5.3 GB

Speed on GTX 1080 Ti (no tensor cores)

Quant	tokens/sec	VRAM total
Q8_0	19	9.8 GB
Q5_K_M	21	7.0 GB
Q5_K_S	22	6.8 GB
Q4_K_M	25	6.2 GB
IQ4_XS	19	5.8 GB
Q4_K_S	26	6.0 GB
Q3_K_M	28	5.3 GB

Notable observations

IQ4_XS is slower than Q4_K_M on Pascal (-24%) despite using less VRAM. Pascal lacks the SIMD instructions IQ-quants rely on for fast decode. On Ampere the gap is only -8%, which is much more competitive.
Q4_K_M is the speed-quality sweet spot on most hardware. It loses only 1.4-2.3% on perplexity, fits in less VRAM than Q5, and runs within about 3-4% of the smaller Q4_K_S.
Q3_K_M's 6.4% perplexity jump is noticeable in generation. Below the 4-bit tier, output starts feeling "off": repetition and small factual errors. At Q4_K_M and above, quality is hard to distinguish from Q8_0 for most uses.
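
The speed comparisons above are plain ratios against Q4_K_M. A short sketch that recomputes them from the Llama 3.1 8B tables (the dictionary layout and function name are just for illustration):

```python
# tokens/sec copied from the two Llama 3.1 8B speed tables
LLAMA_8B_TOKS = {
    "RTX 3090": {"Q4_K_M": 96, "IQ4_XS": 88, "Q4_K_S": 99},
    "GTX 1080 Ti": {"Q4_K_M": 25, "IQ4_XS": 19, "Q4_K_S": 26},
}


def relative_speed_pct(toks, quant, baseline="Q4_K_M"):
    """Speed of `quant` relative to the baseline, as a signed percentage."""
    return (toks[quant] / toks[baseline] - 1.0) * 100.0


for gpu, toks in LLAMA_8B_TOKS.items():
    for quant in ("IQ4_XS", "Q4_K_S"):
        print(f"{gpu:12s} {quant}: {relative_speed_pct(toks, quant):+.0f}%")
```

Running it shows the IQ4_XS penalty growing from about 8% on the 3090 to about 24% on the 1080 Ti, while Q4_K_S stays a few percent ahead of Q4_K_M on both cards.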

Qwen 3 14B Results

(Tighter on 1080 Ti — Q5 and above OOMs without dual-GPU split.)

Quality

Quant	Size	wikitext PPL	KR PPL	MMLU-Pro
Q8_0	14.5 GB	5.18	6.92	58.4%
Q5_K_M	9.7 GB	5.22 (+0.8%)	6.99 (+1.0%)	57.6%
Q4_K_M	8.4 GB	5.29 (+2.1%)	7.11 (+2.7%)	57.0%
IQ4_XS	7.6 GB	5.32 (+2.7%)	7.18 (+3.8%)	56.6%
Q4_K_S	8.0 GB	5.37 (+3.7%)	7.25 (+4.8%)	56.1%
Q3_K_M	7.0 GB	5.61 (+8.3%)	7.62 (+10.1%)	53.2%

Speed on RTX 3090

Quant	tokens/sec	VRAM (4K ctx)
Q8_0	41	15.8 GB
Q5_K_M	48	11.2 GB
Q4_K_M	52	9.8 GB
IQ4_XS	46	9.2 GB
Q4_K_S	53	9.5 GB
Q3_K_M	56	8.6 GB

Speed on GTX 1080 Ti (single card)

Quant	tokens/sec	Fit?
Q8_0	—	OOM
Q5_K_M	—	OOM (11.2 GB)
Q4_K_M	13	Just fits (9.8 GB)
IQ4_XS	11	Fits (9.2 GB)
Q4_K_S	14	Fits (9.5 GB)
Q3_K_M	16	Comfortable (8.6 GB)

For a single 1080 Ti, Q4_K_M is the practical ceiling for Qwen 3 14B: Q5_K_M and above run out of memory, and Q4_K_M itself leaves only about 1 GB of headroom.
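
The same "what fits?" reasoning can be written as a small lookup: keep the quants whose measured VRAM fits the card, then take the one with the lowest perplexity. This sketch uses the Qwen 3 14B numbers from the tables above; the 0.5 GB headroom is an arbitrary assumption and `best_fitting_quant` is a hypothetical helper.

```python
# (quant, VRAM in GB at 4K context, wikitext PPL) from the Qwen 3 14B tables
QWEN3_14B = [
    ("Q8_0", 15.8, 5.18),
    ("Q5_K_M", 11.2, 5.22),
    ("Q4_K_M", 9.8, 5.29),
    ("IQ4_XS", 9.2, 5.32),
    ("Q4_K_S", 9.5, 5.37),
    ("Q3_K_M", 8.6, 5.61),
]


def best_fitting_quant(rows, vram_gb, headroom_gb=0.5):
    """Lowest-perplexity quant whose VRAM need plus headroom fits the card."""
    fitting = [row for row in rows if row[1] + headroom_gb <= vram_gb]
    if not fitting:
        return None
    return min(fitting, key=lambda row: row[2])[0]


print(best_fitting_quant(QWEN3_14B, 11.0))  # single GTX 1080 Ti
print(best_fitting_quant(QWEN3_14B, 24.0))  # RTX 3090
```

This ranks by quality only. On Pascal you would also want to drop the IQ entries before ranking, for the speed reasons discussed above.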

When Each Quant Wins

Pick Q8_0 when:

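You have abundant VRAM (per the tables above, Q8_0 needs about 10 GB for 8B and 16 GB for 14B at 4K context, before any long-context headroom)
Quality is critical (production deployments, sensitive evals)
The 1.4-2.1% perplexity gap to Q4_K_M matters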

Pick Q5_K_M when:

Q8 doesn't fit but you have headroom for the medium tier
The roughly 1% perplexity edge over Q4_K_M matters for your downstream task
Sweet spot when you have 16 GB+ VRAM

Pick Q4_K_M when (default for most cases):

VRAM is the constraint
You need long context (saves VRAM for KV cache)
Compute is decent (RTX 3090 or newer); it is also the right pick on Pascal, where IQ-quants pay a speed penalty

Pick IQ4_XS when:

VRAM extremely tight (the IQ tier saves another 5-10%)
You're on Ampere+ hardware (Pascal pays a speed penalty)
For 70B+ models trying to fit on 2× 24 GB or single 48 GB

Avoid Q3 and below when:

Quality matters in any user-facing way
You have any reasonable alternative
Q3 is for "model bigger than my VRAM" emergency only

The Pascal-Specific Caveat

GTX 1080 Ti, P40, and other Pascal GPUs lack the SIMD instructions IQ-quants use for fast decode. Empirical pattern:

Quant	Pascal speed (% of Q4_K_M speed)
Q4_K_M	100%
IQ4_XS	75-85%
Q4_K_S	104-108%

Translation: on a 1080 Ti or P40, prefer Q4_K_M over IQ-quants even when IQ is smaller. Save the IQ-quants for Ampere+, where the speed penalty shrinks to roughly 8-12%.

For dual-GPU 1080 Ti setups (see Ollama Dual GPU Without NVLink), the rule holds — Q4_K_M with split-mode layer is fastest.

Quality Beyond Benchmarks

Perplexity and MMLU capture some aspects of quality but miss others. Three important nuances:

Long-form generation degradation

At Q4_K_M and above, generations of 500-2000 tokens are essentially indistinguishable from Q8 in human judgment. At Q3_K_M, you start seeing:

Subtle factual drift (proper nouns slightly wrong)
Increased repetition rate
Loss of nuance in long arguments

The PPL number doesn't capture this well — it's an average across tokens. The long-generation pattern is what your users notice.

Multilingual quality

Non-English languages tend to degrade faster at low quantization. Korean perplexity gaps (8.71 → 8.91 for Llama 3.1 Q4_K_M = +2.3%) are larger than English (6.42 → 6.51 = +1.4%). For Qwen 3 14B, KR PPL gap is +2.7% vs English +2.1%.

If you're serving non-English at low quant, test on real samples in your target language — don't trust only English perplexity.

Coding accuracy

For code generation specifically, quality drops more at low quants than for natural language. Q4_K_M for coding is usually OK; Q3 produces noticeably more syntax errors and logical bugs. If coding is your primary use, stay at Q5_K_M or higher when possible.

Disk Space Reality

If you keep multiple quants of the same model around (a common LocalLLaMA pattern), the disk math adds up fast:

Model (sizes in GB)	Q8_0	Q5_K_M	Q4_K_M	IQ4_XS	Q3_K_M	Q2_K
Llama 3.1 8B	8.5	5.7	4.9	4.5	4.0	3.2
Qwen 3 14B	14.5	9.7	8.4	7.6	7.0	5.4
Mixtral 8×7B	47	32	27	25	22	17
Llama 3.1 70B	75	50	42	38	34	26

Keeping one Q8_0 for high-quality work plus one Q4_K_M for routine use is a sensible disk policy. An intermediate Q5_K_M is hard to justify keeping on top of those two.
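
To see how quickly the disk math adds up, the sketch below totals the table for two policies: keeping every listed quant versus keeping only Q8_0 and Q4_K_M. The sizes are the ones from the table; `disk_usage_gb` is an illustrative helper.

```python
# File sizes in GB, copied from the table above
SIZES_GB = {
    "Llama 3.1 8B": {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.9,
                     "IQ4_XS": 4.5, "Q3_K_M": 4.0, "Q2_K": 3.2},
    "Qwen 3 14B": {"Q8_0": 14.5, "Q5_K_M": 9.7, "Q4_K_M": 8.4,
                   "IQ4_XS": 7.6, "Q3_K_M": 7.0, "Q2_K": 5.4},
    "Mixtral 8x7B": {"Q8_0": 47, "Q5_K_M": 32, "Q4_K_M": 27,
                     "IQ4_XS": 25, "Q3_K_M": 22, "Q2_K": 17},
    "Llama 3.1 70B": {"Q8_0": 75, "Q5_K_M": 50, "Q4_K_M": 42,
                      "IQ4_XS": 38, "Q3_K_M": 34, "Q2_K": 26},
}


def disk_usage_gb(sizes, keep=None):
    """Total GB across all models, optionally limited to the quants in `keep`."""
    total = 0.0
    for quants in sizes.values():
        total += sum(gb for q, gb in quants.items() if keep is None or q in keep)
    return total


print(f"every quant:       {disk_usage_gb(SIZES_GB):.1f} GB")
print(f"Q8_0 + Q4_K_M only: {disk_usage_gb(SIZES_GB, keep={'Q8_0', 'Q4_K_M'}):.1f} GB")
```

For these four models, the two-quant policy needs well under half the space of keeping everything.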

FAQ

Q: When should I use the low-bit I-quants (IQ1, IQ2, IQ3)? For 70B+ models on consumer hardware where even Q4_K_M won't fit. IQ2_XS lets Llama 70B fit in 24 GB, but quality degrades noticeably. They are useful when the alternative is "can't run at all", not for daily use.

Q: Does the model architecture matter? MoE vs dense? Yes. MoE models (Mixtral, Qwen3-30B-A3B) tend to degrade differently at low quant — the routing decisions are sensitive. Stay at Q4_K_M or higher for MoE; Q3 is risky.

Q: Q4_K_M vs Q4_0 (older format)? Q4_K_M is uniformly better than Q4_0 (the legacy fixed-bit format). Q4_0 only exists for backward compatibility. Don't use Q4_0 in 2026.

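Q: What about Q4_NL or other exotic quants? The format usually meant here is IQ4_NL ("non-linear"), a relative of IQ4_XS that ships in llama.cpp. It is larger than IQ4_XS (about 4.5 vs 4.25 bits/weight), so it gives up the main reason to pick an IQ4 quant. Stick to the mainstream K-quants and IQ-quants for production.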

Q: How does AWQ or GPTQ compare? AWQ and GPTQ (from the vLLM/Transformers world) are different quantization families using calibration data. For pure inference quality at 4-bit, AWQ slightly edges Q4_K_M on Ampere+. For inference speed and llama.cpp compatibility, Q4_K_M wins. AWQ is typically served through vLLM, which requires compute capability 7.0 or higher and therefore does not run on Pascal.

Q: KV cache quantization (q8_0 KV) — does that change the picture? Yes — for long context, quantizing the KV cache to q8_0 saves substantial VRAM at minor quality cost. See Llama.cpp KV Cache Quantization for Old GPUs for the deep dive.
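
To get a feel for the savings, KV cache size can be estimated as 2 (K and V) × layers × KV heads × head dimension × context length × bytes per element. The sketch below assumes Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128) and treats q8_0 as 8.5 bits per element; it ignores runtime overhead, and the helper name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_element):
    """Approximate KV cache size: K and V tensors for every layer and position."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * bits_per_element / 8


# Assumption: Llama 3.1 8B architecture values, not measured in this article.
LLAMA_31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)
GIB = 1024 ** 3

for n_ctx in (4096, 32768):
    f16 = kv_cache_bytes(n_ctx=n_ctx, bits_per_element=16, **LLAMA_31_8B)
    q8 = kv_cache_bytes(n_ctx=n_ctx, bits_per_element=8.5, **LLAMA_31_8B)
    print(f"ctx {n_ctx}: f16 {f16 / GIB:.2f} GiB, q8_0 {q8 / GIB:.2f} GiB")
```

Under these assumptions the cache is about 0.5 GiB at 4K context and 4 GiB at 32K in f16, and q8_0 cuts that by a little under half, which is why the saving only becomes significant at long context.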

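Q: Why does Q4_K_M sometimes outperform Q5_K_S in benchmarks? Mixed precision within K-quants. Q4_K_M keeps some of the most sensitive tensors at a higher-bit type, while Q5_K_S uses its 5-bit base type almost everywhere, so on some models and tasks the two land close together and a small benchmark sample can flip the order. In the measurements above, Q5_K_S stayed ahead of Q4_K_M on perplexity.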

Q: Does this comparison hold for 70B+ models? Pattern holds qualitatively. For 70B specifically, Q4_K_M perplexity gap from Q8 is smaller (~1.0% on English) — larger models are more quantization-robust. IQ-quants become more attractive at 70B because absolute VRAM savings are bigger.

Closing — The Default Rules

For 2026 LocalLLaMA users:

RTX 3090 / 4090 with VRAM headroom: Q5_K_M or Q4_K_M
RTX 3090 / 4090 tight on VRAM: Q4_K_M (sweet spot)
2× GTX 1080 Ti split: Q4_K_M (IQ slower on Pascal)
Single GTX 1080 Ti for 13B+: Q4_K_M, no IQ
70B on 24 GB: IQ2_XS (about the only option that fits; quality is the tradeoff). IQ3_M or IQ4_XS need 2× 24 GB or a 48 GB card

Default for 95% of cases: Q4_K_M. Step up to Q5_K_M when VRAM allows and quality matters. Step down to IQ4_XS only on Ampere when VRAM is desperately tight. Q3 and below are emergency-only.

References:

llama.cpp quantization documentation: https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/README.md
K-quants paper trail: https://github.com/ggerganov/llama.cpp/pull/1684
IQ-quants introduction: https://github.com/ggerganov/llama.cpp/pull/4773
LocalLLaMA quantization threads, 2024-2026

Related posts:

Ollama vs LM Studio vs llama.cpp: Honest 2026 Comparison for Local LLM

Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → ~75 tok/s)

Building a Fully-Local Research RAG on 2× GTX 1080 Ti + an RTX 3090: 3 Gotchas (CPU Embeddings, the Context Trap, and Not Merging GPUs)

Running Brand-New Gemma 4 12B on an 8-Year-Old GTX 1080 Ti: Speed, 3 Gotchas, and Why Q8 Beat Q4 on My Own Field