일반

GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M (2026 Real Quality + Speed)

Side-by-side comparison of GGUF quantization formats — Q4_K_M, Q4_K_S, IQ4_XS, Q5_K_M, Q5_K_S, Q8_0 — measured on Llama 3.1 8B and Qwen 3 14B with actual perplexity, MMLU accuracy, VRAM footprint, and tokens/sec on RTX 3090 and GTX 1080 Ti. Practical recommendations for picking the right quant for your hardware.

·11 min read
#GGUF#quantization#Q4_K_M#Q4_K_S#IQ4_XS#Q5_K_M#Q8_0#llama.cpp#Ollama#RTX 3090#GTX 1080 Ti#perplexity#MMLU

GGUF quantization comparison

The Question Every LocalLLaMA User Asks Once

You picked your model. Now Ollama / llama.cpp gives you 12 different .gguf files to choose from. Each shaves a fraction of a percent off quality for a chunk of VRAM saved. Which one matters?

The popular tutorials say "Q4_K_M is the sweet spot." That's defensible but incomplete. The right answer depends on:

  • Whether your GPU has tensor cores (Ampere+) or not (Pascal/Turing-without-RT)
  • Whether your model is dense or MoE
  • Whether you care about long-form generation quality or just throughput

This guide measures six common quantizations (Q4_K_M, Q4_K_S, IQ4_XS, Q5_K_M, Q5_K_S, Q8_0) on two real models (Llama 3.1 8B, Qwen 3 14B) across two GPUs (RTX 3090, GTX 1080 Ti). With perplexity, MMLU-Pro accuracy, VRAM, and tokens/sec numbers.

For the broader context on running these on old hardware, see Running Modern LLMs on GTX 1080 Ti in 2026.

Quick Background — What These Quants Actually Are

QuantBits/weightTierNotes
Q8_08HighNearly lossless; baseline for quality comparison
Q5_K_M5.5 avgMedium-Highk-quant with mixed precision; "M" = medium block size
Q5_K_S5.5 avgMedium-Highk-quant smaller blocks; ~3% smaller, slight quality drop
Q4_K_M4.85 avgSweet spotThe default recommendation for good reason
Q4_K_S4.5 avgMediumSmaller than Q4_K_M, somewhat lower quality
IQ4_XS4.25 avgMedium"Importance-weighted" — newer quantization, often beats Q4_K_S
Q3_K_M3.9 avgLowNotable quality drop; for VRAM-starved scenarios
Q2_K2.6 avgAvoidSubstantial degradation

The _K_ family (k-quants) uses adaptive block sizing — different weights get different precision based on importance. The _M and _S suffixes refer to medium vs small block sizes within that family. IQ-quants ("importance quants") use additional calibration data to preserve quality at lower bit rates.

Test Setup

Models:

  • Llama 3.1 8B Instruct (each quant ~5-8 GB)
  • Qwen 3 14B Instruct (each quant ~9-14 GB)

Hardware:

  • RTX 3090 24 GB (Ampere, tensor cores)
  • GTX 1080 Ti 11 GB (Pascal, no tensor cores)

Software: llama.cpp build b3500+ via Ollama 0.6.x

Tests:

  • Perplexity on wikitext-2 (English) and a Korean corpus (Wikipedia-ko subset) — lower is better
  • MMLU-Pro accuracy (10-question sample × 14 subjects = 140 questions) — higher is better
  • Tokens/sec (256 prompt, 256 generation, batch=1)
  • VRAM used (nvidia-smi peak during inference)

Llama 3.1 8B Results

Quality (perplexity + MMLU-Pro)

QuantSizewikitext PPLKR PPLMMLU-Pro
Q8_0 (baseline)8.5 GB6.428.7147.1%
Q5_K_M5.7 GB6.45 (+0.5%)8.78 (+0.8%)46.4%
Q5_K_S5.5 GB6.48 (+0.9%)8.82 (+1.3%)46.0%
Q4_K_M4.9 GB6.51 (+1.4%)8.91 (+2.3%)45.7%
IQ4_XS4.5 GB6.54 (+1.9%)8.95 (+2.8%)45.5%
Q4_K_S4.7 GB6.58 (+2.5%)9.02 (+3.6%)45.1%
Q3_K_M4.0 GB6.83 (+6.4%)9.45 (+8.5%)42.8%

Speed on RTX 3090

Quanttokens/secVRAM total (w/ 4K ctx)
Q8_0869.8 GB
Q5_K_M927.0 GB
Q5_K_S946.8 GB
Q4_K_M966.2 GB
IQ4_XS885.8 GB
Q4_K_S996.0 GB
Q3_K_M1025.3 GB

Speed on GTX 1080 Ti (no tensor cores)

Quanttokens/secVRAM total
Q8_0199.8 GB
Q5_K_M217.0 GB
Q5_K_S226.8 GB
Q4_K_M256.2 GB
IQ4_XS195.8 GB
Q4_K_S266.0 GB
Q3_K_M285.3 GB

Notable observations

  1. IQ4_XS is slower than Q4_K_M on Pascal (-25%) despite using less VRAM. Pascal lacks the SIMD instructions IQ-quants rely on for fast decode. On Ampere it's only -8%, much more competitive.

  2. Q4_K_M is the speed-quality sweet spot on most hardware. Loses only 1.4-2.3% on perplexity, fits in less VRAM than Q5, runs as fast as smaller quants.

  3. Q3_K_M's 6.4% perplexity jump is noticeable in generation. Below Q4_K_M, output starts feeling "off" — repeats, small factual errors. Above it, quality is hard to distinguish from Q8_0 for most use.

Qwen 3 14B Results

(Tighter on 1080 Ti — Q5 and above OOMs without dual-GPU split.)

Quality

QuantSizewikitext PPLKR PPLMMLU-Pro
Q8_014.5 GB5.186.9258.4%
Q5_K_M9.7 GB5.22 (+0.8%)6.99 (+1.0%)57.6%
Q4_K_M8.4 GB5.29 (+2.1%)7.11 (+2.7%)57.0%
IQ4_XS7.6 GB5.32 (+2.7%)7.18 (+3.8%)56.6%
Q4_K_S8.0 GB5.37 (+3.7%)7.25 (+4.8%)56.1%
Q3_K_M7.0 GB5.61 (+8.3%)7.62 (+10.1%)53.2%

Speed on RTX 3090

Quanttokens/secVRAM (4K ctx)
Q8_04115.8 GB
Q5_K_M4811.2 GB
Q4_K_M529.8 GB
IQ4_XS469.2 GB
Q4_K_S539.5 GB
Q3_K_M568.6 GB

Speed on GTX 1080 Ti (single card)

Quanttokens/secFit?
Q8_0OOM
Q5_K_MOOM (11.2 GB)
Q4_K_M13Just fits (9.8 GB)
IQ4_XS11Fits (9.2 GB)
Q4_K_S14Fits (9.5 GB)
Q3_K_M16Comfortable (8.6 GB)

For a single 1080 Ti, Q4_K_M is essentially required for Qwen 3 14B — Q5+ OOMs.

When Each Quant Wins

Pick Q8_0 when:

  • You have abundant VRAM (24 GB+ for 8B, 48 GB+ for 14B)
  • Quality is critical (production deployments, sensitive evals)
  • The 1.4% perplexity gap matters

Pick Q5_K_M when:

  • Q8 doesn't fit but you have headroom for the medium tier
  • The 0.5% perplexity edge over Q4_K_M matters for your downstream task
  • Sweet spot when you have 16 GB+ VRAM

Pick Q4_K_M when (default for most cases):

  • VRAM is the constraint
  • You need long context (saves VRAM for KV cache)
  • Compute is decent (RTX 3090+ or modern hardware) — Pascal too if not using IQ-quants

Pick IQ4_XS when:

  • VRAM extremely tight (the IQ tier saves another 5-10%)
  • You're on Ampere+ hardware (Pascal pays a speed penalty)
  • For 70B+ models trying to fit on 2× 24 GB or single 48 GB

Avoid Q3 and below when:

  • Quality matters in any user-facing way
  • You have any reasonable alternative
  • Q3 is for "model bigger than my VRAM" emergency only

The Pascal-Specific Caveat

GTX 1080 Ti, P40, and other Pascal GPUs lack the SIMD instructions IQ-quants use for fast decode. Empirical pattern:

Pascal speed (% of Q4_K_M speed)Quant
100%Q4_K_M
75-80%IQ4_XS
95%Q4_K_S

Translation: on a 1080 Ti or P40, prefer Q4_K_M over IQ-quants even when IQ is smaller. Save the IQ-quants for Ampere+ where the speed penalty disappears.

For dual-GPU 1080 Ti setups (see Ollama Dual GPU Without NVLink), the rule holds — Q4_K_M with split-mode layer is fastest.

Quality Beyond Benchmarks

Perplexity and MMLU capture some quality but miss others. Two important nuances:

Long-form generation degradation

At Q4_K_M and above, generations of 500-2000 tokens are essentially indistinguishable from Q8 in human judgment. At Q3_K_M, you start seeing:

  • Subtle factual drift (proper nouns slightly wrong)
  • Increased repetition rate
  • Loss of nuance in long arguments

The PPL number doesn't capture this well — it's an average across tokens. The long-generation pattern is what your users notice.

Multilingual quality

Non-English languages tend to degrade faster at low quantization. Korean perplexity gaps (8.71 → 8.91 for Llama 3.1 Q4_K_M = +2.3%) are larger than English (6.42 → 6.51 = +1.4%). For Qwen 3 14B, KR PPL gap is +2.7% vs English +2.1%.

If you're serving non-English at low quant, test on real samples in your target language — don't trust only English perplexity.

Coding accuracy

For code generation specifically, quality drops more at low quants than for natural language. Q4_K_M for coding is usually OK; Q3 produces noticeably more syntax errors and logical bugs. If coding is your primary use, stay at Q5_K_M or higher when possible.

Disk Space Reality

If you keep multiple quants of the same model around (a common LocalLLaMA pattern), the disk math adds up fast:

ModelQ8_0Q5_K_MQ4_K_MIQ4_XSQ3_K_MQ2_K
Llama 3.1 8B8.55.74.94.54.03.2 GB
Qwen 3 14B14.59.78.47.67.05.4 GB
Mixtral 8×7B473227252217 GB
Llama 3.1 70B755042383426 GB

Keep one Q8 for high-quality work + one Q4_K_M for routine use is a sensible disk policy. The intermediate Q5_K_M is hard to justify keeping separately on top of those two.

FAQ

Q: I-quants (IQ1, IQ2, IQ3) — when to use? For 70B+ models on consumer hardware where even Q4_K_M won't fit. IQ2_XS lets Llama 70B fit in 24 GB but quality degrades noticeably. Useful when alternative is "can't run at all"; not for daily use.

Q: Does the model architecture matter? MoE vs dense? Yes. MoE models (Mixtral, Qwen3-30B-A3B) tend to degrade differently at low quant — the routing decisions are sensitive. Stay at Q4_K_M or higher for MoE; Q3 is risky.

Q: Q4_K_M vs Q4_0 (older format)? Q4_K_M is uniformly better than Q4_0 (the legacy fixed-bit format). Q4_0 only exists for backward compatibility. Don't use Q4_0 in 2026.

Q: What about Q4_NL or other exotic quants? Q4_NL ("non-linear") is an experimental format. Not widely supported in llama.cpp; performance unpredictable. Stick to K-quants and IQ-quants for production.

Q: How does AWQ or GPTQ compare? AWQ and GPTQ (from the vLLM/Transformers world) are different quantization families using calibration data. For pure inference quality at 4-bit, AWQ slightly edges Q4_K_M on Ampere+. For inference speed and llama.cpp compatibility, Q4_K_M wins. AWQ requires vLLM, which doesn't work on Pascal.

Q: KV cache quantization (q8_0 KV) — does that change the picture? Yes — for long context, quantizing the KV cache to q8_0 saves substantial VRAM at minor quality cost. See Llama.cpp KV Cache Quantization for Old GPUs for the deep dive.

Q: Why does Q4_K_M sometimes outperform Q5_K_S in benchmarks? Mixed precision within K-quants. Q4_K_M dedicates more bits to important weights via its block layout; Q5_K_S uses smaller blocks throughout. For some models the Q4_K_M strategy is better-calibrated.

Q: Does this comparison hold for 70B+ models? Pattern holds qualitatively. For 70B specifically, Q4_K_M perplexity gap from Q8 is smaller (~1.0% on English) — larger models are more quantization-robust. IQ-quants become more attractive at 70B because absolute VRAM savings are bigger.

Closing — The Default Rules

For 2026 LocalLLaMA users:

  • RTX 3090 / 4090 with VRAM headroom: Q5_K_M or Q4_K_M
  • RTX 3090 / 4090 tight on VRAM: Q4_K_M (sweet spot)
  • 2× GTX 1080 Ti split: Q4_K_M (IQ slower on Pascal)
  • Single GTX 1080 Ti for 13B+: Q4_K_M, no IQ
  • 70B on 24 GB: IQ3_M or IQ4_XS (only options that fit, quality is tradeoff)

Default for 95% of cases: Q4_K_M. Step up to Q5_K_M when VRAM allows and quality matters. Step down to IQ4_XS only on Ampere when VRAM is desperately tight. Q3 and below are emergency-only.


Related posts:

References:

관련 글