Running Modern LLMs on GTX 1080 Ti in 2026 — What Still Works, What OOMs

"The 1080 Ti is dead for AI" — Not Quite

Every other LocalLLaMA post in 2026 says the GTX 1080 Ti (released 2017, Pascal architecture, 11 GB GDDR5X) is finished as an LLM card. That's partly true and partly the people who already bought RTX 4090s rationalizing the upgrade.

The truth is more useful. With the right quantization and the right model, a 1080 Ti still runs useful 7-14B parameter models at usable speeds in 2026. It can't touch a 4090 on throughput per dollar of electricity, but if you already own one (or two), it's not the paperweight Twitter wants you to believe.

This guide is based on actually running modern 2026 models — Llama 3.1, Qwen 3, Phi-4, Gemma 3 — on a 1080 Ti rig (2 cards in a server). It covers what loads at usable speed, what OOMs immediately, the quantization sweet spot, and the specific limitations of Pascal that don't show up in benchmarks of newer hardware.

The 1080 Ti — Specs That Matter for LLMs in 2026

Spec	GTX 1080 Ti	For LLMs this means
Architecture	Pascal (GP102)	No FP16 tensor cores (came in Volta+)
VRAM	11 GB GDDR5X	Models > 13B at Q4 don't fit
Memory bandwidth	484 GB/s	Decent; above an RTX 3060 (360 GB/s), near an RTX 3060 Ti (448 GB/s)
FP32 TFLOPS	11.3	Fine for inference
FP16 throughput	~1/64 of FP32	No FP16 speedup; kernels compute in FP32
INT8 throughput	DP4A dot products only	No tensor-core INT8 speedup (unlike RTX)
Power	250 W TDP	A real cost in 2026 electricity prices
CUDA support	up to CUDA 12.x	Modern drivers still work

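The killer detail: Pascal doesn't have tensor cores. RTX cards (Turing onwards) have specialized matmul units that make FP16/INT8/INT4 inference much faster than FP32. On a 1080 Ti, low-precision math gets no such boost: native FP16 arithmetic on GP102 is actually slower than FP32, so inference kernels do their math in FP32 (or integer dot products) regardless of how the weights are stored. Quantization therefore mainly saves VRAM. Smaller weights do mean fewer bytes read per token, which is why Q4 generates somewhat faster than Q8 in the table below, but you don't get the large compute speedup a 3090 or 4090 gets from lower precision.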

What Works — Models That Run Usefully (Tested, 2026)

Models tested with llama.cpp / Ollama, GGUF format, 2K context, single 1080 Ti:

Model	Quant	VRAM Used	Tokens/sec	Verdict
Llama 3.1 8B	Q4_K_M	5.0 GB	22-28	✅ Comfortable, fast
Llama 3.1 8B	Q8_0	8.7 GB	18-22	✅ Best quality on 11 GB
Qwen 3 8B	Q4_K_M	5.1 GB	20-26	✅ Solid Korean + reasoning
Qwen 3 14B	Q4_K_M	8.7 GB	11-14	✅ Tight, but works
Gemma 3 12B	Q4_K_M	7.5 GB	14-18	✅ Long context fine if KV smaller
Phi-4 14B	Q4_K_M	8.9 GB	10-13	✅ Best reasoning per VRAM
Mistral 7B Instruct	Q4_K_M	4.5 GB	28-34	✅ Snappy older standby
Llama 3.1 8B	Q5_K_M	6.1 GB	20-25	✅ Quality sweet spot
DeepSeek-Coder 6.7B	Q4_K_M	4.5 GB	26-32	✅ Coding tasks

(Numbers are warm-cache, batch 1, prompt 256 tokens, generation 256 tokens. Your mileage will vary ±20% by driver/CPU/RAM speed.)

The honest comparison

Llama 3.1 8B Q4_K_M on 1080 Ti: ~25 tokens/sec
Llama 3.1 8B Q4_K_M on RTX 3090: ~95 tokens/sec
Llama 3.1 8B Q4_K_M on RTX 4090: ~140 tokens/sec

The 1080 Ti is roughly 1/4 the speed of a 3090 for the same model. Still readable for a single-user chat workflow.
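To make the comparison concrete, here is a minimal sketch that turns the three throughput figures above into relative speeds and wait times. The 300-token reply length is an illustrative assumption, and prompt processing time is ignored.

```python
# Midpoint throughput figures quoted in the comparison above (tokens/sec).
TOKENS_PER_SEC = {
    "GTX 1080 Ti": 25.0,
    "RTX 3090": 95.0,
    "RTX 4090": 140.0,
}


def relative_speed(card: str, baseline: str) -> float:
    """Throughput of `card` as a fraction of `baseline`."""
    return TOKENS_PER_SEC[card] / TOKENS_PER_SEC[baseline]


def seconds_for_reply(card: str, reply_tokens: int) -> float:
    """Generation time for a reply, ignoring prompt processing."""
    return reply_tokens / TOKENS_PER_SEC[card]


if __name__ == "__main__":
    for card in TOKENS_PER_SEC:
        ratio = relative_speed(card, "RTX 3090")
        wait = seconds_for_reply(card, 300)  # 300 tokens: assumed reply length
        print(f"{card}: {ratio:.2f}x of a 3090, {wait:.1f} s per 300-token reply")
```

At ~25 tokens/sec, a 300-token answer takes about 12 seconds, which is slower than a 3090 but still faster than most people read.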

What Does NOT Work: Models That Just OOM or Crawl

Model	Quant	Result
Llama 3.1 70B	Q4_K_M	❌ OOM (needs ~40 GB)
Llama 3.1 70B	Q2_K	❌ OOM (~26 GB)
Qwen 3 30B-A3B (MoE)	Q4_K_M	⚠️ Loads with CPU offload — 1-3 tokens/sec, painful
Mixtral 8×7B	Q4_K_M	⚠️ Loads with offload — 2-4 tokens/sec
DeepSeek-Coder 33B	Q4_K_M	❌ OOM
Llama 3 70B	IQ2_XXS	⚠️ Loads with CPU offload but 1-2 tokens/sec, quality degraded

Pattern: anything that needs more than ~10 GB after KV cache won't run usefully on a single 1080 Ti. The 30B-A3B MoE models technically load with CPU offload but the throughput is unusable.

This is where the "1080 Ti is dead" narrative comes from. If you've decided you must run 30B+ dense models, the 1080 Ti can't help. But for 7-14B — most of what hobbyists actually need — it's fine.
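The "~10 GB" rule of thumb can be sketched as a quick fit check. The bits-per-weight values are approximations (Q8_0 is 8.5 by construction; the K-quant figures are rough averages that vary by model), and the 1.5 GB KV cache allowance is an assumption, so treat the output as an estimate rather than a guarantee.

```python
# Approximate bits per weight for common GGUF quantizations.
# Q8_0 stores 32 int8 weights plus one fp16 scale per block (8.5 bits/weight);
# the K-quant numbers are rough averages and differ slightly per model.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
}


def weight_gb(params_billion: float, quant: str) -> float:
    """Estimated weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8.0


def fits(params_billion: float, quant: str,
         kv_gb: float = 1.5, limit_gb: float = 10.0) -> bool:
    """True if weights plus an assumed KV cache stay under the ~10 GB limit."""
    return weight_gb(params_billion, quant) + kv_gb <= limit_gb


if __name__ == "__main__":
    for size, quant in [(8.0, "Q4_K_M"), (8.0, "Q8_0"), (70.0, "Q4_K_M")]:
        verdict = "fits" if fits(size, quant) else "does not fit"
        print(f"{size:.0f}B {quant}: ~{weight_gb(size, quant):.1f} GB, {verdict}")
```

Note that an 8B model at Q8_0 lands right at the limit with a 1.5 GB cache, which is why it needs a shorter context than the Q4 and Q5 variants.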

Two 1080 Tis — Stretching to ~22 GB Usable

If you have two 1080 Tis (and PCIe slots for both), splitting a model across both cards with llama.cpp's --split-mode opens up larger models and higher-quality quants:

Model	Quant	VRAM Used (combined)	Tokens/sec	Notes
Llama 3.1 8B	Q8_0	9 GB	25-30	Single card faster (no PCIe bottleneck)
Llama 3.1 13B	Q8_0	14 GB	12-16	Two cards needed
Qwen 3 14B	Q8_0	15 GB	11-15	Single card Q4 is faster than dual Q8
Mixtral 8×7B	Q4_K_M	21 GB	14-18	Dual is the only practical option here
Yi-34B	Q4_K_M	20 GB	7-10	Slow but functional

PCIe is the bottleneck. Two 1080 Tis on x16/x16 slots are faster than on x16/x4. The GTX 1080 Ti has no NVLink, so all traffic between the cards crosses PCIe and every split costs latency.
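As a starting point for a dual-card launch, the sketch below builds a llama-server command line. The flags (-m, -ngl, --split-mode, --tensor-split, -c) are the ones recent llama.cpp builds accept, but check --help on your pinned version. The model path and the even 1,1 split are illustrative assumptions.

```python
import shlex

VALID_SPLIT_MODES = ("none", "layer", "row")


def dual_gpu_command(model_path: str, split_mode: str = "layer",
                     tensor_split: tuple = (1, 1), ctx: int = 4096) -> list:
    """Build an argv list for llama-server spread across two GPUs."""
    if split_mode not in VALID_SPLIT_MODES:
        raise ValueError(f"split_mode must be one of {VALID_SPLIT_MODES}")
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",                 # offload all layers to the GPUs
        "--split-mode", split_mode,   # how the model is divided across cards
        "--tensor-split", ",".join(str(x) for x in tensor_split),
        "-c", str(ctx),               # context length in tokens
    ]


if __name__ == "__main__":
    # Hypothetical model path, for illustration only.
    cmd = dual_gpu_command("models/mixtral-8x7b-q4_k_m.gguf")
    print(shlex.join(cmd))
```

Building the argument list in one place makes it easy to try both layer and row on the same model and keep whichever is faster on your slots.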

A full benchmark of split-mode configurations on dual 1080 Tis is in llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition.

For Ollama-specific dual GPU setup (which is more constrained), see Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti (Actual Benchmarks).

Specific Limitations You'll Run Into

1. No FP16 speedup

On a 4090, --flash-attn and FP16 KV cache dramatically speed up inference. On a 1080 Ti, FP16 brings no speedup: Pascal has no tensor cores, and GP102's native FP16 arithmetic runs far slower than FP32, so kernels compute in FP32. Flash attention still helps a bit (memory locality), but the dramatic 2-3× speedup RTX users report isn't there.

Practical implication: don't bother converting models to FP16 hoping for speed. On Pascal, FP16 only saves VRAM relative to FP32.

2. CUDA compatibility — Compute capability 6.1

Pascal is Compute Capability 6.1. Most modern libraries still support it but the ecosystem is moving on:

llama.cpp: full support
Ollama: works (uses llama.cpp underneath)
vLLM: officially Volta+ (CC 7.0+) — does not work on Pascal without unsupported patches
TGI (Text Generation Inference): support varies; modern releases need Ampere+
bitsandbytes 4-bit / 8-bit: requires CC 7.5+ for some kernels → many quantizations work, some don't
Flash Attention 2: requires Ampere+ → not available on 1080 Ti
Triton kernels: many require Volta+

In practice: stick to the llama.cpp / Ollama / KoboldCpp ecosystem. That's the path of least resistance for Pascal.
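The compatibility list above boils down to comparing compute capability tuples. The sketch below encodes the minimums as stated in that list (TGI's "Ampere+" is taken as 8.0, which is an assumption about current releases). On a real machine, PyTorch's torch.cuda.get_device_capability() returns a tuple in this form. It is not imported here so the example stays dependency-free.

```python
# Minimum compute capability per tool, taken from the list above.
# These are snapshots and may change between releases.
MIN_COMPUTE_CAPABILITY = {
    "vLLM": (7, 0),
    "TGI (modern releases)": (8, 0),       # "Ampere+" read as CC 8.0
    "bitsandbytes (some kernels)": (7, 5),
    "Flash Attention 2": (8, 0),
    "Triton kernels (many)": (7, 0),
}


def blocked_tools(compute_capability: tuple) -> list:
    """Tools whose stated minimum exceeds the given (major, minor) capability."""
    return sorted(
        name for name, minimum in MIN_COMPUTE_CAPABILITY.items()
        if compute_capability < minimum
    )


if __name__ == "__main__":
    pascal = (6, 1)  # GTX 1080 Ti
    for name in blocked_tools(pascal):
        print(f"not available on CC {pascal[0]}.{pascal[1]}: {name}")
```

Python compares tuples element by element, so (6, 1) < (7, 0) works without any custom version parsing.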

3. Power efficiency

A 1080 Ti pulls ~250 W under inference load. A 4060 Ti 16 GB pulls 165 W for similar throughput on small models. If electricity is $0.20/kWh, running a 1080 Ti 8 h/day at full draw costs about $12/month (0.25 kW × 8 h × 30 days × $0.20) versus about $8/month for a 4060 Ti. That gap adds up over a year.

If you're buying a new card today, the 4060 Ti 16 GB is the energy-efficient successor for hobby LLM work, not a used 1080 Ti.
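The electricity cost is simple to recompute for your own rate and usage. This sketch assumes the card draws its full TDP for every hour of use and a 30-day month. Both are simplifications, and real inference load is often lower.

```python
def monthly_cost_usd(watts: float, hours_per_day: float,
                     usd_per_kwh: float, days: int = 30) -> float:
    """Electricity cost per month, assuming constant draw at `watts`."""
    kwh = watts / 1000.0 * hours_per_day * days
    return kwh * usd_per_kwh


if __name__ == "__main__":
    rate = 0.20   # USD per kWh
    hours = 8     # hours of inference per day
    for name, watts in [("GTX 1080 Ti", 250), ("RTX 4060 Ti 16 GB", 165)]:
        cost = monthly_cost_usd(watts, hours, rate)
        print(f"{name}: ${cost:.2f}/month, ${cost * 12:.2f}/year")
```

Swap in your local rate and hours to see whether the difference is large enough to matter against the price of a new card.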

4. Driver stability with modern CUDA

The 1080 Ti depends on older NVIDIA driver branches (550-series and similar). New CUDA features ship for Ampere/Ada first, and a new llama.cpp release tuned for Hopper/Blackwell kernels can occasionally regress or break on Pascal. Pin your llama.cpp version once it works.

Quantization Strategy for 11 GB VRAM

The KV cache eats VRAM linearly with context length. Rough budget for a 1080 Ti:

Total VRAM:              11 GB
- Driver overhead:       -0.5 GB
- llama.cpp runtime:     -0.5 GB
- KV cache (8K ctx):     -1.5 GB
                         ------
= Available for weights: ~8.5 GB

That gives you these realistic options:

Use case	Model	Quant	Context
Best quality 8B	Llama 3.1 8B	Q8_0	4K
Balanced 8B	Llama 3.1 8B	Q5_K_M	8K
Long context 8B	Llama 3.1 8B	Q4_K_M	32K (tight)
Largest fit	Phi-4 14B / Qwen 3 14B	Q4_K_M	4K
Coding	DeepSeek-Coder 6.7B	Q5_K_M	16K

For most users, Llama 3.1 8B Q5_K_M @ 8K context is the right default — quality near Q8, room for a useful chat history, comfortable on the GPU.
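The budget above can be sketched in a few lines. The KV cache formula is the standard one for a transformer with grouped-query attention. The Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and the FP16 cache are assumptions taken from the published model config, not from measurements on this rig.

```python
GIB = 2 ** 30


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: keys and values (2x) for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def available_for_weights_gb(total_gb: float = 11.0, driver_gb: float = 0.5,
                             runtime_gb: float = 0.5, kv_gb: float = 1.5) -> float:
    """VRAM left for weights after the fixed overheads in the budget."""
    return total_gb - driver_gb - runtime_gb - kv_gb


if __name__ == "__main__":
    # Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
    for ctx in (4096, 8192, 32768):
        size = kv_cache_bytes(32, 8, 128, ctx) / GIB
        print(f"ctx {ctx:>6}: KV cache ~{size:.2f} GiB")
    print(f"left for weights: {available_for_weights_gb():.1f} GB")
```

Under these assumptions the cache grows by 128 KiB per token, so 8K context costs about 1 GiB and 32K about 4 GiB. The 1.5 GB line in the budget leaves some headroom for compute buffers, and models without grouped-query attention need considerably more per token.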

When to Retire the Card

A 1080 Ti is worth keeping for LLM work if:

You already own it (zero upgrade cost beats anything)
7-14B models meet your needs
Quality matters more than speed (Q8 on smaller models)
You can pair it with a second card for occasional 30B+ work

It's time to upgrade if:

You're regularly stuck waiting on 30B+ dense models
Your electricity is expensive enough that the 250 W TDP hurts
You need vLLM, TGI, or modern Triton kernels for batching
You're doing fine-tuning (Pascal can't do FP16 fine-tuning at modern speeds — practically requires Ampere+)

For a fresh build in 2026, the upgrade path is usually:

Tight budget, hobby use: RTX 4060 Ti 16 GB (~$500, 165 W, much faster inference at int4)
Mid budget: Used RTX 3090 24 GB (~$700-900, 350 W, runs 30B comfortably)
High budget: RTX 4090 24 GB or 5090 32 GB

The 1080 Ti is best understood as a card you keep using until a model you need won't run on it.

FAQ

Q: Can I fine-tune on a 1080 Ti? QLoRA on 7B models — yes, slowly. Full fine-tuning of any meaningful model — no, you'll OOM. Pascal's lack of tensor cores also makes training-style FP16 throughput modest.

Q: Does Flash Attention 2 work on 1080 Ti? No. FA2 requires Ampere (CC 8.0+). The original FlashAttention (FA1) targets Turing and newer, so it is not a practical option on Pascal either. llama.cpp implements its own efficient attention that does benefit Pascal.

Q: Does vLLM work on 1080 Ti? Officially no: vLLM requires CC 7.0+ (Volta or newer). Community patches for Pascal are scarce and break with vLLM updates. Stick to the llama.cpp ecosystem.

Q: Better to buy a used 3090 or two used 1080 Tis? A single 3090 (24 GB, Ampere, tensor cores, ~$700-900) beats two 1080 Tis (22 GB combined, no tensor cores, PCIe bottleneck, ~$300-400) for almost every workload. Two 1080 Tis only win if you already have one and adding a second is much cheaper than upgrading.

Q: What about Ollama specifically? Ollama works fine on a single 1080 Ti. Multi-GPU is less mature than raw llama.cpp — see the dedicated Ollama Dual 1080 Ti post for setup details.

Q: Are old GPUs like the 1080 Ti at risk of dying? GDDR5X cards from 2017 are showing increased failure rates in mining-recovered units. If you're buying used, prefer non-mining cards (check thermal pads, fan condition). Burn-in test: run a llama.cpp benchmark for 6-12 hours and watch temps + stability.

Q: How does this compare to running Ollama on Apple Silicon (M2/M3/M4)? Apple's unified memory is brutally efficient for inference. A Mac with 64 GB or more of unified memory can run Llama 3.1 70B Q4 (~40 GB of weights), where a 1080 Ti can't even load it; a 32 GB machine tops out at roughly 30B-class models at Q4. For pure inference workflows, Apple Silicon is now the surprising value champion. The 1080 Ti's edge is being already-paid-for.

Closing — One Sentence

If you already have a GTX 1080 Ti, in 2026 it still runs 7-14B Llama/Qwen/Phi class models at usable speeds; if you're shopping, buy a used 3090 or a new 4060 Ti 16 GB instead.

References:

NVIDIA GP102 (Pascal) whitepaper, 2017
llama.cpp project: https://github.com/ggerganov/llama.cpp
Ollama: https://ollama.ai
LocalLLaMA community benchmarks (r/LocalLLaMA, 2024-2026)