
Running Modern LLMs on GTX 1080 Ti in 2026 — What Still Works, What OOMs

A 2026 reality check for the GTX 1080 Ti: 11 GB VRAM, Pascal architecture, no FP16 tensor cores. Which modern LLMs (Llama 3.1, Qwen 3, Phi-4, Gemma 3) still load and run usefully, what hits OOM, real tokens/sec numbers from a 1080 Ti, and when it's time to retire the card.



"The 1080 Ti is dead for AI" — Not Quite

Every other LocalLLaMA post in 2026 says the GTX 1080 Ti (released 2017, Pascal architecture, 11 GB GDDR5X) is finished as an LLM card. That's partly true and partly the people who already bought RTX 4090s rationalizing the upgrade.

The truth is more useful. With the right quantization and the right model, a 1080 Ti still runs useful 7-14B parameter models at usable speeds in 2026. It can't touch a 4090 on throughput per dollar of electricity, but if you already own one (or two), it's not the paperweight Twitter wants you to believe it is.

This guide is based on actually running modern 2026 models — Llama 3.1, Qwen 3, Phi-4, Gemma 3 — on a 1080 Ti rig (2 cards in a server). It covers what loads at usable speed, what OOMs immediately, the quantization sweet spot, and the specific limitations of Pascal that don't show up in benchmarks of newer hardware.

The 1080 Ti — Specs That Matter for LLMs in 2026

| Spec | GTX 1080 Ti | For LLMs this means |
| --- | --- | --- |
| Architecture | Pascal (GP102) | No FP16 tensor cores (came in Volta+) |
| VRAM | 11 GB GDDR5X | Models above ~14B at Q4 don't fit |
| Memory bandwidth | 484 GB/s | Decent: above the RTX 3060 (360 GB/s) |
| FP32 TFLOPS | 11.3 | Fine for inference |
| FP16 throughput | No faster than FP32 | No FP16 speedup (unlike RTX) |
| INT8 throughput | Same as FP32 | No INT8 speedup |
| Power | 250 W TDP | A real cost at 2026 electricity prices |
| CUDA support | Up to CUDA 12.x | Modern drivers still work |

The killer detail: Pascal doesn't have tensor cores. RTX cards (Turing onwards) have specialized matmul units that make FP16/INT8/INT4 inference much faster than FP32. On a 1080 Ti, lower precision buys no extra compute throughput; it only changes how much memory the model takes. So quantization mainly saves VRAM, and what speed it does add comes from moving fewer bytes per token, not from faster math as on a 3090 or 4090.

What Works — Models That Run Usefully (Tested, 2026)

Models tested with llama.cpp / Ollama, GGUF format, 2K context, single 1080 Ti:

| Model | Quant | VRAM Used | Tokens/sec | Verdict |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B | Q4_K_M | 5.0 GB | 22-28 | ✅ Comfortable, fast |
| Llama 3.1 8B | Q8_0 | 8.7 GB | 18-22 | ✅ Best quality on 11 GB |
| Qwen 3 8B | Q4_K_M | 5.1 GB | 20-26 | ✅ Solid Korean + reasoning |
| Qwen 3 14B | Q4_K_M | 8.7 GB | 11-14 | ✅ Tight, but works |
| Gemma 3 12B | Q4_K_M | 7.5 GB | 14-18 | ✅ Long context fine if KV smaller |
| Phi-4 14B | Q4_K_M | 8.9 GB | 10-13 | ✅ Best reasoning per VRAM |
| Mistral 7B Instruct | Q4_K_M | 4.5 GB | 28-34 | ✅ Snappy older standby |
| Llama 3.1 8B | Q5_K_M | 6.1 GB | 20-25 | ✅ Quality sweet spot |
| DeepSeek-Coder 6.7B | Q4_K_M | 4.5 GB | 26-32 | ✅ Coding tasks |

(Numbers are warm-cache, batch 1, prompt 256 tokens, generation 256 tokens. Your mileage will vary ±20% by driver/CPU/RAM speed.)
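To reproduce a run with the same shape (256 prompt tokens, 256 generated tokens, all layers on the GPU), something like the following sketch could build the command for llama.cpp's `llama-bench` tool. The model path and the repetition count are placeholders chosen for illustration, not the settings used for the table above.

```python
import shlex

def llama_bench_cmd(model_path, n_prompt=256, n_gen=256, n_gpu_layers=99, repetitions=3):
    """Build an argv list for llama.cpp's llama-bench.

    -p / -n set prompt and generation lengths, -ngl offloads layers to the GPU
    (99 is a common "everything" value), -r is the number of repetitions.
    """
    return [
        "llama-bench",
        "-m", model_path,
        "-p", str(n_prompt),
        "-n", str(n_gen),
        "-ngl", str(n_gpu_layers),
        "-r", str(repetitions),
    ]

# Hypothetical model path; point it at your own GGUF file.
cmd = llama_bench_cmd("models/llama-3.1-8b-instruct-Q4_K_M.gguf")
print(shlex.join(cmd))
```

Pass the list to `subprocess.run(cmd)` to execute it, and throw away the first run if you want warm-cache numbers.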

The honest comparison

  • Llama 3.1 8B Q4_K_M on 1080 Ti: ~25 tokens/sec
  • Llama 3.1 8B Q4_K_M on RTX 3090: ~95 tokens/sec
  • Llama 3.1 8B Q4_K_M on RTX 4090: ~140 tokens/sec

The 1080 Ti is roughly 1/4 the speed of a 3090 for the same model. Still readable for a single-user chat workflow.

What Does NOT Work — Models That Just OOM or Crawl

| Model | Quant | Result |
| --- | --- | --- |
| Llama 3.1 70B | Q4_K_M | ❌ OOM (needs ~40 GB) |
| Llama 3.1 70B | Q2_K | ❌ OOM (~26 GB) |
| Qwen 3 30B-A3B (MoE) | Q4_K_M | ⚠️ Loads with CPU offload: 1-3 tokens/sec, painful |
| Mixtral 8×7B | Q4_K_M | ⚠️ Loads with offload: 2-4 tokens/sec |
| DeepSeek-Coder 33B | Q4_K_M | ❌ OOM |
| Llama 3 70B | IQ2_XXS | ⚠️ Loads but 1-2 tokens/sec, quality degraded |

Pattern: anything that needs more than ~10 GB of VRAM once the KV cache is included won't run usefully on a single 1080 Ti. The 30B-A3B MoE models technically load with CPU offload but the throughput is unusable.
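A rough way to predict which side of that line a model lands on is to estimate the weight size from parameter count and bits per weight. The sketch below does that. The bits-per-weight figures are approximate averages assumed for illustration (real GGUF files vary because some tensors stay at higher precision), and the KV and overhead defaults are round numbers, not measurements.

```python
# Approximate average bits per weight for common GGUF quant types.
# Assumed values for illustration; actual files differ slightly per model.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5}

def weights_gb(params_billion, quant):
    """Estimated weight size in GB for a model with the given parameter count."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

def fits(params_billion, quant, kv_gb=0.5, overhead_gb=1.0, vram_gb=11.0):
    """True if weights + KV cache + driver/runtime overhead fit in VRAM."""
    return weights_gb(params_billion, quant) + kv_gb + overhead_gb <= vram_gb

for name, params, quant in [
    ("8B", 8, "Q8_0"),
    ("14B", 14, "Q4_K_M"),
    ("33B", 33, "Q4_K_M"),
    ("70B", 70, "Q4_K_M"),
]:
    verdict = "fits" if fits(params, quant) else "OOM"
    print(f"{name} {quant}: ~{weights_gb(params, quant):.1f} GB weights, {verdict}")
```

Nominal parameter counts are used here; the true counts for "14B" models are a little higher, which is part of why those are tight in practice.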

This is where the "1080 Ti is dead" narrative comes from. If you've decided you must run 30B+ dense models, the 1080 Ti can't help. But for 7-14B — most of what hobbyists actually need — it's fine.

Two 1080 Tis — Stretching to ~22 GB Usable

If you have two 1080 Tis (and PCIe slots for both), splitting a model across both cards with llama.cpp's --split-mode opens up:

| Model | Quant | VRAM Used (combined) | Tokens/sec | Notes |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B | Q8_0 | 9 GB | 25-30 | Single card faster (no PCIe bottleneck) |
| Llama 3.1 13B | Q8_0 | 14 GB | 12-16 | Two cards needed |
| Qwen 3 14B | Q8_0 | 15 GB | 11-15 | Single card Q4 is faster than dual Q8 |
| Mixtral 8×7B | Q4_K_M | 21 GB | 14-18 | Dual is the only practical option here |
| Yi-34B | Q4_K_M | 20 GB | 7-10 | Slow but functional |

PCIe is the bottleneck. Two 1080 Tis on x16/x16 slots are faster than on x16/x4. The 1080 Ti has no NVLink, so every layer split costs latency over PCIe.
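As a starting point for a dual-card launch, the sketch below assembles a `llama-server` command with an even split across two GPUs. The model path, context size and the 1:1 split ratio are illustrative assumptions; which split mode is faster on a given board is something to measure rather than take from this example.

```python
def dual_gpu_server_cmd(model_path, split_mode="layer", tensor_split=(1, 1), ctx=4096):
    """Build an argv list for llama.cpp's llama-server across two GPUs.

    --split-mode picks how the model is divided (none, layer or row);
    --tensor-split gives the proportion of the model placed on each GPU.
    """
    if split_mode not in ("none", "layer", "row"):
        raise ValueError(f"unknown split mode: {split_mode}")
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",  # offload all layers
        "--split-mode", split_mode,
        "--tensor-split", ",".join(str(share) for share in tensor_split),
        "-c", str(ctx),
    ]

# Hypothetical model path.
print(" ".join(dual_gpu_server_cmd("models/mixtral-8x7b-Q4_K_M.gguf")))
```

Swapping `split_mode="row"` into the same call gives the second configuration to compare against.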

A full benchmark of split-mode configurations on dual 1080 Tis is in llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition.

For Ollama-specific dual GPU setup (which is more constrained), see Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti (Actual Benchmarks).

Specific Limitations You'll Run Into

1. No FP16 speedup

On a 4090, --flash-attn and FP16 KV cache dramatically speed up inference. On a 1080 Ti, FP16 brings no speedup over FP32 because Pascal has no tensor cores. Flash attention still helps a bit (memory locality), but the dramatic 2-3× speedup RTX users report isn't there.

Practical implication: don't bother converting models to FP16 hoping for speed. It only saves VRAM at equal throughput.

2. CUDA compatibility — Compute capability 6.1

Pascal is Compute Capability 6.1. Most modern libraries still support it but the ecosystem is moving on:

  • llama.cpp: full support
  • Ollama: works (uses llama.cpp underneath)
  • vLLM: officially Volta+ (CC 7.0+) — does not work on Pascal without unsupported patches
  • TGI (Text Generation Inference): support varies; modern releases need Ampere+
  • bitsandbytes 4-bit / 8-bit: requires CC 7.5+ for some kernels → many quantizations work, some don't
  • Flash Attention 2: requires Ampere+ → not available on 1080 Ti
  • Triton kernels: many require Volta+

In practice: stick to llama.cpp / Ollama / KoboldCpp ecosystem. That's the path of least resistance for Pascal.
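The compatibility list above boils down to comparing the card's compute capability with each tool's minimum. A minimal sketch of that check follows. The minimums are the ones quoted in this section, and the `compute_cap` query field is assumed to be available in your `nvidia-smi` (it is on reasonably recent drivers).

```python
import subprocess

# Minimum compute capability per tool, as quoted in the list above.
MIN_CC = {"vLLM": (7, 0), "Flash Attention 2": (8, 0)}

def parse_cc(text):
    """Turn a string like '6.1' into a (major, minor) tuple."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)

def unsupported(cc):
    """Names of tools whose minimum compute capability exceeds cc."""
    return sorted(name for name, need in MIN_CC.items() if cc < need)

def query_cc(index=0):
    """Ask nvidia-smi for the compute capability of one GPU."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(index),
         "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_cc(out.splitlines()[0])

# Pascal (GTX 1080 Ti) is compute capability 6.1.
print(unsupported(parse_cc("6.1")))
```

On a machine with the card installed, `unsupported(query_cc())` gives the same answer without hard-coding the version.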

3. Power efficiency

A 1080 Ti pulls ~250 W under inference load. A 4060 Ti 16 GB pulls 165 W for similar throughput on small models. If electricity is $0.20/kWh, running a 1080 Ti 8 h/day (about 60 kWh over a 30-day month) costs ~$12/month versus ~$8/month for a 4060 Ti, and that adds up over a year.
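A monthly cost estimate like this is just power × hours × price, so it is easy to redo with your own tariff. The sketch assumes the card draws its full rated power for every hour of use and a 30-day month, which makes it an upper bound for bursty chat workloads.

```python
def monthly_cost_usd(watts, hours_per_day, usd_per_kwh, days=30):
    """Electricity cost of running a constant load for one month."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * usd_per_kwh

for name, watts in [("GTX 1080 Ti", 250), ("RTX 4060 Ti 16 GB", 165)]:
    cost = monthly_cost_usd(watts, hours_per_day=8, usd_per_kwh=0.20)
    print(f"{name}: ${cost:.2f}/month")
```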

If you're buying a new card today, the 4060 Ti 16 GB is the energy-efficient successor for hobby LLM work, not a used 1080 Ti.

4. Driver stability with modern CUDA

The 1080 Ti is on legacy NVIDIA drivers (550-series and similar). New CUDA features ship for Ampere/Ada first. Occasionally a new llama.cpp release with Hopper/Blackwell kernels will quietly drop Pascal support. Pin your llama.cpp version once it works.

Quantization Strategy for 11 GB VRAM

The KV cache eats VRAM linearly with context length. Rough budget for a 1080 Ti:

Total VRAM:           11 GB
- Driver overhead:    -0.5 GB
- llama.cpp runtime:  -0.5 GB
- KV cache (8K ctx):  -1.5 GB
                      ------
= Available for weights: ~8.5 GB
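The KV cache line in this budget can be estimated from the model's shape. The sketch below uses the usual formula (keys plus values, per layer, per KV head) with an FP16 cache. The Llama 3.1 8B dimensions (32 layers, 8 KV heads, head size 128) are taken from the published model config and should be treated as an assumption if you swap in another model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Size of the KV cache: 2 tensors (K and V) per layer, per cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads (GQA), head_dim 128.
for ctx in (4096, 8192, 32768):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx:>6} tokens: {gib:.2f} GiB")
```

For this shape the estimate at 8K context comes in under the 1.5 GB budgeted above, so that line carries some headroom. Models without grouped-query attention have more KV heads and grow much faster with context.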

That gives you these realistic options:

| Use case | Model | Quant | Context |
| --- | --- | --- | --- |
| Best quality 8B | Llama 3.1 8B | Q8_0 | 4K |
| Balanced 8B | Llama 3.1 8B | Q5_K_M | 8K |
| Long context 8B | Llama 3.1 8B | Q4_K_M | 32K (tight) |
| Largest fit | Phi-4 14B / Qwen 3 14B | Q4_K_M | 4K |
| Coding | DeepSeek-Coder 6.7B | Q5_K_M | 16K |

For most users, Llama 3.1 8B Q5_K_M @ 8K context is the right default — quality near Q8, room for a useful chat history, comfortable on the GPU.
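With Ollama, the context size is set per request through the `num_ctx` option. The sketch below builds such a request against the local REST API using only the standard library. The model tag is an assumption about how the Q5_K_M build is named in your local library, so check `ollama list` and adjust it.

```python
import json
import urllib.request

def build_generate_request(prompt, model="llama3.1:8b-instruct-q5_K_M",
                           num_ctx=8192, host="http://localhost:11434"):
    """Prepare a POST to Ollama's /api/generate with an explicit context size."""
    payload = {
        "model": model,          # assumed tag; use whatever `ollama list` shows
        "prompt": prompt,
        "stream": False,         # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, **kwargs):
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize the tradeoffs of Q5_K_M.")` requires the Ollama server to be running; `build_generate_request` on its own has no side effects.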

When to Retire the Card

A 1080 Ti is worth keeping for LLM work if:

  • You already own it (zero upgrade cost beats anything)
  • 7-14B models meet your needs
  • Quality matters more than speed (Q8 on smaller models)
  • You're paired with a second card for occasional 30B+ work

It's time to upgrade if:

  • You're regularly stuck waiting on 30B+ dense models
  • Your electricity is expensive enough that the 250 W TDP hurts
  • You need vLLM, TGI, or modern Triton kernels for batching
  • You're doing fine-tuning (Pascal can't do FP16 fine-tuning at modern speeds — practically requires Ampere+)

For a fresh build in 2026, the upgrade path is usually:

  1. Tight budget, hobby use: RTX 4060 Ti 16 GB (~$500, 165 W, much faster inference at int4)
  2. Mid budget: Used RTX 3090 24 GB (~$700-900, 350 W, runs 30B comfortably)
  3. High budget: RTX 4090 24 GB or 5090 32 GB

The 1080 Ti is best understood as a card you keep using until a model you need won't run on it.

FAQ

Q: Can I fine-tune on a 1080 Ti? QLoRA on 7B models — yes, slowly. Full fine-tuning of any meaningful model — no, you'll OOM. Pascal's lack of tensor cores also makes training-style FP16 throughput modest.

Q: Does Flash Attention 2 work on 1080 Ti? No. FA2 requires Ampere (CC 8.0+), and FA1 targets Turing and newer, so neither is an option on Pascal. llama.cpp implements its own efficient attention that does benefit Pascal.

Q: Does vLLM work on 1080 Ti? Officially no: vLLM requires CC 7.0+ (Volta or newer). Community patches for Pascal are scarce and break with vLLM updates. Stick to the llama.cpp ecosystem.

Q: Better to buy a used 3090 or two used 1080 Tis? A single 3090 (24 GB, Ampere, tensor cores, ~$700-900) beats two 1080 Tis (22 GB combined, no tensor cores, PCIe bottleneck, ~$300-400) for almost every workload. Two 1080 Tis only win if you already have one and adding a second is much cheaper than upgrading.

Q: What about Ollama specifically? Ollama works fine on a single 1080 Ti. Multi-GPU is less mature than raw llama.cpp — see the dedicated Ollama Dual 1080 Ti post for setup details.

Q: Are old GPUs like the 1080 Ti at risk of dying? Mining-recovered GDDR5X cards from 2017 are showing increased failure rates. If you're buying used, prefer non-mining cards (check thermal pads, fan condition). Burn-in test: run a llama.cpp benchmark for 6-12 hours and watch temperatures and stability.
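For a long stress run like that, polling `nvidia-smi` is one simple way to keep an eye on the card. The sketch below reads temperature, power draw and utilization, and flags samples above a threshold. The 84 °C limit is an assumed value for illustration; set it to whatever you consider safe for your cooler.

```python
import subprocess

QUERY = "temperature.gpu,power.draw,utilization.gpu"

def parse_sample(line):
    """Parse one CSV line from nvidia-smi (noheader, nounits)."""
    temp, power, util = (field.strip() for field in line.split(","))
    return {"temp_c": int(temp), "power_w": float(power), "util_pct": int(util)}

def read_sample(index=0):
    """Query one GPU once; requires nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(index),
         f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])

def too_hot(sample, limit_c=84):
    """True if the sample exceeds the (assumed) temperature limit."""
    return sample["temp_c"] > limit_c

# Made-up sample line, only to show the parsed shape.
print(parse_sample("78, 231.45, 99"))
```

Calling `read_sample()` in a loop with `time.sleep(30)` and logging any `too_hot` hits is enough to spot throttling or a failing fan over a multi-hour run.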

Q: How does this compare to running Ollama on Apple Silicon (M2/M3/M4)? Apple's unified memory is brutally efficient for inference. A Mac with 64 GB of unified memory (an M2 Max, for example) can hold Llama 3.1 70B Q4, which needs ~40 GB, and run it at usable speeds where a 1080 Ti can't even load it. For pure inference workflows, Apple Silicon is now the surprising value champion. The 1080 Ti's edge is being already-paid-for.

Closing — One Sentence

If you already have a GTX 1080 Ti, in 2026 it still runs 7-14B Llama/Qwen/Phi class models at usable speeds; if you're shopping, buy a used 3090 or a new 4060 Ti 16 GB instead.

