4× GTX 1080 Ti for Local LLM in 2026 — 44GB Combined VRAM Build Guide + Real Benchmarks
Practical build guide for running four GTX 1080 Tis in a single rig: 44 GB combined VRAM for less than the price of one used RTX 3090 (cards only). Covers PCIe slot configurations on HEDT and Threadripper boards, 1500W+ PSU sizing, cooling (1000W of heat to dissipate), llama.cpp tensor-split setup, expected throughput on 70B Llama, Mixtral 8×7B, and Qwen3.6-35B-A3B, plus the honest cases where this is not the right choice.
The Math That Makes This Build Tempting
In 2026, the used GPU market has odd shapes. RTX 3090s hold value (~₩1.1M / $800), 4090s stay expensive, and Pascal-era cards keep falling. A used GTX 1080 Ti goes for about ₩200K-250K ($150-180) in Korea. Multiply by four:
| Hardware | Approx Cost (2026) | Combined VRAM |
|---|---|---|
| 4× used GTX 1080 Ti | ~₩900K + frame + PSU | 44 GB |
| 1× used RTX 3090 | ₩1.1-1.3M | 24 GB |
| 2× used RTX 3090 (no NVLink bridge) | ₩2.3M | 48 GB |
| 1× new RTX 4090 | ~₩2.5M | 24 GB |
| 1× new RTX 5090 | ~₩3.5-4.5M | 32 GB |
The 4× 1080 Ti is the cheapest way to put >40 GB of VRAM on a single Linux box. That's enough to fit:
- Llama 3.1 70B Q4_K_M (~42 GB) — just barely
- Mixtral 8×7B Q5_K_M (~32 GB) — with room
- Qwen3.6-35B-A3B Q4_K_M (~21 GB) — with massive context budget
- 2-3 medium models concurrently for multi-model serving
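A rough per-card check shows why 70B fits "just barely" while the others have room. This is a minimal sketch that assumes the weights divide evenly across four 11 GB cards and ignores KV cache and CUDA context overhead, so real headroom is smaller. The function name is illustrative; the model sizes are the ones listed above.

```python
# Rough per-card VRAM check for an even 4-way layer split.
# Assumption: weights divide evenly; KV cache and CUDA context are NOT counted.

CARD_VRAM_GB = 11.0   # GTX 1080 Ti
NUM_CARDS = 4

def per_card_headroom(model_gb, cards=NUM_CARDS, card_vram_gb=CARD_VRAM_GB):
    """Return (GB of weights per card, GB left per card)."""
    share = model_gb / cards
    return share, card_vram_gb - share

models = {
    "Llama 3.1 70B Q4_K_M": 42.0,
    "Mixtral 8x7B Q5_K_M": 32.0,
    "Qwen3.6-35B-A3B Q4_K_M": 21.0,
}

for name, size_gb in models.items():
    share, left = per_card_headroom(size_gb)
    print(f"{name}: {share:.2f} GB/card, {left:.2f} GB free per card")
```

By this arithmetic the 70B model leaves about 0.5 GB per card before any context is allocated, which is why context length has to stay short on that model.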
This guide is the practical end-to-end build: hardware, software, and the limits you'll hit. It builds on the single-card baseline in Running Modern LLMs on GTX 1080 Ti in 2026 and extends it to four cards.
Why This Is Genuinely Hard
Before the parts list, the honest constraints:
- No NVLink — Pascal consumer cards never supported it. All inter-GPU traffic crosses PCIe lanes
- PCIe lane allocation — most consumer boards can't give four GPUs adequate lanes
- 1000W of GPU power — needs serious PSU + cooling + ventilation
- Physical fit — four dual-slot cards = 8 PCIe slots' worth of physical space
- Software ecosystem aging: vLLM needs Volta or newer (compute capability 7.0+) and FlashAttention-2 needs Ampere or newer; Pascal is compute capability 6.1
If you read those and shrug, continue. If any seem alarming, the single RTX 3090 path is almost certainly better.
Hardware — What Actually Works
Motherboard / Platform (the critical choice)
Consumer mainstream boards (Z690, B650 etc.) max out at 2 GPUs with usable lanes. For four cards you need one of:
Option A — Used HEDT (best price/value)
- X299 + Intel Core X-Series (i9-7900X+, i9-10900X): 28-48 PCIe 3.0 lanes from the CPU depending on model (44 on the i9-7900X, 48 on the i9-10900X), supports x16/x16/x8/x8 or x16/x8/x8/x8 depending on board. Used X299 boards ~₩200K, CPU ~₩150-250K.
- Specific recommendations: ASUS ROG Strix X299-E Gaming II, MSI X299 Pro 10G
Option B — Threadripper Pro (best lanes, expensive)
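- WRX80 boards with Threadripper Pro: 128 PCIe 4.0 lanes → x16/x16/x16/x16 trivial. TRX40 is the non-Pro Threadripper platform and has fewer CPU lanes, so check the board's slot wiring before buying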
- Used: ~₩800K-1.5M for board + CPU combo
- Massive overkill but the cleanest 4-GPU PCIe topology
Option C — Old server hardware (cheapest, hardest)
- Dual Xeon E5-2680v4 + Supermicro X10DRi board: 80 PCIe 3.0 lanes
- Used pricing: ~₩300-500K for board + 2 CPUs
- Requires server case / open-air, deal with IPMI BIOS quirks
Option D — Mining rig with risers (avoid for LLM)
- Cheap motherboards (B250 etc.) with 6 PCIe slots via x1 risers
- Don't do this for LLM inference: x1 risers are 1/16 the bandwidth → multi-GPU inference crawls
For most builders, used X299 is the sweet spot. The PCIe 3.0 generation matches Pascal cards perfectly (1080 Ti is PCIe 3.0 x16).
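To put the riser warning and the x8/x16 slot options in numbers, the sketch below computes theoretical one-way PCIe 3.0 bandwidth per link width and the time to push one card's share of a 70B Q4_K_M model (10.5 GB, assuming an even four-way split). These are theoretical ceilings; real transfers are slower, and the shard size is an illustrative assumption.

```python
# Theoretical PCIe 3.0 bandwidth per link width (one direction).
# 8 GT/s per lane with 128b/130b encoding; real-world throughput is lower.

GT_PER_S = 8.0
ENCODING = 128 / 130

def pcie3_gb_per_s(lanes):
    """Theoretical one-way bandwidth in GB/s for a PCIe 3.0 link."""
    return GT_PER_S * ENCODING / 8 * lanes  # /8 converts bits to bytes

SHARD_GB = 10.5  # assumed: 42 GB of weights split evenly over four cards

for lanes in (1, 8, 16):
    bw = pcie3_gb_per_s(lanes)
    print(f"x{lanes:<2}: {bw:5.2f} GB/s, {SHARD_GB} GB shard in {SHARD_GB / bw:5.1f} s")
```

An x8 slot gives up half the bandwidth of x16 but still moves a full shard in a second or two, while an x1 riser takes roughly sixteen times longer than x16.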
Power Supply
Each 1080 Ti pulls up to 250W under sustained inference load (250W TDP, less in practice but plan for max). System overhead: CPU 100-200W, motherboard + storage + fans 50-100W.
Total budget for 4-GPU under load:
4 × 250W (GPUs) = 1000W
CPU = ~150W
Other = ~100W
------
Sustained load = ~1250W
Peak transient = ~1400W (during model loads)
PSU sizing: 1500W minimum, 1600W recommended.
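The same budget can be expressed as a load percentage for each candidate PSU rating. This is a small sketch using the wattage figures from the budget above; the function name and the choice of ratings to compare are illustrative.

```python
# Sustained load as a percentage of PSU rating, using the budget above.

def psu_load_percent(psu_watts, gpu_watts=250, gpus=4, cpu_watts=150, other_watts=100):
    """Return sustained system draw as a percentage of the PSU's rated output."""
    sustained = gpu_watts * gpus + cpu_watts + other_watts  # 1250 W with defaults
    return 100 * sustained / psu_watts

for rating in (1200, 1500, 1600):
    print(f"{rating} W PSU: {psu_load_percent(rating):.1f}% sustained load")
```

A 1200W unit is over 100% before any transient spike, 1500W sits a little above 80%, and 1600W leaves the most margin.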
Options:
- Corsair AX1600i (~₩500-600K used / ₩900K new): 80 Plus Titanium, fully modular, proven
- EVGA SuperNOVA 1600 P+ (~₩500K used): great if you find one
- Server PSU adapters (mining-style): cheap (~₩100-200K total for 2× 1000W server PSUs + breakout boards), but the DIY wiring is a risk. Acceptable for hobby builds, not for "set and forget"
A 1200W PSU will trip when all 4 GPUs are doing peak load simultaneously. Don't risk it.
Cooling
1000W of GPU heat in a confined space is the realistic challenge. Three approaches:
Open-air mining frame (recommended for 4-GPU)
- Aluminum frame, ~₩30K
- GPUs sit horizontal with ~3-4cm spacing
- 3-4 case fans blowing across
- Room ventilation: open window or HVAC essential
- Noise: loud (multiple GPU fans at high RPM)
Large E-ATX case with eight expansion slots
- Some EATX cases (Phanteks Enthoo Pro, Corsair Air 540) fit four dual-slot cards
- Better aesthetics, similar thermal performance with good front intake fans
- Cost: ₩200-300K
Avoid: standard mid-tower cases. Four dual-slot GPUs packed into one cook themselves: the top cards will hit 90°C+ and throttle.
Ambient temperature: a closed room with 4× 1080 Ti at full load rises 5-8°C per hour. AC or constant ventilation is non-negotiable for sustained inference.
Other components
- Storage: 1TB NVMe minimum. Models are large (Llama 70B Q4 = 42 GB). 2TB recommended if keeping multiple quants.
- RAM: 64-128GB. llama.cpp's CPU offload uses system RAM heavily during model loads.
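- CPU: for compute, any modern 8+ core CPU is enough (a used i9-9900K or Ryzen 9 5900X would be fast enough), but those mainstream chips lack the PCIe lanes for four GPUs, so in practice the CPU follows from the platform chosen above.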
Total Build Cost Estimate (mid-range, used parts where sensible)
| Component | Approx ₩ | Notes |
|---|---|---|
| 4× GTX 1080 Ti (used) | 900K | Non-mining where possible |
| X299 motherboard (used) | 200K | ASUS ROG Strix X299-E Gaming II |
| Intel i9-10900X (used) | 200K | 10-core, 48 PCIe lanes |
| 64GB DDR4-3200 | 150K | 4× 16GB |
| 1TB NVMe SSD | 100K | Samsung 980 Pro |
| Corsair AX1600i PSU (used) | 500K | Or 2× server PSU = 200K |
| Mining frame + 4 case fans | 50K | |
| Subtotal | ₩2.1M | |
Compare to:
- 1× new RTX 4090 + decent system: ₩3.5M (24 GB VRAM)
- 1× new RTX 5090 + decent system: ₩4.5M (32 GB VRAM)
- 1× used RTX 3090 + decent system: ₩2.0M (24 GB VRAM)
4× 1080 Ti is competitive with single RTX 3090 system pricing but gives ~2× the VRAM. That's the whole pitch.
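The pitch is easier to judge as cost per gigabyte of VRAM. The sketch below uses the whole-system prices quoted above; the dictionary layout and function name are illustrative.

```python
# Whole-system cost per GB of VRAM, using the estimates from this section.

builds = {
    "4x GTX 1080 Ti (used)": (2_100_000, 44),
    "1x RTX 3090 (used)":    (2_000_000, 24),
    "1x RTX 4090 (new)":     (3_500_000, 24),
    "1x RTX 5090 (new)":     (4_500_000, 32),
}

def won_per_gb(cost_won, vram_gb):
    """System cost divided by total VRAM."""
    return cost_won / vram_gb

for name, (cost, vram) in builds.items():
    print(f"{name}: ~{won_per_gb(cost, vram) / 1000:.0f}K won per GB")
```

This measures capacity only. It says nothing about speed, which is where the single-card options win.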
Software Setup
Driver and CUDA
# Ubuntu 22.04 / 24.04 LTS recommended
sudo apt install nvidia-driver-555 nvidia-cuda-toolkit # Any branch up to 580 works; NVIDIA's later branches drop Pascal
sudo reboot
nvidia-smi # Verify all 4 cards visible
You should see four GPUs at PCIe 3.0 x16 or x8 (depending on board). If any show as x4 or x1, recheck slot allocation in BIOS.
Verify PCIe configuration
nvidia-smi --query-gpu=index,name,memory.total,pcie.link.gen.current,pcie.link.width.current --format=csv
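Target: all 4 GPUs at width 8 or 16, and gen 3 while under load (idle cards drop the link to gen 1 to save power, which is normal). If you see x4 or x1, or gen 1 under load, you have a riser or BIOS issue.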
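On a headless box it can be handy to script this check. The sketch below parses `nvidia-smi` CSV output (produced with `--format=csv,noheader`) and reports any GPU whose link width is below x8. The sample text is hypothetical and only illustrates the shape of the output; the function name is illustrative, and it looks only at link width.

```python
import csv
import io

def find_weak_links(csv_text, min_width=8):
    """Return indices of GPUs whose current PCIe link width is below min_width."""
    weak = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if not row:
            continue
        index, width = row[0], row[-1]   # first column: index, last: link width
        if int(width) < min_width:
            weak.append(int(index))
    return weak

# Hypothetical sample in the shape produced by:
#   nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current \
#              --format=csv,noheader
sample = """0, NVIDIA GeForce GTX 1080 Ti, 3, 16
1, NVIDIA GeForce GTX 1080 Ti, 3, 8
2, NVIDIA GeForce GTX 1080 Ti, 3, 8
3, NVIDIA GeForce GTX 1080 Ti, 1, 4
"""

print(find_weak_links(sample))  # prints [3]
```

In real use the text would come from running the command, for example via `subprocess.run(..., capture_output=True, text=True).stdout`.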
Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61 # 61 = Pascal (GTX 1080 Ti)
cmake --build build --config Release -j 16 # binaries land in build/bin/
Or use the prebuilt binaries from the releases page.
First multi-GPU test
# Download Mixtral 8×7B Q5_K_M (~32 GB), fits comfortably
huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf --local-dir .
# Run with explicit 4-way split
./build/bin/llama-cli -m mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf \
--threads 16 \
--ctx-size 8192 \
-ngl 99 \
--split-mode layer \
--tensor-split 11,11,11,11 \
--flash-attn on \
-p "Hello, give me a brief introduction to yourself."
In another terminal, watch all 4 cards engage:
watch -n 0.5 'nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv'
You should see ~8 GB on each card and ~30-60% utilization. If one card is at 0%, your split or -ngl flag didn't take effect.
Ollama for friendlier multi-GPU
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Configure for 4-GPU spread
sudo systemctl edit ollama
Add:
[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=24h"
Restart: sudo systemctl daemon-reload && sudo systemctl restart ollama
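The `OLLAMA_KV_CACHE_TYPE=q8_0` line matters because the KV cache competes with the weights for the last few gigabytes. As a rough sketch, the estimate below uses the published Llama 3.1 70B architecture (80 layers, 8 KV heads, head dimension 128) and the q8_0 block layout (32 values stored in 34 bytes). It covers a single sequence and ignores compute buffers, so treat the result as a lower bound.

```python
# KV-cache size estimate for Llama 3.1 70B, one sequence, no compute buffers.

LAYERS = 80
KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128

# Bytes per stored element: f16 is 2; q8_0 packs 32 values into 34 bytes.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}

def kv_cache_gib(ctx_tokens, cache_type="f16"):
    """Approximate KV-cache size in GiB for one sequence."""
    elems_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM  # keys and values
    total_bytes = elems_per_token * ctx_tokens * BYTES_PER_ELEM[cache_type]
    return total_bytes / 2**30

for cache_type in ("f16", "q8_0"):
    print(f"8192 ctx, {cache_type}: {kv_cache_gib(8192, cache_type):.2f} GiB")
```

At 8192 tokens this comes to 2.5 GiB in f16 and about 1.33 GiB in q8_0, which is a meaningful difference when a 42 GB model sits on 44 GB of VRAM.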
Then:
ollama pull llama3.1:70b-instruct-q4_K_M
ollama run llama3.1:70b-instruct-q4_K_M "Explain attention mechanism in transformers."
Detailed Ollama multi-GPU tuning (with focus on dual setup): Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti. The principles extend to 4-GPU; just set 4 weights in tensor-split.
Expected Throughput — Real-World Estimates
Numbers below are estimates extrapolated from 2-GPU measurements + community 4-GPU benchmarks. Single-card baseline used as anchor.
Llama 3.1 70B Q4_K_M (~42 GB)
| Configuration | tokens/sec | Notes |
|---|---|---|
| Single 1080 Ti | OOM (only 11 GB) | Won't load |
| 2× 1080 Ti (Ollama spread) | OOM | Still ~22 GB combined < 42 |
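| 4× 1080 Ti (layer split) | 8-12 | Just fits; cards work one after another, so single-card memory bandwidth (~484 GB/s) sets the ceiling |
| Single RTX 3090 (~2-bit quant) | 18-22 | Q3_K_M (~34 GB) does not fit in 24 GB; needs a ~2-bit quant or CPU offload |
| 2× RTX 3090, no NVLink bridge | 16-22 (Q4) | The 3090 supports NVLink, but this row assumes PCIe only; ~2× the per-card memory bandwidth helps |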
The 4× 1080 Ti running Llama 70B is the cheapest way to run a full 70B at Q4 as of 2026. Throughput is modest but functional for interactive use (8-12 t/s ≈ 480-720 tokens/min, fine for chat).
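One rough way to sanity-check the 70B estimates is a memory-bandwidth ceiling. The sketch assumes a dense model where every generated token reads all weights once, and that layer split keeps only one card active at a time, so the effective bandwidth is that of a single card. The bandwidth figures are the published specs (484 GB/s for the GTX 1080 Ti, 936 GB/s for the RTX 3090). Real throughput lands below this ceiling.

```python
# Upper bound on decode speed for a dense model under layer split.
# Assumption: each token reads all weights once; one GPU is active at a time.

def max_tokens_per_s(model_gb, mem_bw_gb_s):
    """Bandwidth-bound ceiling: bytes per second divided by bytes per token."""
    return mem_bw_gb_s / model_gb

MODEL_GB = 42.0  # Llama 3.1 70B Q4_K_M

for card, bw in (("GTX 1080 Ti", 484.0), ("RTX 3090", 936.0)):
    print(f"{card}: at most ~{max_tokens_per_s(MODEL_GB, bw):.1f} tokens/s")
```

The ceilings come out near 11.5 and 22.3 tokens/s, close to the upper ends of the table's ranges for the 1080 Ti and dual-3090 rows.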
Mixtral 8×7B Q5_K_M (~32 GB)
| Configuration | tokens/sec | Notes |
|---|---|---|
| Single 1080 Ti | OOM | |
| 2× 1080 Ti | OOM at Q5 and Q4 | Q4_K_M is ~27 GB; only ~3-bit quants fit in 22 GB |
| 4× 1080 Ti (layer split) | 18-25 | Comfortable fit, headroom for ctx |
| Single RTX 3090 (Q3_K_M, ~20 GB) | 50-65 | Q4_K_M (~27 GB) does not fit in 24 GB; RTX 3090 wins single-card MoE |
For Mixtral, a single 3090 is faster. The 4× 1080 Ti's edge is enabling a higher quant (Q5 vs Q3) and more context.
Qwen3.6-35B-A3B Q4_K_M (~21 GB)
| Configuration | tokens/sec | Notes |
|---|---|---|
| Single 1080 Ti | OOM | |
| 2× 1080 Ti | 25-35 | Fits, but ~21 GB of weights in 22 GB leaves little room for KV cache |
| 4× 1080 Ti | 30-40 | Diminishing returns; PCIe overhead grows |
| Single RTX 3090 | 50-65 | Fastest, no split overhead |
For Qwen3.6 specifically, a single 3090 outperforms 4× 1080 Ti by roughly 1.5-2×. Use 4× 1080 Ti only for models the 3090 can't fit. See Running Qwen3.6-35B-A3B on RTX 3090.
Multi-model concurrent serving
44 GB enables 2-3 medium models simultaneously:
44 GB total
- Llama 3.1 8B Q8_0 (~9 GB) → leaves 35 GB
- DeepSeek-Coder 14B Q4_K_M (~8.5 GB) → leaves 26.5 GB
- Qwen3.6-35B-A3B Q3_K_M (~14 GB) → leaves 12.5 GB (tight but viable for KV cache)
With Ollama's OLLAMA_MAX_LOADED_MODELS=3 and OLLAMA_KEEP_ALIVE=24h, you can host three specialized models always-loaded. This is the genuine niche for the 4-card build: a personal multi-model server for varied workloads.
See Ollama OLLAMA_KEEP_ALIVE — Model Memory Persistence Deep Dive for multi-model scheduling.
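The running budget above can be reproduced with a few lines, which makes it easy to try other model mixes. The `reserve_gb` parameter is a hypothetical per-model allowance for KV cache and buffers, not an Ollama setting, and the 2 GB value used below is an assumption for illustration.

```python
# Running VRAM budget as models are loaded one after another.

def remaining_vram(total_gb, models, reserve_gb=0.0):
    """Return a list of (name, GB left) after loading each model in order.

    reserve_gb is a hypothetical per-model allowance for KV cache and buffers.
    """
    left = total_gb
    result = []
    for name, size_gb in models:
        left -= size_gb + reserve_gb
        result.append((name, left))
    return result

loaded = [
    ("Llama 3.1 8B Q8_0", 9.0),
    ("DeepSeek-Coder 14B Q4_K_M", 8.5),
    ("Qwen3.6-35B-A3B Q3_K_M", 14.0),
]

for name, left in remaining_vram(44.0, loaded):
    print(f"{name}: {left} GB left (weights only)")
for name, left in remaining_vram(44.0, loaded, reserve_gb=2.0):
    print(f"{name}: {left} GB left (with 2 GB reserve each)")
```

With a 2 GB reserve per model the final headroom drops from 12.5 GB to 6.5 GB, which shows how quickly "tight but viable" can turn into an eviction.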
PCIe Lane Reality — The Hidden Bottleneck
On X299 with i9-10900X (48 lanes), typical 4-GPU allocation:
Slot 1: PCIe 3.0 x16 (full bandwidth)
Slot 2: PCIe 3.0 x8
Slot 3: PCIe 3.0 x8
Slot 4: PCIe 3.0 x8
Layer-split inference does NOT need every GPU at x16. Activations cross PCIe only where one card hands off to the next, each hand-off is a few kilobytes per generated token, and llama.cpp's pipeline keeps GPUs busy with their assigned layers between transfers.
But for row split (which all-reduces activations every layer), the lower-bandwidth GPUs become bottlenecks. Always use --split-mode layer on PCIe-only multi-GPU.
For row vs layer details: llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition.
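To see how small the layer-split traffic is, the sketch below sizes one hand-off between cards for Llama 3.1 70B (hidden dimension 8192) and times it on a PCIe 3.0 x8 link. It assumes fp16 activations (fp32 would double the payload) and counts bandwidth only; per-transfer latency adds on top.

```python
# Size and bandwidth-only transfer time of one GPU-to-GPU hand-off per token.

HIDDEN_SIZE = 8192              # Llama 3.1 70B hidden dimension
BYTES_PER_VALUE = 2             # assumed fp16 activations
PCIE3_X8_BYTES_PER_S = 7.88e9   # theoretical one-way PCIe 3.0 x8

def boundary_transfer(hidden=HIDDEN_SIZE, bytes_per_value=BYTES_PER_VALUE,
                      link_bytes_per_s=PCIE3_X8_BYTES_PER_S):
    """Return (payload in bytes, transfer time in microseconds)."""
    payload = hidden * bytes_per_value
    return payload, payload / link_bytes_per_s * 1e6

payload, micros = boundary_transfer()
print(f"{payload} bytes per hand-off, ~{micros:.1f} us on PCIe 3.0 x8")
```

Four cards mean three hand-offs per token, so the bandwidth cost is a few microseconds against a token time of roughly 100 ms at 10 tokens/s.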
When 4× 1080 Ti Is the Right Build
Genuinely yes if:
- You already own 2+ 1080 Tis and adding 2 more costs ~₩500K
- You need 40+ GB VRAM at minimum cost
- You have HEDT board / spare PCIe lanes / Threadripper Pro
- You're doing research / hobby, not production (electricity matters less)
- You want to host 2-3 medium models simultaneously
- Llama 70B occasional use justifies the build
When 4× 1080 Ti Is the Wrong Build
Honestly, more cases than people admit:
- You're optimizing for tokens/sec: single RTX 3090 (₩1.1M used) beats 4× 1080 Ti on every model that fits in 24 GB
- 24/7 operation matters: 1000W × 24h × 365d × ₩200/kWh = ~₩1.75M/year in electricity at full load, versus ~₩600K/year for a 350W RTX 3090. The ~₩1.1M/year difference pays back the 3090 in about a year.
- Production deployment: vLLM (Volta+) and other modern serving frameworks do not support Pascal. You're stuck with llama.cpp / Ollama.
- Fine-tuning: Pascal can do QLoRA on smaller models slowly. Full fine-tuning on 70B requires modern hardware.
- Noise / room constraints: 1000W in a closed home office is brutal. Cards screaming at 100% fan ≈ hairdryer noise level.
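The electricity figures in the list above are worth recomputing for your own duty cycle, since few hobby rigs run at full load around the clock. The sketch uses the ₩200/kWh rate from the list; the function name and the 4-hours-a-day scenario are illustrative assumptions.

```python
# Annual electricity cost for a given sustained draw and duty cycle.

def annual_cost_won(watts, won_per_kwh=200, hours_per_day=24, days=365):
    """Cost in won per year; watts is the sustained draw while running."""
    return watts * hours_per_day * days * won_per_kwh / 1000

full_load_4x = annual_cost_won(1000)
full_load_3090 = annual_cost_won(350)
print(f"4x 1080 Ti, 24/7: {full_load_4x:,.0f} won/year")
print(f"RTX 3090,  24/7: {full_load_3090:,.0f} won/year")
print(f"Difference:      {full_load_4x - full_load_3090:,.0f} won/year")
print(f"4x 1080 Ti, 4 h/day: {annual_cost_won(1000, hours_per_day=4):,.0f} won/year")
```

At four hours a day the four-card rig costs about a sixth of the 24/7 figure, which changes the payback argument considerably.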
For most people in 2026 starting fresh: a single used RTX 3090 ($800-$900) is a saner choice than a 4× 1080 Ti rig. The 4-card build is for people specifically targeting Llama 70B+ at minimum capital cost, accepting the operational tradeoffs.
FAQ
Q: Can I mix 1080 Ti with another GPU (e.g., 3 × 1080 Ti + 1 × 3090)?
Yes for llama.cpp / Ollama. Use --tensor-split weights to allocate by VRAM (e.g., --tensor-split 11,11,11,24 if 4th card is 3090). The mixed setup is bandwidth-limited by the slowest card, but more VRAM is more VRAM.
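To get a feel for what proportional weights mean in practice, the sketch below distributes 80 layers (the Llama 3.1 70B layer count) across cards in proportion to the weights. This is an illustration of proportional allocation using a largest-remainder rule, not llama.cpp's actual assignment code, so exact per-card counts may differ from what the runtime reports.

```python
# Illustration only: proportional layer allocation with largest-remainder rounding.
# This is NOT llama.cpp's internal algorithm.

def layers_per_gpu(n_layers, weights):
    """Split n_layers across GPUs in proportion to weights."""
    total = sum(weights)
    exact = [n_layers * w / total for w in weights]
    counts = [int(x) for x in exact]
    leftover = n_layers - sum(counts)
    # Hand remaining layers to the GPUs with the largest fractional parts.
    order = sorted(range(len(weights)),
                   key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

print(layers_per_gpu(80, [11, 11, 11, 11]))  # four 1080 Tis
print(layers_per_gpu(80, [11, 11, 11, 24]))  # three 1080 Tis plus one 3090
```

In the mixed case the 3090 ends up holding roughly twice as many layers as each 1080 Ti, which matches its share of total VRAM.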
Q: Do mining-recovered 1080 Tis work for this?
Often yes, with caveats. Mining stresses VRAM modules and PCIe contacts. Test extensively: run gpu-burn for 6+ hours, watch for memory errors. Re-pad thermal pads if cards run hot. Prefer non-mining secondhand if available; mining cards typically cost ₩50K less for a reason.
Q: Why not just use cloud APIs at this cost?
Comparison: ₩2.1M one-time + ~₩150K/month electricity vs Claude API at ~₩200-300/month for hobby usage. For pure cost on small workloads, cloud wins. The 4× 1080 Ti makes sense if (a) you value privacy/local data, (b) you have heavy token volume (>5-10M/month), (c) you're learning multi-GPU systems engineering as a goal in itself.
Q: Can I run two simultaneous large models (Llama 70B + Mixtral)?
Llama 70B Q4 (42 GB) + Mixtral 8×7B Q4 (27 GB) = 69 GB. Doesn't fit in 44 GB. You'd need Q3 or IQ-quant variants. With OLLAMA_MAX_LOADED_MODELS=2 and aggressive quantization, technically yes; quality-wise marginal.
Q: How much power does a 4× 1080 Ti rig draw at idle?
Per-card idle: ~30-50W, so 120-200W for four cards. With CPU and the rest of the system, total idle draw is ~250-350W. Not great; a single modern 4090 idles at ~15W.
Q: Can I sell my 4× 1080 Ti build in 2 years?
1080 Ti residual value in 2028 is probably ₩100K each. Half of today's value. Plan for that.
Q: ROCm / AMD alternative — 4× MI50 32GB instead?
Used MI50 32GB cards exist (~₩300K each = ₩1.2M for 4 = 128GB VRAM!) but ROCm support for inference is bumpy compared to CUDA. llama.cpp supports ROCm but with more rough edges than CUDA. Worth considering if you're patient and want extreme VRAM cheaply.
Q: Why not Tesla P40 (24GB Pascal) instead?
P40 24GB at ~₩400K each = ₩1.6M for two = 48 GB combined. Fewer PCIe lanes needed (only 2 cards). Catch: the P40 has no display output (compute-only) and passive cooling (needs a custom fan shroud), and the used market is competitive. 2× P40 is the "more sophisticated" Pascal route; 4× 1080 Ti is the "scrappier" route.
Q: How long until this build is obsolete?
Already partially. Pascal lacks tensor cores (and GP102 runs FP16 far slower than FP32), FlashAttention-2 (Ampere+), and vLLM support. llama.cpp will likely maintain Pascal support through ~2027-2028 then taper. For 2-3 year hobby use, fine. For long-term investment, no.
Q: I have a Threadripper Pro WRX80 — should I just use four 3090s?
If budget allows, yes — 4× 3090 = 96 GB combined VRAM with vastly better per-card performance. ~₩4-5M total. But you're at "small-scale prosumer LLM lab" cost. 4× 1080 Ti is the entry-level version of that.
Closing — The One-Sentence Verdict
If you already own 2-3 GTX 1080 Tis and an HEDT motherboard, rounding up to four cards for ~₩400-500K total marginal cost gives you 44 GB of VRAM that runs Llama 70B at 8-12 t/s — the cheapest path to local 70B inference in 2026. If you're starting from zero, buy a used RTX 3090 instead unless you specifically need >24 GB combined for less than $1,500.
Related posts:
- Running Modern LLMs on GTX 1080 Ti in 2026 — What Still Works
- Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti
- llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition
- Running Qwen3.6-35B-A3B on RTX 3090 24GB — Real Use Cases for the 3B-Active MoE
- GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M
- Ollama OLLAMA_KEEP_ALIVE — Model Memory Persistence Deep Dive
- Home AI Server Build Guide 2026: RTX 4090 vs 3090 vs 5090
- Best Ollama Models for RTX 3090 24GB in 2026
- LLM VRAM Calculator