4× GTX 1080 Ti for Local LLM in 2026 — 44GB Combined VRAM Build Guide + Real Benchmarks

Quick Answer (TL;DR)

Can you run Llama 3.1 70B on four GTX 1080 Tis? Yes, at ~8-12 tokens/sec with Q4_K_M quantization. Total build cost ~₩2.1M ($1,500) gives 44 GB combined VRAM — the cheapest path to local 70B inference in 2026.

Key requirements:

Motherboard: HEDT (X299, Threadripper TRX40, Threadripper Pro WRX80) or server (Xeon, EPYC) for adequate PCIe lanes; consumer boards (Z690 etc.) max out at 2 GPUs
PSU: 1500W minimum (1000W for GPUs at sustained load + system + transient peaks)
Cooling: open-air mining frame or EATX case with strong airflow (1000W of GPU heat in confined space throttles standard mid-tower cases)
Software: llama.cpp / Ollama with --split-mode layer --tensor-split 11,11,11,11; do NOT use row split on PCIe-only Pascal (no NVLink); vLLM is incompatible with Pascal

This build is right when: you already own 2+ GTX 1080 Tis, you need 40+ GB VRAM at minimum cost, and you can tolerate 1000W power draw + noise. It is wrong when: you're starting fresh (used RTX 3090 at ~₩1.1M beats it for single-card workflows), you need vLLM (Volta+) or Flash Attention 2 (Ampere+), or electricity cost is a 24/7 concern.

Definition

Multi-GPU local LLM inference on Pascal uses llama.cpp's tensor-split layer partitioning to spread a single model across multiple PCIe-connected GPUs without NVLink. Each GPU runs a subset of model layers; inter-GPU communication is limited to small hidden-state vectors per layer boundary (low bandwidth requirement). With four GTX 1080 Tis, total addressable VRAM is 44 GB — enough to run Llama 3.1 70B at Q4_K_M (~42 GB weights) or three medium models concurrently for multi-model serving.

The Math That Makes This Build Tempting

In 2026, the used GPU market has odd shapes. RTX 3090s hold value (~₩1.1M / $800), 4090s stay expensive, and Pascal-era cards keep falling. A used GTX 1080 Ti goes for about ₩200K-250K ($150-180) in Korea. Multiply by four:

Hardware	Approx Cost (2026)	Combined VRAM
4× used GTX 1080 Ti	~₩900K + frame + PSU	44 GB
1× used RTX 3090	₩1.1-1.3M	24 GB
2× used RTX 3090 (no NVLink bridge)	₩2.3M	48 GB
1× new RTX 4090	~₩2.5M	24 GB
1× new RTX 5090	~₩3.5-4.5M	32 GB

The 4× 1080 Ti is the cheapest way to put >40 GB of VRAM on a single Linux box. That's enough to fit:

Llama 3.1 70B Q4_K_M (~42 GB) — just barely
Mixtral 8×7B Q5_K_M (~32 GB) — with room
Qwen3.6-35B-A3B Q4_K_M (~21 GB) — with massive context budget
2-3 medium models concurrently for multi-model serving
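How much context fits next to the weights depends on the KV cache. Here is a rough sketch of that estimate, assuming Llama 3.1 70B's published shape (80 layers, 8 KV heads, head dimension 128) and an unquantized fp16 cache. The function name `kv_cache_mib` is illustrative and not part of llama.cpp.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CTX_TOKENS BYTES_PER_ELEM
# Estimates KV cache size in MiB: one K and one V vector per layer per token.
kv_cache_mib() {
  layers=$1; kv_heads=$2; head_dim=$3; ctx=$4; bytes=$5
  echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1048576 ))
}

# Assumed Llama 3.1 70B shape, 8192-token context, fp16 cache (2 bytes/element)
kv_cache_mib 80 8 128 8192 2
```

At 8192 tokens that comes to about 2.5 GiB on top of the weights, spread across the cards. A q8_0 cache, like the OLLAMA_KV_CACHE_TYPE=q8_0 setting used later in this guide, roughly halves it.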

This guide is the practical end-to-end build — hardware, software, and the limits you'll hit. Built on the Running Modern LLMs on GTX 1080 Ti in 2026 baseline, extended to four cards.

Why This Is Genuinely Hard

Before the parts list, the honest constraints:

No NVLink — Pascal consumer cards never supported it. All inter-GPU traffic crosses PCIe lanes
PCIe lane allocation — most consumer boards can't give four GPUs adequate lanes
1000W of GPU power — needs serious PSU + cooling + ventilation
Physical fit — four dual-slot cards = 8 PCIe slots' worth of physical space
Software ecosystem aging — vLLM requires Volta or newer and Flash Attention 2 requires Ampere or newer, so neither runs on Pascal

If you read those and shrug, continue. If any seem alarming, the single RTX 3090 path is almost certainly better.

Hardware — What Actually Works

Motherboard / Platform (the critical choice)

Consumer mainstream boards (Z690, B650 etc.) max out at 2 GPUs with usable lanes. For four cards you need one of:

Option A — Used HEDT (best price/value)

X299 + Intel Core X-Series (i9-7900X or better, i9-10900X): 44-48 PCIe 3.0 lanes from the CPU (skip the 28-lane i7-7800X/7820X), supports x16/x16/x8/x8 or x16/x8/x8/x8 depending on board and CPU. Used X299 boards ~₩200K, CPU ~₩150-250K.
Specific recommendations: ASUS ROG Strix X299-E Gaming II, MSI X299 Pro 10G

Option B — Threadripper Pro (best lanes, expensive)

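WRX80 boards with Threadripper Pro: 128 PCIe 4.0 lanes → x16/x16/x16/x16 trivial. Non-Pro Threadripper on TRX40 offers about 64 lanes, still enough for four cards at x16/x16/x8/x8 or better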
Used: ~₩800K-1.5M for board + CPU combo
Massive overkill but the cleanest 4-GPU PCIe topology

Option C — Old server hardware (cheapest, hardest)

Dual Xeon E5-2680v4 + SuperMicro X10DRi board: 80 PCIe 3.0 lanes
Used pricing: ~₩300-500K for board + 2 CPUs
Requires a server case or open-air frame, plus patience with IPMI and BIOS quirks

Option D — Mining rig with risers (avoid for LLM)

Cheap motherboards (B250 etc.) with 6 PCIe slots via x1 risers
Don't do this for LLM inference: x1 risers are 1/16 the bandwidth → multi-GPU inference crawls

For most builders, used X299 is the sweet spot. The PCIe 3.0 generation matches Pascal cards perfectly (1080 Ti is PCIe 3.0 x16).
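To put numbers on the lane widths discussed above, here is a small sketch that converts a PCIe 3.0 lane count into approximate usable bandwidth. It assumes the standard 8 GT/s per lane with 128b/130b encoding and ignores protocol overhead; `pcie3_mb_per_s` is a made-up helper name.

```shell
# pcie3_mb_per_s LANES
# PCIe 3.0 runs 8 GT/s per lane with 128b/130b encoding, about 984.6 MB/s
# of usable bandwidth per lane in each direction.
pcie3_mb_per_s() {
  awk -v lanes="$1" 'BEGIN { printf "%.0f\n", lanes * 984.6 }'
}

for lanes in 1 8 16; do
  echo "x$lanes: $(pcie3_mb_per_s "$lanes") MB/s"
done
```

An x1 riser lands near 1 GB/s while x8 is close to 8 GB/s, which is the gap behind the warning about mining-style risers.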

Power Supply

Each 1080 Ti pulls up to 250W under sustained inference load (250W TDP, less in practice but plan for max). System overhead: CPU 100-200W, motherboard + storage + fans 50-100W.

Total budget for 4-GPU under load:

4 × 250W (GPUs)  = 1000W
CPU              = ~150W
Other            = ~100W
                  ------
Sustained load   = ~1250W
Peak transient   = ~1400W (during model loads)

PSU sizing: 1500W minimum, 1600W recommended.
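The budget above is easy to recompute for a different card count or CPU. A minimal sketch using the same figures; the helper names `sustained_watts` and `psu_load_pct` are illustrative only.

```shell
# sustained_watts GPU_COUNT GPU_WATTS CPU_WATTS OTHER_WATTS
sustained_watts() {
  echo $(( $1 * $2 + $3 + $4 ))
}

# psu_load_pct LOAD_WATTS PSU_WATTS -> load as a whole-number percentage
psu_load_pct() {
  echo $(( $1 * 100 / $2 ))
}

load=$(sustained_watts 4 250 150 100)   # figures from the budget above
echo "sustained: ${load}W"
echo "load on a 1500W unit: $(psu_load_pct "$load" 1500)%"
echo "load on a 1600W unit: $(psu_load_pct "$load" 1600)%"
```

With these inputs a 1500W unit sits at roughly 83% sustained load, and the ~1400W transient peak pushes it past 90%, which is why 1600W is the more comfortable choice.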

Options:

Corsair AX1600i (~₩500-600K used / ₩900K new): 80 Plus Titanium efficiency, fully modular, proven
EVGA SuperNOVA 1600 P+ (~₩500K used): great if you find one
Server PSU adapters (mining-style): cheap (~₩100K total for 2× 1000W server PSUs + breakout boards), but DIY wiring concerns. Acceptable for hobby builds, not for "set and forget"

A 1200W PSU will trip when all 4 GPUs are doing peak load simultaneously. Don't risk it.

Cooling

1000W of GPU heat in a confined space is the realistic challenge. Three approaches:

Open-air mining frame (recommended for 4-GPU)

Aluminum frame, ~₩30K
GPUs sit horizontal with ~3-4cm spacing
3-4 case fans blowing across
Room ventilation: open window or HVAC essential
Noise: loud (multiple GPU fans at high RPM)

Server case with 4-slot spacing

Some EATX cases (Phanteks Enthoo Pro, Corsair Air 540) fit four dual-slot cards
Better aesthetics, similar thermal performance with good front intake fans
Cost: ₩200-300K

Avoid: standard mid-tower cases. Four dual-slot GPUs packed together cook themselves: the top cards hit 90°C+ and throttle.

Ambient temperature: a closed room with 4× 1080 Ti at full load rises 5-8°C per hour. AC or constant ventilation is non-negotiable for sustained inference.

Other components

Storage: 1TB NVMe minimum. Models are large (Llama 70B Q4 = 42 GB). 2TB recommended if keeping multiple quants.
RAM: 64-128GB. llama.cpp's CPU offload uses system RAM heavily during model loads.
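CPU: any modern 8+ core CPU is fast enough for inference, but it has to sit on a platform with enough PCIe lanes (see the motherboard options above). Desktop chips such as a used i9-9900K or Ryzen 9 5900X are fine for compute, yet they expose only 16-24 CPU lanes, which limits them to two cards at usable widths.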

Total Build Cost Estimate (mid-range, used parts where sensible)

Component	Approx ₩	Notes
4× GTX 1080 Ti (used)	900K	Non-mining where possible
X299 motherboard (used)	200K	ASUS ROG Strix X299-E Gaming II
Intel i9-10900X (used)	200K	10-core, 48 PCIe lanes
64GB DDR4-3200	150K	4× 16GB
1TB NVMe SSD	100K	Samsung 980 Pro
Corsair AX1600i PSU (used)	500K	Or 2× server PSU = 200K
Mining frame + 4 case fans	50K
Subtotal	₩2.1M

Compare to:

1× new RTX 4090 + decent system: ₩3.5M (24 GB VRAM)
1× new RTX 5090 + decent system: ₩4.5M (32 GB VRAM)
1× used RTX 3090 + decent system: ₩2.0M (24 GB VRAM)

4× 1080 Ti is competitive with single RTX 3090 system pricing but gives ~2× the VRAM. That's the whole pitch.
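The pitch is easiest to see as cost per GB of VRAM. A quick sketch using the system totals quoted above (in thousands of won); `won_per_gb` is a hypothetical helper.

```shell
# won_per_gb BUILD_COST_KRW_THOUSANDS VRAM_GB -> thousands of won per GB of VRAM
won_per_gb() {
  awk -v cost="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", cost / gb }'
}

echo "4x 1080 Ti build:  ₩$(won_per_gb 2100 44)K per GB"
echo "1x RTX 3090 build: ₩$(won_per_gb 2000 24)K per GB"
```

By this measure the 4-card build costs a bit under ₩48K per GB against roughly ₩83K per GB for the single-3090 system. It says nothing about speed, which the benchmark section covers.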

Software Setup

Driver and CUDA

# Ubuntu 22.04 / 24.04 LTS recommended
sudo apt install nvidia-driver-555 nvidia-cuda-toolkit  # Or a newer branch up to 580, the last one NVIDIA announced with Pascal support
sudo reboot
nvidia-smi   # Verify all 4 cards visible

You should see four GPUs at PCIe 3.0 x16 or x8 (depending on board). If any show as x4 or x1, recheck slot allocation in BIOS.

Verify PCIe configuration

nvidia-smi --query-gpu=index,name,memory.total,pcie.link.gen.current,pcie.link.width.current --format=csv

Target: all 4 GPUs at gen 3, width 8 or 16. If you see gen 1 x4, you have a riser or BIOS issue.
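Checking four rows by eye gets old quickly, so here is a small filter that flags any card below the gen 3 x8 target. It is a sketch that assumes the `csv,noheader` output format of nvidia-smi; the function name `check_links` and the sample rows are made up for illustration.

```shell
# check_links: reads "index, gen, width" rows (nvidia-smi csv,noheader output)
# on stdin and prints one line per GPU whose link is below gen 3 x8.
check_links() {
  awk -F', *' '
    $2 < 3 || $3 < 8 { printf "GPU %s: gen %s x%s (below target)\n", $1, $2, $3; bad = 1 }
    END { if (!bad) print "all links at gen 3 x8 or better" }
  '
}

# Sample rows standing in for real output; on the rig, pipe in:
#   nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv,noheader
printf '0, 3, 16\n1, 3, 8\n2, 3, 8\n3, 1, 4\n' | check_links
```

Many cards drop to a lower link generation at idle to save power, so take the reading while a model is loaded or generating before blaming a riser.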

Install llama.cpp

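git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Upstream replaced the Makefile build with CMake; 61 is the Pascal (GP102) compute capability
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build --config Release -j 16
# Binaries land in build/bin/ (e.g. build/bin/llama-cli); adjust the ./llama-cli path below accordingly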

Or use the prebuilt binaries from the releases page.

First multi-GPU test

# Download Mixtral 8×7B Q5_K_M (~32 GB) — fits comfortably
huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf --local-dir .

# Run with explicit 4-way split
./llama-cli -m mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf \
  --threads 16 \
  --ctx-size 8192 \
  -ngl 99 \
  --split-mode layer \
  --tensor-split 11,11,11,11 \
  --flash-attn \
  -p "Hello, give me a brief introduction to yourself."

In another terminal, watch all 4 cards engage:

watch -n 0.5 'nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv'

You should see ~8 GB on each card and ~30-60% utilization. If one card is at 0%, your split or -ngl flag didn't take effect.
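The same check can be scripted. This sketch lists any card that is holding almost no VRAM, which usually means it received no layers. It assumes `csv,noheader,nounits` output from nvidia-smi; `idle_gpus` and the 1024 MiB threshold are arbitrary choices for illustration.

```shell
# idle_gpus MIN_MIB: reads "index, memory.used, utilization.gpu" rows
# (csv,noheader,nounits) on stdin; prints the index of every GPU holding
# less than MIN_MIB of VRAM, i.e. cards that received no layers.
idle_gpus() {
  awk -F', *' -v min="$1" '$2 < min { print $1 }'
}

# Sample rows; on the rig, pipe in:
#   nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv,noheader,nounits
printf '0, 8120, 45\n1, 8090, 38\n2, 8101, 41\n3, 12, 0\n' | idle_gpus 1024
```

Empty output means every card took a share of the model.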

Ollama for friendlier multi-GPU

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Configure for 4-GPU spread
sudo systemctl edit ollama

Add:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KEEP_ALIVE=24h"

Restart: sudo systemctl daemon-reload && sudo systemctl restart ollama

Then:

ollama pull llama3.1:70b-instruct-q4_K_M
ollama run llama3.1:70b-instruct-q4_K_M "Explain attention mechanism in transformers."

Detailed Ollama multi-GPU tuning (with focus on dual setup): Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti. The principles extend to 4-GPU; just set 4 weights in tensor-split.

Expected Throughput — Real-World Estimates

Numbers below are estimates extrapolated from 2-GPU measurements + community 4-GPU benchmarks. Single-card baseline used as anchor.

Llama 3.1 70B Q4_K_M (~42 GB)

Configuration	tokens/sec	Notes
Single 1080 Ti	OOM (only 11 GB)	Won't load
2× 1080 Ti (Ollama spread)	OOM	Still ~22 GB combined < 42
4× 1080 Ti (layer split)	8-12	Just fits, PCIe bottleneck dominates
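Single RTX 3090 (Q3_K_M)	18-22	Lower quant required; a 70B Q3_K_M file is still larger than 24 GB, so expect partial CPU offload or a ~2-bit quant for a full fit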
2× RTX 3090 (no NVLink bridge)	16-22 (Q4)	The 3090 does support NVLink, but this row assumes plain PCIe; ~2× memory bandwidth per card helps

The 4× 1080 Ti running Llama 70B is the cheapest way to run a full 70B at Q4 as of 2026. Throughput is modest but functional for interactive use (8-12 t/s = ~500-700 tokens/min, fine for chat).

Mixtral 8×7B Q5_K_M (~32 GB)

Configuration	tokens/sec	Notes
Single 1080 Ti	OOM
2× 1080 Ti	OOM at Q5 and at Q4_K_M (~27 GB); Q3_K_M fits
4× 1080 Ti (layer split)	18-25	Comfortable fit, headroom for ctx
Single RTX 3090 (Q4_K_M, ~27 GB)	50-65	RTX 3090 wins single-card MoE, though Q4_K_M needs a few layers offloaded to fit in 24 GB

For Mixtral, single 3090 is better. 4× 1080 Ti's edge is enabling higher quant (Q5 vs Q4) and more context.

Qwen3.6-35B-A3B Q4_K_M (~21 GB)

Configuration	tokens/sec	Notes
Single 1080 Ti	OOM
2× 1080 Ti	25-35	Fits comfortably
4× 1080 Ti	30-40	Diminishing returns; PCIe overhead grows
Single RTX 3090	50-65	Fastest, no split overhead

For Qwen3.6 specifically, single 3090 outperforms 4× 1080 Ti by ~2×. Use 4× 1080 Ti only for models the 3090 can't fit. See Running Qwen3.6-35B-A3B on RTX 3090.

Multi-model concurrent serving

44 GB enables 2-3 medium models simultaneously:

44 GB total
- Llama 3.1 8B Q8_0 (~9 GB)  → leaves 35 GB
- DeepSeek-Coder 14B Q4_K_M (~8.5 GB) → leaves 26.5 GB
- Qwen3.6-35B-A3B Q3_K_M (~14 GB) → leaves 12.5 GB
                                     ↑
                                  Tight but viable for KV cache

With Ollama's OLLAMA_MAX_LOADED_MODELS=3 and OLLAMA_KEEP_ALIVE=24h, you can host three specialized models always-loaded. This is the genuine niche for the 4-card build: a personal multi-model server for varied workloads.
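The running subtraction above is worth scripting when you swap models in and out. A minimal sketch with sizes in GB; `vram_left` is a hypothetical helper, and the sizes are the ones from the breakdown above.

```shell
# vram_left TOTAL_GB MODEL_GB... -> GB remaining after loading every model
vram_left() {
  total=$1; shift
  printf '%s\n' "$@" | awk -v left="$total" '{ left -= $1 } END { printf "%.1f\n", left }'
}

# Sizes from the breakdown above: 8B Q8_0, 14B Q4_K_M, 35B-A3B Q3_K_M
vram_left 44 9 8.5 14
```

A negative result means the set does not fit. Keep in mind the remainder is spread over four cards, so a single large allocation can still fail even when the total looks fine.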

See Ollama OLLAMA_KEEP_ALIVE — Model Memory Persistence Deep Dive for multi-model scheduling.

PCIe Lane Reality — The Hidden Bottleneck

On X299 with i9-10900X (48 lanes), typical 4-GPU allocation:

Slot 1: PCIe 3.0 x16 (full bandwidth)
Slot 2: PCIe 3.0 x8
Slot 3: PCIe 3.0 x8
Slot 4: PCIe 3.0 x8

Layer-split inference does NOT need all-GPUs-at-x16. The traffic per layer is small (~kilobytes), and llama.cpp's pipeline keeps GPUs busy with their assigned layers between PCIe transfers.

But for row split (which all-reduces activations every layer), the lower-bandwidth GPUs become bottlenecks. Always use --split-mode layer on PCIe-only multi-GPU.
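To see why layer split is so light on PCIe, here is a back-of-envelope sketch of the bytes crossing GPU boundaries per generated token. It assumes Llama 3.1 70B's hidden size of 8192 and 4-byte (fp32) activations; the activation width is an assumption, and `boundary_bytes_per_token` is an illustrative name.

```shell
# boundary_bytes_per_token HIDDEN_SIZE BYTES_PER_ELEM GPU_COUNT
# With layer split, one hidden-state vector crosses each GPU-to-GPU boundary
# per generated token; N GPUs have N-1 boundaries.
boundary_bytes_per_token() {
  echo $(( $1 * $2 * ($3 - 1) ))
}

# Assumed: hidden size 8192, fp32 activations (4 bytes), 4 GPUs
boundary_bytes_per_token 8192 4 4
```

That is about 96 KiB per token across all three boundaries. At 10 tokens/sec it amounts to roughly 1 MB/s, a tiny fraction of what even an x8 link carries. Prompt processing moves a whole batch of tokens at once, so it transfers more, but the same scaling applies.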

For row vs layer details: llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition.

When 4× 1080 Ti Is the Right Build

Genuinely yes if:

You already own 2+ 1080 Tis and adding 2 more costs ~₩500K
You need 40+ GB VRAM at minimum cost
You have HEDT board / spare PCIe lanes / Threadripper Pro
You're doing research / hobby, not production (electricity matters less)
You want to host 2-3 medium models simultaneously
Llama 70B occasional use justifies the build

When 4× 1080 Ti Is the Wrong Build

Honestly, more cases than people admit:

You're optimizing for tokens/sec: single RTX 3090 (₩1.1M used) beats 4× 1080 Ti on every model that fits in 24 GB
24/7 operation matters: 1000W × 24h × 365d × ₩200/kWh = ~₩1.75M/year in electricity, versus ~₩600K/year for a 350W RTX 3090. The ~₩1.1M annual difference covers the price of a used 3090 in about a year.
Production deployment: vLLM (Volta+) and most modern serving frameworks do not support Pascal. You're stuck with llama.cpp / Ollama.
Fine-tuning: Pascal can do QLoRA on smaller models slowly. Full fine-tuning on 70B requires modern hardware.
Noise / room constraints: 1000W in a closed home office is brutal. Cards screaming at 100% fan ≈ hairdryer noise level.
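The electricity figures in the list above are easy to rerun with your own tariff. A small sketch assuming constant full-load draw around the clock, which is the worst case; `annual_cost_krw` is a made-up helper name.

```shell
# annual_cost_krw WATTS WON_PER_KWH -> yearly electricity cost at 24/7 load
annual_cost_krw() {
  echo $(( $1 * 24 * 365 * $2 / 1000 ))
}

echo "4x 1080 Ti (1000W): ₩$(annual_cost_krw 1000 200)"
echo "RTX 3090 (350W):    ₩$(annual_cost_krw 350 200)"
```

Real inference boxes idle much of the day, so treat these as upper bounds and plug in your measured average wattage for a closer number.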

For most people in 2026 starting fresh: a single used RTX 3090 ($800-$900) is a saner choice than a 4× 1080 Ti rig. The 4-card build is for people specifically targeting Llama 70B+ at minimum capital cost, accepting the operational tradeoffs.

FAQ

Q: Can I mix 1080 Ti with another GPU (e.g., 3 × 1080 Ti + 1 × 3090)?

Yes for llama.cpp / Ollama. Use --tensor-split weights to allocate by VRAM (e.g., --tensor-split 11,11,11,24 if 4th card is 3090). The mixed setup is bandwidth-limited by the slowest card, but more VRAM is more VRAM.
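Rather than typing the weights by hand, you can derive them from what the driver reports. This sketch turns per-card memory totals (in MiB) into a comma-separated list rounded to whole GiB. The name `split_weights` and the sample values are illustrative; llama.cpp treats the numbers as proportions, so the rounding is only cosmetic.

```shell
# split_weights: reads one memory.total value in MiB per line on stdin and
# prints comma-separated weights rounded to whole GiB.
split_weights() {
  awk '{ printf "%s%d", (NR > 1 ? "," : ""), int($1 / 1024 + 0.5) } END { print "" }'
}

# Sample values for 3x 1080 Ti + 1x RTX 3090; on the rig, pipe in:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
printf '11264\n11264\n11264\n24576\n' | split_weights
```

The result can be passed straight through, for example --tensor-split "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | split_weights)".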

Q: Do mining-recovered 1080 Tis work for this?

Often yes, with caveats. Mining stresses VRAM modules and PCIe contacts. Test extensively: run gpu-burn for 6+ hours, watch for memory errors. Re-pad thermal pads if cards run hot. Prefer non-mining secondhand if available; mining cards typically cost ₩50K less for a reason.

Q: Why not just use cloud APIs at this cost?

Comparison: ₩2.1M one-time + ~₩150K/month electricity vs Claude API at ~₩200-300/month for hobby usage. For pure cost on small workloads, cloud wins. The 4× 1080 Ti makes sense if (a) you value privacy/local data, (b) you have heavy token volume (>5-10M/month), (c) you're learning multi-GPU systems engineering as a goal in itself.

Q: Can I run two simultaneous large models (Llama 70B + Mixtral)?

Llama 70B Q4 (42 GB) + Mixtral 8×7B Q4 (27 GB) = 69 GB. Doesn't fit in 44 GB. You'd need Q3 or IQ-quant variants. With OLLAMA_MAX_LOADED_MODELS=2 and aggressive quantization, technically yes; quality-wise marginal.

Q: How much does 4× 1080 Ti idle?

Per-card idle: ~30-50W. Four cards idle: 120-200W. Plus CPU + system: ~250-350W total idle. Not great. Modern 4090 is ~15W idle.

Q: Can I sell my 4× 1080 Ti build in 2 years?

1080 Ti residual value in 2028 is probably ₩100K each. Half of today's value. Plan for that.

Q: ROCm / AMD alternative — 4× MI50 32GB instead?

Used MI50 32GB cards exist (~₩300K each = ₩1.2M for 4 = 128GB VRAM!) but ROCm support for inference is bumpy compared to CUDA. llama.cpp supports ROCm but with more rough edges than CUDA. Worth considering if you're patient and want extreme VRAM cheaply.

Q: Why not Tesla P40 (24GB Pascal) instead?

P40 24GB at ~₩400K each = ₩1.6M for two = 48 GB combined. Fewer PCIe lanes needed (only 2 cards). Catch: the P40 has no display output (compute-only), uses passive cooling (needs a custom fan shroud), and the used market is competitive. 2× P40 is the "more sophisticated" Pascal route; 4× 1080 Ti is the "scrappier" route.

Q: How long until this build is obsolete?

Already partially. Pascal lacks tensor cores (and FP16 on GP102 runs at a small fraction of FP32 speed), Flash Attention 2 (Ampere+), and vLLM support. llama.cpp will likely maintain Pascal support through ~2027-2028, then taper. For 2-3 year hobby use, fine. For long-term investment, no.

Q: I have a Threadripper Pro WRX80 — should I just use four 3090s?

If budget allows, yes — 4× 3090 = 96 GB combined VRAM with vastly better per-card performance. ~₩4-5M total. But you're at "small-scale prosumer LLM lab" cost. 4× 1080 Ti is the entry-level version of that.

Closing — The One-Sentence Verdict

If you already own 2-3 GTX 1080 Tis and an HEDT motherboard, rounding up to four cards for ~₩400-500K total marginal cost gives you 44 GB of VRAM that runs Llama 70B at 8-12 t/s — the cheapest path to local 70B inference in 2026. If you're starting from zero, buy a used RTX 3090 instead unless you specifically need >24 GB combined for less than $1,500.
