
Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti (Actual Benchmarks)

How to make Ollama actually use both GTX 1080 Ti cards without NVLink — environment variables, tensor split configuration, and real tokens/sec benchmarks for 13B and 30B-class models. Where PCIe becomes the bottleneck, what works versus what just looks like it's working, and how the same setup compares to a single 3090.


The Question

You have two GTX 1080 Tis in one rig. Combined, that is 22 GB of VRAM, enough on paper for Mixtral 8×7B Q4 or Yi-34B Q4. But Ollama's default multi-GPU behavior on Pascal is inconsistent: sometimes it loads the model on one card only and runs out of memory, sometimes it splits but runs slowly, and sometimes it works perfectly and you can't reproduce it next time.

This guide is the practical answer for a specific, common case: 2× GTX 1080 Ti, no NVLink (the 1080 Ti has no NVLink connector), Ollama as the runtime, single user. It covers:

  • The environment variables that matter
  • Whether splitting a model across both cards actually speeds things up versus single-card
  • Where the PCIe bottleneck shows up
  • Real benchmarks for 13B and 30B-class models
  • When the dual setup is a win versus when one card is faster

Setup Used for All Benchmarks

| Component | Spec |
| --- | --- |
| GPU 1 | GTX 1080 Ti 11 GB, PCIe 3.0 x16 |
| GPU 2 | GTX 1080 Ti 11 GB, PCIe 3.0 x8 (chipset lane) |
| CPU | Intel i7-9700K |
| RAM | 64 GB DDR4-3200 |
| OS | Ubuntu 22.04 LTS |
| NVIDIA driver | 555.x |
| CUDA | 12.4 |
| Ollama | v0.6.2 |

Note the asymmetric PCIe lanes — this is the realistic setup for most consumer boards (one full x16, one x8 or x4 via chipset). It matters for results, as you'll see.

Step 1 — Verify Both Cards Are Visible

nvidia-smi --query-gpu=index,name,memory.total,pcie.link.gen.current,pcie.link.width.current --format=csv

Expected output:

index, name, memory.total [MiB], pcie.link.gen.current, pcie.link.width.current
0, NVIDIA GeForce GTX 1080 Ti, 11264 MiB, 3, 16
1, NVIDIA GeForce GTX 1080 Ti, 11264 MiB, 3, 8

If one card shows up at PCIe 1.x x4, check your BIOS for PCIe link speed settings — chipset lanes sometimes negotiate down. You want at least Gen 3 x8 on both for usable dual-GPU inference.
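The same check can be scripted. The following is a minimal sketch, not part of the original setup: it parses the CSV that the nvidia-smi query above prints and lists any card below Gen 3 x8. The function name `weak_links` and the sample text (with GPU 1 deliberately degraded) are made up for illustration.

```python
import csv
import io

def weak_links(nvidia_smi_csv: str, min_gen: int = 3, min_width: int = 8) -> list[str]:
    """Return a description of every GPU whose current PCIe link is below the floor."""
    reader = csv.reader(io.StringIO(nvidia_smi_csv), skipinitialspace=True)
    rows = [row for row in reader if row]
    problems = []
    for index, name, _mem, gen, width in rows[1:]:  # rows[0] is the CSV header
        if int(gen) < min_gen or int(width) < min_width:
            problems.append(f"GPU {index} ({name}): Gen {gen} x{width}")
    return problems

# Hypothetical sample: same shape as the expected output, but GPU 1 negotiated down.
sample = """index, name, memory.total [MiB], pcie.link.gen.current, pcie.link.width.current
0, NVIDIA GeForce GTX 1080 Ti, 11264 MiB, 3, 16
1, NVIDIA GeForce GTX 1080 Ti, 11264 MiB, 1, 4
"""
print(weak_links(sample))
```

Idle cards often report a lower link generation because of power saving, so it is worth running the query while a model is generating before blaming the BIOS.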

Step 2 — Environment Variables That Matter

Ollama exposes several environment variables for multi-GPU behavior. The ones that actually matter:

export CUDA_VISIBLE_DEVICES=0,1     # Both cards available to Ollama
export OLLAMA_NUM_PARALLEL=1        # Don't try to run requests in parallel
export OLLAMA_SCHED_SPREAD=1        # Spread model layers across GPUs (Ollama 0.4+)
export OLLAMA_KV_CACHE_TYPE=f16     # KV cache precision

What each does:

  • CUDA_VISIBLE_DEVICES: makes both GPUs visible. Without this Ollama may pick up only one
  • OLLAMA_NUM_PARALLEL=1: with two 11 GB cards we don't have headroom for concurrent requests; sequential is faster overall
  • OLLAMA_SCHED_SPREAD=1: this is the critical one. By default Ollama tries to fit the whole model on one GPU. With SCHED_SPREAD=1 it splits layers across available GPUs from the start
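  • OLLAMA_KV_CACHE_TYPE: defaults to f16. On Pascal, f16 is fine as a storage format, but the 1080 Ti's FP16 arithmetic rate is a small fraction of its FP32 rate, so f16 here saves memory rather than time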
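To see why the KV cache type matters on 11 GB cards, here is a rough size estimate. It is a sketch under stated assumptions: the model shape is Llama 3.1 8B's published configuration (32 layers, 8 KV heads, head dimension 128), the 8192-token context is an arbitrary example, and the quantized sizes assume ggml's 32-value blocks with a 2-byte scale. As far as I know, Ollama only honors the quantized cache types when flash attention is enabled, so treat those rows as theoretical on this hardware.

```python
# Bytes per cached element for f16 and two quantized cache types.
# q8_0 and q4_0 store blocks of 32 values plus a 2-byte scale (34 and 18 bytes per block).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, cache_type="f16"):
    """Approximate KV cache size: keys and values for every layer and position."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len  # 2 = keys + values
    return int(elements * BYTES_PER_ELEMENT[cache_type])

# Assumed shape: Llama 3.1 8B at an 8192-token context.
for cache_type in BYTES_PER_ELEMENT:
    gib = kv_cache_bytes(32, 8, 128, 8192, cache_type) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
```

At f16 this works out to exactly 1 GiB for the assumed shape, which is the kind of overhead that decides whether a model that "fits" on paper actually loads.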

Set these in /etc/systemd/system/ollama.service.d/override.conf or in your shell before starting ollama serve:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_KV_CACHE_TYPE=f16"

Then:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Step 3 — Verify It's Actually Using Both GPUs

# Start a model
ollama run llama3.1:8b "Hello"

# In another terminal, watch GPU memory
watch -n 0.5 nvidia-smi

For an 8B model on a dual setup with SCHED_SPREAD=1, you should see roughly 2-3 GB on each card. Without SCHED_SPREAD, you'd see ~5 GB on GPU 0 and ~0 on GPU 1 — meaning only one card is doing work.

For larger models the asymmetry is the test: a 30B Q4 model needs ~18 GB. If you see 11 GB on one card and 7 GB on the other, splitting is working.
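If you would rather not eyeball `watch`, the check can be automated. This sketch assumes you feed it the output of `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits`; the 1024 MiB threshold and both function names are illustrative choices, not anything Ollama defines.

```python
def memory_per_gpu(csv_text: str) -> dict[int, int]:
    """Parse `index, memory.used` rows (no header, no units) into {gpu_index: used_MiB}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage

def is_split(usage: dict[int, int], min_mib: int = 1024) -> bool:
    """True when more than one GPU is present and each holds at least min_mib."""
    return len(usage) > 1 and all(used >= min_mib for used in usage.values())

# Hypothetical readings: an 8B model spread across both cards, then pinned to one.
print(is_split(memory_per_gpu("0, 2800\n1, 2600")))
print(is_split(memory_per_gpu("0, 5100\n1, 12")))
```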

Step 4 — Benchmarks (Single vs Dual)

The interesting question: does splitting actually speed anything up, or just enable larger models at the same speed?

All numbers below are batch 1, 256-token prompt, 256-token generation, warm cache, average of 5 runs.
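For readers who want to reproduce this kind of measurement, one possible harness is sketched below. It is not the script used for the numbers in this article. It relies on the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama's `/api/generate` endpoint returns for non-streaming requests; the default host, the prompt, and the function names are assumptions.

```python
import json
import urllib.request

def tokens_per_sec(response: dict) -> float:
    """Generation speed from Ollama's timing fields (eval_duration is in nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9

def bench(model: str, prompt: str, runs: int = 5,
          host: str = "http://localhost:11434") -> float:
    """Average tokens/sec over several non-streaming /api/generate calls."""
    speeds = []
    for _ in range(runs):
        body = json.dumps({
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_predict": 256},  # cap generation at 256 tokens
        }).encode()
        request = urllib.request.Request(
            f"{host}/api/generate",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as reply:
            speeds.append(tokens_per_sec(json.load(reply)))
    return sum(speeds) / len(speeds)
```

With the server running, `bench("llama3.1:8b", "Explain PCIe lanes in one paragraph.")` returns the average; send one throwaway request first so model loading does not count against the first run.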

8B class (fits on one card)

| Configuration | Model | tokens/sec |
| --- | --- | --- |
| Single 1080 Ti | Llama 3.1 8B Q4_K_M | 25.4 |
| Dual 1080 Ti, SCHED_SPREAD | Llama 3.1 8B Q4_K_M | 18.7 |
| Single 1080 Ti | Llama 3.1 8B Q8_0 | 19.2 |
| Dual 1080 Ti, SCHED_SPREAD | Llama 3.1 8B Q8_0 | 14.3 |

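Splitting an 8B model is slower than running it on a single card, by roughly a quarter in both comparisons above. This is the most counterintuitive finding. With a layer split the two cards take turns rather than working at the same time, so at batch size 1 there is no parallelism to gain, and every token pays for a transfer across PCIe between the cards. For models that fit on one card, disable SCHED_SPREAD or pin to one GPU with CUDA_VISIBLE_DEVICES=0.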
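The size of the penalty follows directly from the table. A short worked calculation (the helper name is just for illustration):

```python
def slowdown_pct(single_tps: float, dual_tps: float) -> float:
    """Percentage of throughput lost when moving from one card to two."""
    return (1 - dual_tps / single_tps) * 100

# Figures from the 8B table above.
print(f"Q4_K_M: {slowdown_pct(25.4, 18.7):.1f}% slower on dual")  # 26.4% slower on dual
print(f"Q8_0:   {slowdown_pct(19.2, 14.3):.1f}% slower on dual")  # 25.5% slower on dual
```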

13B class (tight on one card)

| Configuration | Model | tokens/sec | Notes |
| --- | --- | --- | --- |
| Single 1080 Ti | Llama 3.1 13B Q4_K_M | OOM | Won't load |
| Dual 1080 Ti, SCHED_SPREAD | Llama 3.1 13B Q4_K_M | 14.1 | Required |
| Dual 1080 Ti, SCHED_SPREAD | Qwen 3 14B Q4_K_M | 12.8 | Tight but OK |
| Dual 1080 Ti, SCHED_SPREAD | Phi-4 14B Q4_K_M | 11.6 | Slower than 8B at Q8 |

Dual setup is the only way to run 13-14B class on 1080 Tis. Throughput drops to roughly half of single-card 8B speed.

30B class (only possible split)

| Configuration | Model | tokens/sec | VRAM split |
| --- | --- | --- | --- |
| Dual 1080 Ti, SCHED_SPREAD | Mixtral 8×7B Q4_K_M | 14-18 | 10.7 GB + 10.5 GB |
| Dual 1080 Ti, SCHED_SPREAD | Yi-34B Q4_K_M | 7-10 | 10.8 GB + 9.4 GB |
| Dual 1080 Ti, SCHED_SPREAD | Qwen 3 30B-A3B Q4 (MoE) | 9-12 | 9.5 GB + 9.7 GB |

These models are only practical on the dual setup. Mixtral 8×7B is the sweet spot: its MoE architecture activates only ~13B parameters per token, so it generates faster than a dense 30B-class model with a comparable memory footprint.

Per-card power draw under load

GPU 0: 240-250W  (PCIe x16, leader)
GPU 1: 180-220W  (PCIe x8, follower)

The asymmetric utilization is normal. SCHED_SPREAD splits by layers, and the lower-bandwidth card spends more time waiting on transfers.

PCIe Lanes — The Single Biggest Variable

Same hardware, different PCIe configurations, same Mixtral 8×7B Q4_K_M:

| PCIe configuration | tokens/sec |
| --- | --- |
| Both at PCIe 3.0 x16 (HEDT/server board) | 18-22 |
| x16/x8 (typical consumer board, this article) | 14-18 |
| x16/x4 (cheap motherboard) | 9-12 |
| x8/x4 (riser cables on mining frame) | 6-9 |

With both cards at x16, Mixtral runs roughly 20-30% faster than on the x16/x8 board used here, and close to twice as fast as on x16/x4. This is why server boards and HEDT platforms (X299, TR4) hold their value for multi-GPU LLM setups.

If you're stuck on a consumer board, keep your most-used model small enough to fit on a single card (the 8B class in these tests).
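For context on what each lane configuration can move at best, the theoretical link bandwidth is easy to compute from the PCIe per-lane transfer rate and line encoding. These are peak one-direction figures, not measured throughput, and they do not by themselves predict tokens/sec.

```python
# Per-lane transfer rate in GT/s and line-encoding efficiency for each PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),
    2: (5.0, 8 / 10),
    3: (8.0, 128 / 130),
    4: (16.0, 128 / 130),
}

def pcie_bandwidth_gbs(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a link of the given width."""
    rate_gt, efficiency = PCIE_GENERATIONS[gen]
    return rate_gt * efficiency / 8 * lanes  # divide by 8 to convert bits to bytes

for lanes in (16, 8, 4):
    print(f"PCIe 3.0 x{lanes}: {pcie_bandwidth_gbs(3, lanes):.2f} GB/s")
```

Each halving of lane count halves the ceiling, which lines up with the direction of the table above even though the measured drop is not strictly proportional.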

Common Multi-GPU Problems

Problem: "Ollama only uses one GPU"

Most common cause: OLLAMA_SCHED_SPREAD not set or Ollama version is older than 0.4. Either upgrade Ollama or pin layer counts via the deprecated num_gpu parameter in the Modelfile.

Problem: "Model loads but generation is 1-2 tokens/sec"

Layer offload to CPU. Even with two 1080 Tis, if the model plus KV cache doesn't fit in the combined VRAM, Ollama falls back to running some layers on the CPU and throughput collapses. Check nvidia-smi: if either card is at ~10.9 / 11 GB and the model is too big, you're offloading. Use a smaller quantization or reduce the context length.
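Offload can also be confirmed from Ollama itself instead of inferring it from nvidia-smi. The sketch below assumes the `/api/ps` endpoint, which lists loaded models with a total `size` and a `size_vram` field; the helper names and default host are illustrative.

```python
import json
import urllib.request

def cpu_fraction(model: dict) -> float:
    """Share of a loaded model that is not resident in VRAM (0.0 means fully on GPU)."""
    return 1 - model["size_vram"] / model["size"]

def loaded_models(host: str = "http://localhost:11434") -> list[dict]:
    """Fetch the list of currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{host}/api/ps") as reply:
        return json.load(reply)["models"]
```

Anything meaningfully above zero from `cpu_fraction` means some layers are running on the CPU, which matches the 1-2 tokens/sec symptom.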

Problem: "Same model is slower on dual than single"

That's the expected behavior for small models that fit on one card. Disable SCHED_SPREAD or use CUDA_VISIBLE_DEVICES=0 for those models.
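One way to encode that rule is a small helper that picks the server environment from the model size. This is a sketch only: the 2 GB headroom for KV cache and CUDA overhead is a guess, and because these variables are read when `ollama serve` starts, switching between the two modes means restarting the server.

```python
def serve_env(model_gb: float, card_gb: float = 11.0, headroom_gb: float = 2.0) -> dict[str, str]:
    """Environment for `ollama serve`: pin to one card if the model fits, else spread."""
    if model_gb + headroom_gb <= card_gb:
        # With a single visible device there is nothing to spread across.
        return {"CUDA_VISIBLE_DEVICES": "0"}
    return {"CUDA_VISIBLE_DEVICES": "0,1", "OLLAMA_SCHED_SPREAD": "1"}

print(serve_env(5.0))   # 8B Q4 class: single card
print(serve_env(18.0))  # 30B Q4 class: both cards
```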

Problem: "Random hangs or restarts during multi-GPU inference"

PCIe stability issue. Usually one of:

  • Power supply marginal — 2× 1080 Ti pulls ~500 W under load, needs a quality 850 W+ PSU
  • PCIe riser cable (if used) is bad — try a different one
  • Mining-recovered card with degraded PCIe contacts — try a different card

Run dmesg | grep -i pcie after a hang. If you see pcieport: AER: Multiple Corrected error received, that's a hardware/cable issue.
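When the log is long, counting the errors per root port helps identify which slot is affected. A minimal sketch, assuming kernel messages of the form shown above; the sample lines and port addresses are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "pcieport 0000:00:01.1: AER: Multiple Corrected error received: ..."
AER_LINE = re.compile(r"pcieport (\S+): AER: .*error received", re.IGNORECASE)

def aer_errors_by_port(dmesg_text: str) -> Counter:
    """Count AER error reports per PCIe port address found in dmesg output."""
    return Counter(match.group(1) for match in AER_LINE.finditer(dmesg_text))

# Hypothetical excerpt for illustration only.
sample = """\
[ 1234.567890] pcieport 0000:00:01.1: AER: Multiple Corrected error received: 0000:02:00.0
[ 1240.000001] pcieport 0000:00:01.1: AER: Corrected error received: 0000:02:00.0
[ 1250.000002] nvidia 0000:02:00.0: enabling device (0000 -> 0003)
"""
print(aer_errors_by_port(sample))
```

The port address can then be matched against `lspci -tv` to see which physical slot, and therefore which card or riser, sits behind it.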

Problem: "Ollama works but OLLAMA_SCHED_SPREAD doesn't seem to take effect"

Check Ollama logs (journalctl -u ollama -f) — sometimes env vars don't propagate when set via the wrong systemd override. Test directly:

sudo systemctl stop ollama
OLLAMA_SCHED_SPREAD=1 ollama serve

Then run a large model in another terminal and watch nvidia-smi.
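Before falling back to the manual test, it can help to ask systemd what environment the unit actually has. The sketch below parses the output of `systemctl show ollama --property=Environment`; the helper names are illustrative and the unit name `ollama` is taken from the commands above.

```python
import shlex
import subprocess

def parse_environment(show_output: str) -> dict[str, str]:
    """Parse the `Environment=...` line printed by `systemctl show`."""
    line = show_output.strip()
    if not line.startswith("Environment="):
        return {}
    assignments = shlex.split(line[len("Environment="):])
    return dict(item.split("=", 1) for item in assignments)

def service_environment(unit: str = "ollama") -> dict[str, str]:
    """Ask systemd which environment variables a unit will be started with."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=Environment"],
        capture_output=True, text=True, check=True,
    )
    return parse_environment(result.stdout)
```

If `service_environment().get("OLLAMA_SCHED_SPREAD")` comes back as `None`, the override file is not being picked up and the problem is in systemd, not in Ollama.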

When the Dual Setup Is Worth It

Dual 1080 Ti makes sense if:

  • You already own both cards (zero hardware cost)
  • You want to occasionally run Mixtral 8×7B or Yi-34B class models
  • Your main use is 8B fast (single card) and 30B for harder tasks (dual)

Dual 1080 Ti is not worth pursuing if:

  • You'd have to buy a second 1080 Ti now (used $200+). A single 4060 Ti 16 GB is faster and draws less power
  • You only use models that fit on one card (the 8B class in these tests), where a single card is faster
  • Your PSU is borderline (need 850 W+)
  • Your motherboard only gives x4 to the second slot

How This Compares to a Single 3090

Same Mixtral 8×7B Q4_K_M:

| Setup | tokens/sec | VRAM used | Power |
| --- | --- | --- | --- |
| 2× 1080 Ti (this guide) | 14-18 | 21 GB combined | ~470 W |
| 1× RTX 3090 24 GB | 55-70 | 21 GB | ~340 W |

A single 3090 is roughly 4× faster at about 70% of the power. For new purchases, the 3090 is the clear win. Dual 1080 Ti only wins when the cards are already a sunk cost.
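The ratios can be checked from the table by taking the midpoint of each tokens/sec range. Using midpoints is a simplification chosen here for illustration, not a measured average.

```python
def midpoint(low: float, high: float) -> float:
    return (low + high) / 2

# Figures from the comparison table above.
dual_tps, dual_watts = midpoint(14, 18), 470
rtx_tps, rtx_watts = midpoint(55, 70), 340

speed_ratio = rtx_tps / dual_tps
power_ratio = rtx_watts / dual_watts
efficiency_ratio = (rtx_tps / rtx_watts) / (dual_tps / dual_watts)

print(f"speed: {speed_ratio:.1f}x")                   # speed: 3.9x
print(f"power: {power_ratio:.0%}")                    # power: 72%
print(f"tokens per watt: {efficiency_ratio:.1f}x")    # tokens per watt: 5.4x
```

On a tokens-per-watt basis the gap is wider than the raw speed difference suggests.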

FAQ

Q: Does NVLink help on 1080 Ti? No. The 1080 Ti has no NVLink connector; NVLink is found on Tesla cards, some Quadro cards, and the RTX 3090. GeForce Pascal cards have SLI (deprecated for compute) but no high-bandwidth GPU-to-GPU link. All transfers go through PCIe.

Q: Can I mix a 1080 Ti and a different GPU? Yes, but Ollama splits by available memory per card. Pairing a 1080 Ti with a 3060 12 GB is mostly limited by the slower/smaller card. Pairing a 1080 Ti with a 3090 24 GB — the 3090 alone will outperform the pair for any model fitting in 24 GB.

Q: Does CPU matter much? For inference: not much. Any modern 8-core CPU is fine. Where the CPU does matter: tokenizing the prompt, model loading, and CPU offload for models that don't fit in VRAM.

Q: What about tensor_split parameter directly in the Modelfile? Ollama doesn't expose tensor_split as a first-class option — it's hidden behind SCHED_SPREAD. If you need fine control, drop to llama.cpp directly:

llama-cli -m model.gguf -ngl 99 --split-mode layer --tensor-split 11,11 ...

See llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition for the fine-grained version.
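The values passed to `--tensor-split` are proportions, so they can be derived from how much VRAM is actually free on each card. A small sketch; the function name and the example readings are hypothetical.

```python
def tensor_split_arg(free_mib: list[int]) -> str:
    """Build a --tensor-split value proportional to the free VRAM on each GPU."""
    total = sum(free_mib)
    return ",".join(f"{mib / total:.2f}" for mib in free_mib)

print(tensor_split_arg([11000, 11000]))  # 0.50,0.50
print(tensor_split_arg([11000, 9000]))   # 0.55,0.45
```

An uneven split like the second one is useful when one card also drives the desktop and has less memory to spare.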

Q: Will Ollama support tensor parallelism in the future? Ollama's multi-GPU is currently "layer split" — partitioning model layers across GPUs. True tensor parallelism (splitting within each layer's matrix multiplies) is more complex; vLLM does it but Ollama doesn't. For 1080 Ti this is moot since vLLM doesn't support Pascal.

Closing — One-Line Summary

For 2× GTX 1080 Ti without NVLink: set OLLAMA_SCHED_SPREAD=1 for 13B+ models, disable it (single card) for 8B, and accept that PCIe lane configuration is your single biggest performance variable.

