Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti (Actual Benchmarks)

The Question

You have two GTX 1080 Tis in one rig. Combined you have 22 GB of VRAM — enough on paper for Mixtral 8×7B Q4 or Yi-34B Q4. But Ollama by default behaves weirdly with multi-GPU on Pascal: sometimes it loads the model only on one card and OOMs, sometimes it splits but runs slowly, sometimes it works perfectly and you can't reproduce it next time.

This guide is the practical answer for a specific common case: 2× GTX 1080 Ti, no NVLink (consumer Pascal cards don't support it), Ollama as the runtime, single user. It covers:

The environment variables that matter
Whether splitting a model across both cards actually speeds things up versus single-card
Where the PCIe bottleneck shows up
Real benchmarks for 13B and 30B-class models
When the dual setup is a win versus when one card is faster

Setup Used for All Benchmarks

Component	Spec
GPU 1	GTX 1080 Ti 11 GB, PCIe 3.0 x16
GPU 2	GTX 1080 Ti 11 GB, PCIe 3.0 x8 (chipset lane)
CPU	Intel i7-9700K
RAM	64 GB DDR4-3200
OS	Ubuntu 22.04 LTS
NVIDIA driver	555.x
CUDA	12.4
Ollama	v0.6.2

Note the asymmetric PCIe lanes — this is the realistic setup for most consumer boards (one full x16, one x8 or x4 via chipset). It matters for results, as you'll see.

Step 1 — Verify Both Cards Are Visible

nvidia-smi --query-gpu=index,name,memory.total,pcie.link.gen.current,pcie.link.width.current --format=csv

Expected output:

index, name, memory.total [MiB], pcie.link.gen.current, pcie.link.width.current
0, NVIDIA GeForce GTX 1080 Ti, 11264 MiB, 3, 16
1, NVIDIA GeForce GTX 1080 Ti, 11264 MiB, 3, 8

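If one card shows up at PCIe 1.x x4, first re-run the query while the GPU is under load: NVIDIA cards lower the link generation (not the width) at idle to save power, so a Gen 1 reading on an idle card is normal. If the link stays low under load, check your BIOS for PCIe link speed settings, since chipset lanes sometimes negotiate down. You want at least Gen 3 x8 on both for usable dual-GPU inference.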

Step 2 — Environment Variables That Matter

Ollama exposes several environment variables for multi-GPU behavior. The ones that actually matter:

export CUDA_VISIBLE_DEVICES=0,1     # Both cards available to Ollama
export OLLAMA_NUM_PARALLEL=1        # Don't try to run requests in parallel
export OLLAMA_SCHED_SPREAD=1        # Spread model layers across GPUs (Ollama 0.4+)
export OLLAMA_KV_CACHE_TYPE=f16     # KV cache precision

What each does:

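CUDA_VISIBLE_DEVICES: controls which GPUs Ollama can see. Ollama normally detects every GPU on its own, so setting 0,1 mainly guards against an inherited value that hides one card
OLLAMA_NUM_PARALLEL=1: with two 11 GB cards we don't have headroom for concurrent requests; sequential is faster overall
OLLAMA_SCHED_SPREAD=1: this is the critical one. By default Ollama loads a model onto a single GPU whenever it fits there and only spreads it across GPUs when it doesn't. With SCHED_SPREAD=1 it always splits layers across all available GPUs
OLLAMA_KV_CACHE_TYPE: defaults to f16. Pascal stores f16 without trouble, but consumer Pascal runs FP16 arithmetic at a small fraction of its FP32 rate, so f16 here is a memory choice and not a speed one. The quantized alternatives (q8_0, q4_0) shrink the cache but require flash attention to be enabled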

Set these in /etc/systemd/system/ollama.service.d/override.conf or in your shell before starting ollama serve:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_KV_CACHE_TYPE=f16"

Then:

sudo systemctl daemon-reload
sudo systemctl restart ollama
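
To confirm the override actually reached the service, `systemctl show ollama --property=Environment` prints the environment systemd will pass to it. Here is a small sketch that compares that line against the settings above, assuming the output is a single `Environment=` line of space-separated KEY=VALUE pairs; the helper names are hypothetical and the sample line is made up.

```python
import shlex

def parse_systemd_environment(line):
    """Parse one 'Environment=...' line from `systemctl show` into a dict."""
    prefix = "Environment="
    if not line.startswith(prefix):
        raise ValueError("expected a line starting with 'Environment='")
    pairs = shlex.split(line[len(prefix):])
    return dict(pair.split("=", 1) for pair in pairs)

def missing_settings(actual, expected):
    """Return expected settings that are absent or different in `actual`."""
    return {key: value for key, value in expected.items() if actual.get(key) != value}

expected = {
    "CUDA_VISIBLE_DEVICES": "0,1",
    "OLLAMA_NUM_PARALLEL": "1",
    "OLLAMA_SCHED_SPREAD": "1",
    "OLLAMA_KV_CACHE_TYPE": "f16",
}

# Illustrative output where the SCHED_SPREAD line never made it into the override.
line = "Environment=CUDA_VISIBLE_DEVICES=0,1 OLLAMA_NUM_PARALLEL=1 OLLAMA_KV_CACHE_TYPE=f16"
print(missing_settings(parse_systemd_environment(line), expected))
```

An empty result means all four variables are set as intended.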

Step 3 — Verify It's Actually Using Both GPUs

# Start a model
ollama run llama3.1:8b "Hello"

# In another terminal, watch GPU memory
watch -n 0.5 nvidia-smi

For an 8B model on a dual setup with SCHED_SPREAD=1, you should see roughly 2-3 GB on each card. Without SCHED_SPREAD, you'd see ~5 GB on GPU 0 and ~0 on GPU 1 — meaning only one card is doing work.
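
If you would rather script this check than eyeball it, the sketch below reads the output of `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` (one integer in MiB per GPU) and reports which cards hold a meaningful share of the model. The 1024 MiB threshold is an assumption chosen to ignore desktop and driver overhead, and the sample readings are made up.

```python
def parse_memory_used(output):
    """Turn nvidia-smi's one-number-per-line output into a list of MiB values."""
    return [int(line) for line in output.splitlines() if line.strip()]

def gpus_holding_model(used_mib, min_mib=1024):
    """Return the indices of GPUs using at least `min_mib` of memory."""
    return [index for index, used in enumerate(used_mib) if used >= min_mib]

# Illustrative readings: a split 8B model vs. the same model on one card.
split_output = "2650\n2410\n"
single_output = "5100\n12\n"

print(gpus_holding_model(parse_memory_used(split_output)))
print(gpus_holding_model(parse_memory_used(single_output)))
```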

For larger models the asymmetry is the test: a 30B Q4 model needs ~18 GB. If you see 11 GB on one card and 7 GB on the other, splitting is working.
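
The ~18 GB figure can be reproduced with back-of-the-envelope arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below assumes about 4.8 bits per weight for Q4_K_M, which is an approximation that varies by model, and it ignores KV cache and runtime overhead.

```python
def weights_gb(params_billion, bits_per_weight=4.8):
    """Approximate size of quantized weights in GB (weights only)."""
    return params_billion * bits_per_weight / 8

for params in (8, 30):
    print(f"{params}B at ~4.8 bits/weight: about {weights_gb(params):.1f} GB")
```

The same formula gives roughly 5 GB for an 8B model, which matches the single-card reading described above.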

Step 4 — Benchmarks (Single vs Dual)

The interesting question: does splitting actually speed anything up, or just enable larger models at the same speed?

All numbers below are batch 1, 256-token prompt, 256-token generation, warm cache, average of 5 runs.
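
For anyone reproducing the numbers: Ollama's /api/generate response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds), so generation speed can be computed per run and averaged. This is a sketch of that calculation only, not the exact harness used here, and the run values below are synthetic.

```python
from statistics import mean

def tokens_per_second(response):
    """Generation speed for one Ollama /api/generate response dict."""
    return response["eval_count"] / response["eval_duration"] * 1e9

def average_tps(responses):
    """Mean generation speed across several runs."""
    return mean(tokens_per_second(r) for r in responses)

# Synthetic example: 256 generated tokens per run, durations in nanoseconds.
runs = [
    {"eval_count": 256, "eval_duration": 10_000_000_000},
    {"eval_count": 256, "eval_duration": 10_240_000_000},
]
print(f"{average_tps(runs):.1f} tokens/sec")
```

Because `eval_duration` covers generation only, model load time and prompt processing do not distort the figure.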

8B class (fits on one card)

Configuration	Model	tokens/sec
Single 1080 Ti	Llama 3.1 8B Q4_K_M	25.4
Dual 1080 Ti, SCHED_SPREAD	Llama 3.1 8B Q4_K_M	18.7 ⬇
Single 1080 Ti	Llama 3.1 8B Q8_0	19.2
Dual 1080 Ti, SCHED_SPREAD	Llama 3.1 8B Q8_0	14.3 ⬇

Splitting an 8B model is slower than single-card. This is the most counterintuitive finding. With a layer split the two cards work one after the other on each token rather than in parallel, so for a single request the split adds PCIe transfer and synchronization cost without adding any compute throughput. For models that fit on one card, disable SCHED_SPREAD or pin to one GPU with CUDA_VISIBLE_DEVICES=0.
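
One way to read the table is as extra time per token. The sketch below converts the measured tokens/sec into milliseconds per token and takes the difference, which is the overhead implied by splitting. It is plain arithmetic on the numbers above and does not measure where that time goes.

```python
def per_token_ms(tokens_per_sec):
    """Time spent on one generated token, in milliseconds."""
    return 1000.0 / tokens_per_sec

def implied_overhead_ms(single_tps, dual_tps):
    """Extra per-token time of the dual setup over the single card."""
    return per_token_ms(dual_tps) - per_token_ms(single_tps)

# (label, single-card tokens/sec, dual-card tokens/sec) from the 8B table
for label, single, dual in [("Q4_K_M", 25.4, 18.7), ("Q8_0", 19.2, 14.3)]:
    extra = implied_overhead_ms(single, dual)
    slowdown = (1 - dual / single) * 100
    print(f"{label}: {extra:.1f} ms extra per token, {slowdown:.0f}% slower")
```

Both quantizations land at roughly 14 to 18 ms of added time per token, about a quarter off the single-card speed.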

13B class (tight on one card)

Configuration	Model	tokens/sec	Notes
Single 1080 Ti	Llama 3.1 13B Q4_K_M	OOM	Won't load
Dual 1080 Ti, SCHED_SPREAD	Llama 3.1 13B Q4_K_M	14.1	Required
Dual 1080 Ti, SCHED_SPREAD	Qwen 3 14B Q4_K_M	12.8	Tight but OK
Dual 1080 Ti, SCHED_SPREAD	Phi-4 14B Q4_K_M	11.6	Slower than 8B at Q8

Dual setup is the only way to run 13-14B class on 1080 Tis. Throughput drops to roughly half of single-card 8B speed.

30B class (only possible when split)

Configuration	Model	tokens/sec	VRAM split
Dual 1080 Ti, SCHED_SPREAD	Mixtral 8×7B Q4_K_M	14-18	10.7 GB + 10.5 GB
Dual 1080 Ti, SCHED_SPREAD	Yi-34B Q4_K_M	7-10	10.8 GB + 9.4 GB
Dual 1080 Ti, SCHED_SPREAD	Qwen 3 30B-A3B Q4 (MoE)	9-12	9.5 GB + 9.7 GB

These models are only practical on dual setup. Mixtral 8×7B is the sweet spot — its MoE architecture means each token only activates ~13B params, so it runs faster than dense 30B at the same memory footprint.

Per-card power draw under load

GPU 0: 240-250W  (PCIe x16, leader)
GPU 1: 180-220W  (PCIe x8, follower)

The asymmetric utilization is normal. SCHED_SPREAD splits by layers, and the lower-bandwidth card spends more time waiting on transfers.

PCIe Lanes — The Single Biggest Variable

Same hardware, different PCIe configurations, same Mixtral 8×7B Q4_K_M:

PCIe configuration	tokens/sec
Both at PCIe 3.0 x16 (HEDT/server board)	18-22
x16/x8 (typical consumer board, this article)	14-18
x16/x4 (cheap motherboard)	9-12
x8/x4 (riser cables on mining frame)	6-9

If both cards run at x16, multi-GPU inference is roughly 25% faster than the x16/x8 setup measured here and about twice as fast as x16/x4. This is why server boards and HEDT (X299, TR4) hold value for multi-GPU LLM setups.
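
To compare the configurations on one scale, the sketch below takes the midpoint of each range in the table and expresses it relative to the x16/x8 setup used in this article. Using midpoints is a simplification, since the table only gives ranges.

```python
# tokens/sec ranges from the PCIe table (Mixtral 8x7B Q4_K_M)
measured = {
    "x16/x16": (18, 22),
    "x16/x8": (14, 18),
    "x16/x4": (9, 12),
    "x8/x4": (6, 9),
}

def midpoint(value_range):
    """Middle of a (low, high) range."""
    low, high = value_range
    return (low + high) / 2

def relative_to(baseline, table):
    """Midpoint speed of each configuration divided by the baseline's."""
    base = midpoint(table[baseline])
    return {name: midpoint(r) / base for name, r in table.items()}

for name, ratio in relative_to("x16/x8", measured).items():
    print(f"{name}: {ratio:.2f}x")
```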

If you're stuck on a consumer board, pick a most-used model that fits on a single card (8B class at Q4 in the benchmarks above).

Common Multi-GPU Problems

Problem: "Ollama only uses one GPU"

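Most common cause: the model fits on one card, so Ollama keeps it there by default, and OLLAMA_SCHED_SPREAD is not set (or the Ollama version is older than 0.4). Either set the variable on a current Ollama, or set the number of GPU-offloaded layers via the deprecated num_gpu parameter in the Modelfile.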

Problem: "Model loads but generation is 1-2 tokens/sec"

Layer offload to CPU. Even with two 1080 Tis, if the combined VRAM doesn't fit the model + KV cache, Ollama falls back to CPU layers and throughput collapses. Check nvidia-smi: if either card sits at ~10.9 / 11 GB and generation is this slow, part of the model is running on the CPU. ollama ps confirms it by showing a CPU/GPU split in the PROCESSOR column. Use a smaller quantization or a shorter context length.
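
The KV cache share of that budget is easy to estimate: two tensors (keys and values) per layer, each sized by KV heads, head dimension, and context length. The sketch below plugs in the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) as an example and assumes an f16 cache at 2 bytes per value. Other models need their own numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size for a single sequence."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

GIB = 2 ** 30

for context_len in (4096, 8192, 32768):
    size = kv_cache_bytes(32, 8, 128, context_len)
    print(f"context {context_len}: {size / GIB:.2f} GiB")
```

Cache size scales linearly with context length, which is why shortening the context is often the quickest way back under the VRAM limit.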

Problem: "Same model is slower on dual than single"

That's the expected behavior for small models that fit on one card. Disable SCHED_SPREAD or use CUDA_VISIBLE_DEVICES=0 for those models.

Problem: "Random hangs or restarts during multi-GPU inference"

PCIe stability issue. Usually one of:

Power supply marginal — 2× 1080 Ti pulls ~500 W under load, needs a quality 850 W+ PSU
PCIe riser cable (if used) is bad — try a different one
Mining-recovered card with degraded PCIe contacts — try a different card

Run dmesg | grep -i pcie after a hang. If you see pcieport: AER: Multiple Corrected error received, that's a hardware/cable issue.
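
To track whether a cable or slot swap actually helped, you can count these messages before and after. A minimal sketch, assuming you have saved the dmesg output to a string; the exact wording of AER lines differs between kernel versions, so the pattern is deliberately loose, and the sample log is illustrative.

```python
import re

# Matches lines such as "pcieport: AER: Multiple Corrected error received".
AER_PATTERN = re.compile(r"AER:.*Corrected error", re.IGNORECASE)

def count_aer_corrected(dmesg_text):
    """Count log lines that report corrected PCIe AER errors."""
    return sum(1 for line in dmesg_text.splitlines() if AER_PATTERN.search(line))

sample_log = """pcieport: AER: Multiple Corrected error received
nvidia: loading out-of-tree module taints kernel
pcieport: AER: Corrected error received
"""

print(count_aer_corrected(sample_log))
```

A count that keeps climbing during inference points at the link rather than at Ollama.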

Problem: "Ollama works but `OLLAMA_SCHED_SPREAD` doesn't seem to take effect"

Check Ollama logs (journalctl -u ollama -f) — sometimes env vars don't propagate when set via the wrong systemd override. Test directly:

sudo systemctl stop ollama
OLLAMA_SCHED_SPREAD=1 ollama serve

Then run a large model in another terminal and watch nvidia-smi.

When the Dual Setup Is Worth It

Dual 1080 Ti makes sense if:

You already own both cards (zero hardware cost)
You want to occasionally run Mixtral 8×7B or Yi-34B class models
Your main use is 8B fast (single card) and 30B for harder tasks (dual)

Dual 1080 Ti is not worth pursuing if:

You'd have to buy a second 1080 Ti now (used $200+) — buying a 4060 Ti 16 GB single card is faster for less power
You only use models that fit on one card (8B class in the benchmarks above), where a single card is faster
Your PSU is borderline (need 850 W+)
Your motherboard only gives x4 to the second slot

How This Compares to a Single 3090

Same Mixtral 8×7B Q4_K_M:

Setup	tokens/sec	VRAM used	Power
2× 1080 Ti (this guide)	14-18	21 GB combined	~470 W
1× RTX 3090 24 GB	55-70	21 GB	~340 W

A single 3090 is roughly 4× faster (3-5× across the measured ranges) at about 72% of the power. For new purchases, the 3090 is the clear win. Dual 1080 Ti only wins when the cards are already a sunk cost.
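
Dividing power by throughput turns the table into energy per generated token, which makes the efficiency gap concrete. The sketch below uses the midpoint of each tokens/sec range, a simplification, and the power figures from the table.

```python
def midpoint(low, high):
    """Middle of a tokens/sec range."""
    return (low + high) / 2

def joules_per_token(watts, tokens_per_sec):
    """Energy spent per generated token (1 W = 1 J/s)."""
    return watts / tokens_per_sec

dual_1080ti = joules_per_token(470, midpoint(14, 18))
single_3090 = joules_per_token(340, midpoint(55, 70))

print(f"2x 1080 Ti: {dual_1080ti:.1f} J/token")
print(f"1x RTX 3090: {single_3090:.1f} J/token")
print(f"ratio: {dual_1080ti / single_3090:.1f}x")
```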

FAQ

Q: Does NVLink help on 1080 Ti? 1080 Ti doesn't support NVLink — that's a Quadro / Tesla / RTX 3090 feature. Pascal consumer cards have SLI (deprecated for compute) but no high-bandwidth GPU-to-GPU link. All transfers go through PCIe.

Q: Can I mix a 1080 Ti and a different GPU? Yes, but Ollama splits by available memory per card. Pairing a 1080 Ti with a 3060 12 GB is mostly limited by the slower/smaller card. Pairing a 1080 Ti with a 3090 24 GB — the 3090 alone will outperform the pair for any model fitting in 24 GB.

Q: Does CPU matter much? For inference: not much. Any modern 8-core CPU is fine. Where CPU matters: tokenization and sampling (both light), model loading, and CPU offload for models that don't fit in VRAM.

Q: What about tensor_split parameter directly in the Modelfile? Ollama doesn't expose tensor_split as a first-class option — it's hidden behind SCHED_SPREAD. If you need fine control, drop to llama.cpp directly:

llama-cli -m model.gguf -ngl 99 --split-mode layer --tensor-split 11,11 ...

See llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition for the fine-grained version.
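
The values passed to --tensor-split are proportions, not gigabytes, so 11,11 means an even split. The sketch below shows how such proportions could translate into per-GPU layer counts. It is an illustration of the idea only; `layers_per_gpu` is a hypothetical helper and llama.cpp's own rounding may differ.

```python
def layers_per_gpu(n_layers, proportions):
    """Divide `n_layers` among GPUs according to relative proportions."""
    total = sum(proportions)
    counts = [int(n_layers * p / total) for p in proportions]
    # Rounding down can leave a few layers unassigned; give them to the
    # GPUs with the largest shares first.
    leftover = n_layers - sum(counts)
    by_share = sorted(range(len(proportions)), key=lambda i: proportions[i], reverse=True)
    for i in by_share[:leftover]:
        counts[i] += 1
    return counts

print(layers_per_gpu(32, [11, 11]))  # even split across two equal cards
print(layers_per_gpu(60, [11, 7]))   # uneven split favouring the first card
```

Skewing the proportions toward the card on the faster slot, or away from the card that also drives a display, is the kind of fine control Ollama does not expose.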

Q: Will Ollama support tensor parallelism in the future? Ollama's multi-GPU is currently "layer split" — partitioning model layers across GPUs. True tensor parallelism (splitting within each layer's matrix multiplies) is more complex; vLLM does it but Ollama doesn't. For 1080 Ti this is moot since vLLM doesn't support Pascal.
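
The difference between the two strategies is easier to see in a toy model. The sketch below is a deliberately simplified illustration with tiny integer matrices and no activation functions, not how Ollama or vLLM are implemented. With a layer split each "device" runs its block of layers in turn, while with tensor parallelism every layer's rows are divided among devices and the partial results are gathered.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def layer_split(layers, x, n_devices=2):
    """Each device owns a contiguous block of layers; devices run in sequence."""
    per_device = len(layers) // n_devices
    for device in range(n_devices):
        start = device * per_device
        end = len(layers) if device == n_devices - 1 else start + per_device
        for weights in layers[start:end]:  # only this device is busy here
            x = matvec(weights, x)
    return x

def tensor_parallel(layers, x, n_devices=2):
    """Each layer's rows are sharded across devices, then gathered."""
    for weights in layers:
        rows_per = (len(weights) + n_devices - 1) // n_devices
        shards = [weights[i:i + rows_per] for i in range(0, len(weights), rows_per)]
        partials = [matvec(shard, x) for shard in shards]  # could run concurrently
        x = [value for part in partials for value in part]  # gather step
    return x

layers = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
x = [1, 1]
print(layer_split(layers, x), tensor_parallel(layers, x))
```

Both produce the same result. The difference is that the layer split leaves one device idle at any moment for a single request, while tensor parallelism keeps both busy at the cost of a gather after every layer.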

Closing — One-Line Summary

For 2× GTX 1080 Ti without NVLink: set OLLAMA_SCHED_SPREAD=1 for 13B+ models, disable it (single card) for 8B, and accept that PCIe lane configuration is your single biggest performance variable.
