llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition (1080 Ti, 2080, P40)
When llama.cpp's --split-mode row beats layer on dual-GPU inference, when layer is faster, and why the answer is different on Pascal/Turing without NVLink than on Ampere with NVLink. Real benchmarks on 2× GTX 1080 Ti for Mixtral, Yi-34B, Llama 3.1 13B, with PCIe lane and tensor split notes.
Why This Question Keeps Coming Up
llama.cpp gives you two ways to split a model across multiple GPUs:
--split-mode layer # partition by layers (default)
--split-mode row # partition by rows within each layer's weight matrices
The default layer works fine on most setups, but row shows up in LocalLLaMA discussions as "actually faster on my setup." Each claim holds on some hardware and fails on others, depending on GPU architecture, NVLink presence, and PCIe configuration.
This guide is the empirical answer for the case that gets less love online: old consumer GPUs (Pascal GTX 1080 Ti, Pascal P40, Turing RTX 2080), no NVLink, mixed PCIe lanes. It's based on benchmarks from a 2× 1080 Ti rig and cross-checked against community results for P40 and 2080 Ti.
The Difference Between Layer Split and Row Split
Layer split (--split-mode layer, default)
GPU 0: layers 0-15
GPU 1: layers 16-31
PCIe transfer: one hidden-state vector per token, only where consecutive layers sit on different GPUs
- Compute pattern: each GPU runs its assigned layers fully, then hands a single hidden-state vector to the next GPU
- Bandwidth need: low, one ~4-12 KB tensor per GPU-to-GPU crossing during generation
- Synchronization: minimal — pipeline parallelism
- Best when: PCIe bandwidth between GPUs is limited (no NVLink, mixed lanes)
Row split (--split-mode row)
GPU 0: rows 0..N/2 of each layer's matrices
GPU 1: rows N/2..N of each layer's matrices
PCIe transfer: full activation tensor at every all-reduce point
- Compute pattern: every layer runs on both GPUs simultaneously; each GPU computes part of the output and the partial results are combined across GPUs (the all-reduce step)
- Bandwidth need: HIGH — full activation tensors transferred frequently
- Synchronization: heavy — both GPUs must finish each partial matmul before continuing
- Best when: GPUs have NVLink (or NVSwitch) for fast peer-to-peer transfers, and tensor cores make the per-GPU matmuls fast
The key insight: row split halves the matmul work each GPU does per layer, but adds inter-GPU traffic at every layer. On NVLink-equipped data center GPUs (A100, H100) the trade is favorable. On PCIe-only consumer GPUs, the math usually goes the other way.
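The ~4-12 KB figure for layer split follows from the size of one hidden-state vector: hidden size times bytes per element. A minimal sketch of that arithmetic; the helper name hidden_state_bytes and the hidden sizes 4096 and 6144 are illustrative assumptions, not values taken from the benchmarked models.

```shell
#!/usr/bin/env bash
# Hypothetical helper: bytes in one hidden-state vector.
# $1 = hidden size (elements), $2 = bytes per element (2 for FP16, 4 for FP32)
hidden_state_bytes() {
  local hidden_size=$1 bytes_per_elem=$2
  echo $(( hidden_size * bytes_per_elem ))
}

hidden_state_bytes 4096 2   # 8192  (8 KB, assumed hidden size 4096 in FP16)
hidden_state_bytes 6144 2   # 12288 (12 KB, assumed hidden size 6144 in FP16)
```

During generation, layer split moves one such vector per token at the GPU boundary, which is why its bandwidth need stays low.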
Setup
| Component | Spec |
|---|---|
| GPU 1 | GTX 1080 Ti 11 GB, PCIe 3.0 x16 |
| GPU 2 | GTX 1080 Ti 11 GB, PCIe 3.0 x8 |
| CPU | i7-9700K |
| RAM | 64 GB DDR4-3200 |
| llama.cpp | b3500+ build, CUDA backend |
| Driver | NVIDIA 555.x |
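Build llama.cpp with CUDA (the Makefile route works on b3500-era builds; newer checkouts build with CMake instead):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make GGML_CUDA=1 -j8
# CMake equivalent for newer checkouts:
# cmake -B build -DGGML_CUDA=ON
# cmake --build build --config Release -j 8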
Benchmarks — Layer vs Row Split (Dual 1080 Ti)
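Command pattern (llama-bench separates --tensor-split values with /, because commas list alternative values to test):
./llama-bench -m model.gguf \
-ngl 99 \
--split-mode layer \
--tensor-split 11/11 \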
-p 512 -n 128
(Replace layer with row for the second column.)
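Both modes can also be compared in a single run, since llama-bench accepts comma-separated lists for its test parameters. A small sketch, assuming a reasonably recent llama-bench: bench_cmd is a hypothetical helper that only prints the command so it can be inspected first. Note the / separator in -ts, which keeps the tensor split from being read as two separate test values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a llama-bench command that tests both split modes.
# $1 = model path, $2 = tensor split in llama-bench syntax (default 11/11)
bench_cmd() {
  local model=$1 ts=${2:-11/11}
  echo "./llama-bench -m $model -ngl 99 -sm layer,row -ts $ts -p 512 -n 128 -r 5 -o md"
}

bench_cmd model.gguf                 # print the command for inspection
# eval "$(bench_cmd model.gguf)"     # uncomment to actually run it
```

With -o md the results come out as a markdown table with one row per split mode and test, which makes the layer and row columns easy to copy side by side.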
Llama 3.1 13B Q4_K_M (~7.5 GB)
| split-mode | Prompt eval t/s | Generation t/s | Total time (512 prompt + 128 gen) |
|---|---|---|---|
| layer | 195 | 14.2 | 6.9 s |
| row | 142 | 11.0 | 9.4 s |
Layer wins. PCIe bandwidth between the cards isn't enough to absorb row's all-reduce overhead.
Mixtral 8×7B Q4_K_M (~21 GB)
| split-mode | Prompt eval t/s | Generation t/s | Total time |
|---|---|---|---|
| layer | 158 | 16.4 | 6.8 s |
| row | 138 | 14.1 | 7.9 s |
Mixtral is MoE — each token activates only ~13B params, so per-layer compute is modest. Layer split still wins on Pascal without NVLink, but the gap is smaller because there's less per-layer work being parallelized.
Yi-34B Q4_K_M (~20 GB)
| split-mode | Prompt eval t/s | Generation t/s | Total time |
|---|---|---|---|
| layer | 92 | 9.1 | 11.3 s |
| row | 76 | 7.5 | 13.6 s |
Dense 34B has heavier per-layer work, so row should benefit more — and the gap closes — but layer still wins. On Pascal, layer split is the right default in all tested cases.
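To put the three results on a common scale, the layer-over-row advantage can be expressed as a percentage of the row figure. A minimal sketch using the generation t/s numbers from the tables above; gap_pct is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: how much faster layer is than row, in percent.
# $1 = layer tokens/s, $2 = row tokens/s
gap_pct() {
  awk -v layer="$1" -v row="$2" 'BEGIN { printf "%.1f\n", (layer / row - 1) * 100 }'
}

gap_pct 14.2 11.0   # 29.1  (13B, generation)
gap_pct 16.4 14.1   # 16.3  (Mixtral 8x7B, generation)
gap_pct 9.1 7.5     # 21.3  (Yi-34B, generation)
```

The same function works on the prompt eval columns if you want to compare prefill rather than generation.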
Where row might win — single 8B on both cards
./llama-bench -m llama-3.1-8b-q4.gguf -ngl 99 --split-mode row --tensor-split 11/11
Even here, on dual 1080 Ti without NVLink, layer beats row by 10-15%. The model is small enough that PCIe overhead dominates regardless of mode.
Why Layer Wins on Pascal/PCIe-Only
Two compounding reasons:
- No tensor cores: Pascal has no tensor cores at all, and GP102 cards (1080 Ti, P40) run native FP16 at a small fraction of their FP32 rate, so matmuls fall back to the regular FP32 CUDA cores. Row split's per-GPU matmul speedup (normally from parallelizing across tensor cores) doesn't materialize.
- PCIe bandwidth is small relative to compute time: layer split sends ~4-12 KB per GPU-to-GPU transition. Row split sends ~hundreds of KB to MB per all-reduce. Without NVLink (tens to hundreds of GB/s, depending on generation), PCIe 3.0 x8 (~8 GB/s) is the bottleneck.
Put both effects together: row split's potential benefit (parallel matmul) is small, its cost (more PCIe traffic) is large. Layer wins.
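A back-of-the-envelope estimate of raw transfer time makes the size difference concrete. This is only a sketch: transfer_us is a hypothetical helper, the 8 GB/s figure is the PCIe 3.0 x8 number quoted above, the 8 KB and 1 MB payloads are illustrative picks from the ranges in the text, and per-transfer latency and synchronization stalls (which often matter more than raw bandwidth) are ignored.

```shell
#!/usr/bin/env bash
# Hypothetical helper: microseconds to move a payload over a link,
# counting bandwidth only (1 GB/s = 1e9 bytes/s, latency ignored).
# $1 = payload in bytes, $2 = link bandwidth in GB/s
transfer_us() {
  awk -v bytes="$1" -v gbps="$2" 'BEGIN { printf "%.2f\n", bytes / (gbps * 1e9) * 1e6 }'
}

transfer_us 8192 8       # 1.02    (8 KB hidden state, layer split)
transfer_us 1048576 8    # 131.07  (1 MB payload, row split)
```

Layer split pays the small cost once per token at the GPU boundary, while row split pays the larger one at every synchronization point in every layer, so the per-token totals diverge quickly.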
When Row Would Win (Other Hardware)
The mode matters more on different hardware:
| Hardware | Row vs Layer (large dense model) |
|---|---|
| 2× A100 + NVLink | Row often wins (~10-20%) |
| 2× RTX 3090 + NVLink bridge | Row sometimes wins |
| 2× RTX 4090 (no NVLink supported) | Layer wins |
| 2× RTX 3090 PCIe only | Layer wins (~5-10%) |
| 2× P40 (Pascal, PCIe x8) | Layer wins, similar to 1080 Ti |
| 2× RTX 2080 (Turing, no NVLink bridge installed) | Layer wins |
General rule: row needs NVLink to be worth it. If you can't see NVLink in nvidia-smi nvlink --status, default to layer.
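That rule can be scripted as a first-pass check. A rough sketch, assuming active links appear in the nvidia-smi nvlink --status output as lines like "Link 0: 14.062 GB/s"; that line format and the helper name suggest_split_mode are assumptions, so check them against your driver's actual output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `nvidia-smi nvlink --status` output on stdin.
# Prints "row" if at least one active link line is present (row split is
# worth benchmarking), otherwise "layer".
suggest_split_mode() {
  if grep -Eq 'Link [0-9]+: *[0-9.]+ *GB/s'; then
    echo row
  else
    echo layer
  fi
}

# Live usage (requires an NVIDIA driver):
# nvidia-smi nvlink --status | suggest_split_mode
```

A "row" result only means row split is worth testing; the benchmark still decides.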
Tensor Split — Allocating Layers Per GPU
The --tensor-split flag controls how layers (or rows) are distributed:
--tensor-split 11,11 # equal 50/50 split
--tensor-split 12,10 # about 55% on GPU 0, 45% on GPU 1
--tensor-split 1,0 # all on GPU 0
Numbers are relative weights, not absolute GB. 11,11 means "split equally"; 15,7 means "give GPU 0 about twice as many layers as GPU 1."
Use this when:
- One GPU is on a faster PCIe slot (give it more)
- One GPU is also driving the display (give it less, leave room for X server/Wayland)
- One GPU is slower (give it fewer layers)
For asymmetric 1080 Ti on x16/x8 lanes, slight asymmetry helps:
--tensor-split 12,10
About 3-5% gain in our tests for 30B-class models. Negligible for 8B.
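Because the values are relative weights, it can help to convert a split into percentages before running. A minimal sketch; split_pct is a hypothetical helper, and it accepts either the comma form used by llama-cli or the slash form used by llama-bench.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert tensor-split weights into percentage shares.
# $1 = weights separated by "," or "/", e.g. 12,10
split_pct() {
  awk -v ts="$1" 'BEGIN {
    n = split(ts, w, "[,/]")
    for (i = 1; i <= n; i++) total += w[i]
    for (i = 1; i <= n; i++) printf "%s%.1f", (i > 1 ? " " : ""), 100 * w[i] / total
    print ""
  }'
}

split_pct 11,11   # 50.0 50.0
split_pct 12,10   # 54.5 45.5
split_pct 15,7    # 68.2 31.8
```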
Sanity Checks Before Benchmarking
1. Is the model actually splitting?
# during inference, in another terminal:
nvidia-smi --query-gpu=index,memory.used --format=csv
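For a 21 GB model on 2× 1080 Ti, expect ~10-11 GB on each card. If one card is near 11 GB and the other near 0 GB, the model is not being split: check that --split-mode is not none, that --tensor-split does not zero out a GPU, and that CUDA_VISIBLE_DEVICES exposes both cards. (A too-low -ngl shows up differently: layers stay on the CPU and both cards hold less than expected.)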
2. Are you using all GPU layers?
-ngl 99 # offload up to 99 layers — effectively all
If -ngl is less than the model's total layer count, the remaining layers run on the CPU, and that dominates the timing.
3. Is the model in GPU memory before benchmarking?
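The first inference after model load takes longer (warmup). Run a short prompt to warm the cache, then run llama-bench. llama-bench repeats each test (-r sets the count, default 5) and reports the mean and standard deviation, so use -r 3 or higher rather than trusting a single run.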
4. Is anything else using GPU compute?
nvidia-smi --query-compute-apps=pid,process_name --format=csv
A stray Stable Diffusion process or a desktop compositor will skew results. Kill them before benchmarking.
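The first check can be automated by parsing the memory query. A small sketch, assuming the csv,noheader,nounits output format of nvidia-smi (lines such as "0, 10432", in MiB); idle_gpus and the 1024 MiB threshold are hypothetical choices.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "index, used_mib" lines on stdin and prints the
# index of every GPU holding less than the threshold, i.e. a card that
# received no model weights.
# $1 = threshold in MiB (default 1024)
idle_gpus() {
  local min_mib=${1:-1024}
  awk -F', *' -v min="$min_mib" '$2 + 0 < min { print $1 }'
}

# Live usage while the model is loaded (requires nvidia-smi):
# nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | idle_gpus
```

Empty output means every card holds at least the threshold; any printed index is a GPU the split skipped.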
Practical Recommendation — Dual 1080 Ti Settings
For day-to-day use of llama.cpp on 2× GTX 1080 Ti without NVLink:
./llama-cli -m model.gguf \
--threads 8 \
--ctx-size 4096 \
--batch-size 512 \
-ngl 99 \
--split-mode layer \
--tensor-split 12,10 \
--flash-attn \
--temp 0.7 \
-p "your prompt here"
Key flags:
- --split-mode layer: proven best on this hardware
- --tensor-split 12,10: slight asymmetry compensates for the x8 second slot
- --flash-attn: still helps on Pascal (memory locality), even without true Flash Attention 2
- --ctx-size 4096: leaves headroom for KV cache; bump if you have memory budget
If you're running Ollama instead, see the Ollama Dual 1080 Ti guide — Ollama abstracts most of this and the trade-offs are similar.
FAQ
Q: My P40 dual setup, same conclusions?
Yes. The P40 is Pascal (GP102), the same chip family as the 1080 Ti, and shows the same layer-wins-over-row pattern. The P40's 24 GB per card opens up 70B Q4 territory, but the inter-GPU mode preference is identical.
Q: 2× RTX 2080 Ti, does Turing change the answer?
Turing has tensor cores, and the RTX 2080 and 2080 Ti do accept an NVLink bridge, but most dual-card rigs run without one. Without a bridge, row split's per-GPU compute advantage is real but PCIe overhead still wins, so layer is faster. RTX 2080 Ti users typically see a 5-10% gap, smaller than on Pascal.
Q: Mixed-architecture pair (e.g., 1080 Ti + 3060)?
Layer split, definitely. Use --tensor-split to weight the faster card more (e.g., --tensor-split 8,11 to put more on the 3060 when it is the second device). Row split works best with matched GPUs, so a mixed pair is a poor fit for it anyway.
Q: Does --split-mode none work?
That's "no split, use only first GPU." Useful when you want to test single-card baseline.
Q: Why is --flash-attn even relevant on Pascal?
llama.cpp's implementation of flash-attention isn't full Flash Attention 2 (which needs Ampere+). The Pascal-compatible version still improves memory locality and reduces KV cache reads. ~10-15% speedup on long contexts.
Q: How do I see if NVLink is active?
nvidia-smi nvlink --status
If you see per-link speeds listed, NVLink is connected. On a 1080 Ti no links are reported, because Pascal consumer cards never had NVLink.
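Q: Pipeline parallelism, is it different from layer split?
llama.cpp has no --pipeline-parallel-size flag (that name comes from other inference servers such as vLLM). In llama.cpp, pipeline parallelism is part of layer split: with multiple GPUs, prompt batches are overlapped across the layer stages, and the number of in-flight copies is a build-time setting (GGML_SCHED_MAX_COPIES). It mainly helps prompt processing; single-stream generation still runs the GPUs one after another. It matters more for 4+ GPU setups.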
Closing — TL;DR
On 2× GTX 1080 Ti (or any Pascal/Turing consumer pair without NVLink), always use --split-mode layer for multi-GPU inference. Row split only wins on hardware with NVLink and tensor cores. Asymmetric PCIe lanes (x16/x8 typical) make layer's preference even stronger. Combine with --tensor-split 12,10 to compensate for slot asymmetry.
Related posts:
- Running Modern LLMs on GTX 1080 Ti in 2026 — What Still Works
- Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti
- Best Ollama Models for RTX 3090 24GB in 2026: Real Benchmarks
- Home AI Server Build Guide 2026
References:
- llama.cpp source: https://github.com/ggerganov/llama.cpp
- llama.cpp discussions on split modes: https://github.com/ggerganov/llama.cpp/discussions/4541
- NVIDIA Pascal whitepaper, 2017
- LocalLLaMA multi-GPU threads, 2024-2026