llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition (1080 Ti, 2080, P40)

Q: Mixed-architecture pair (e.g., 1080 Ti + 3060)?

Layer split, definitely. Use `--tensor-split` to weight the faster card more (e.g., `--tensor-split 8,11` to put more on the 11 GB 3060). Mixing architectures forces layer mode anyway — row requires symmetric compute.

Q: How do I see if NVLink is active?

```bash nvidia-smi nvlink --status ``` If you see actual link info, NVLink is connected. On 1080 Ti you'll see "GPU does not support NVLink" — confirming Pascal consumer cards never had it.

Q: Pipeline parallelism via `--pipeline` parallel — different from layer split?

`--pipeline-parallel-size` in llama.cpp is a finer-grained version of layer split that overlaps compute across stages. On dual GPU it's effectively layer split. The flag matters more for 4+ GPU setups.

llama.cpp split mode benchmark

Why This Question Keeps Coming Up

llama.cpp gives you two ways to split a model across multiple GPUs:

--split-mode layer   # partition by layers (default)
--split-mode row     # partition by rows within each layer's weight matrices

The default layer works fine on most setups, but row shows up in LocalLLaMA discussions as "actually faster on my setup." Both claims are right and wrong depending on architecture, NVLink presence, and PCIe configuration.

This guide is the empirical answer for the case that gets less love online: old consumer GPUs (Pascal GTX 1080 Ti, Pascal P40, Turing RTX 2080), no NVLink, mixed PCIe lanes. It's based on benchmarks from a 2× 1080 Ti rig and cross-checked against community results for P40 and 2080 Ti.

The Difference Between Layer Split and Row Split

Layer split (`--split-mode layer`, default)

GPU 0: layers 0-15
GPU 1: layers 16-31
PCIe transfer: hidden state (one vector) per layer boundary

Compute pattern: each GPU runs its assigned layers fully, then hands a single hidden-state vector to the next GPU
Bandwidth need: low — one ~4-12 KB tensor per layer crossing
Synchronization: minimal — pipeline parallelism
Best when: PCIe bandwidth between GPUs is limited (no NVLink, mixed lanes)

Row split (`--split-mode row`)

GPU 0: rows 0..N/2 of each layer's matrices
GPU 1: rows N/2..N of each layer's matrices
PCIe transfer: full activation tensor at every all-reduce point

Compute pattern: every layer runs on both GPUs simultaneously; results are concatenated via all-reduce
Bandwidth need: HIGH — full activation tensors transferred frequently
Synchronization: heavy — both GPUs must finish each partial matmul before continuing
Best when: GPUs have NVLink (or NVSwitch) for fast peer-to-peer transfers, AND tensor cores allow fast matmuls

The key insight: row split makes each GPU faster per layer, but adds inter-GPU traffic. On NVLink-equipped data center GPUs (A100, H100) the trade is favorable. On PCIe-only consumer GPUs, the math usually goes the other way.

Setup

Component	Spec
GPU 1	GTX 1080 Ti 11 GB, PCIe 3.0 x16
GPU 2	GTX 1080 Ti 11 GB, PCIe 3.0 x8
CPU	i7-9700K
RAM	64 GB DDR4-3200
llama.cpp	b3500+ build, CUDA backend
Driver	NVIDIA 555.x

Build llama.cpp with CUDA:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make GGML_CUDA=1 -j8

Benchmarks — Layer vs Row Split (Dual 1080 Ti)

Command pattern:

./llama-bench -m model.gguf \
  -ngl 99 \
  --split-mode layer \
  --tensor-split 11,11 \
  -p 512 -n 128

(Replace layer with row for the second column.)

Llama 3.1 13B Q4_K_M (~7.5 GB)

split-mode	Prompt eval t/s	Generation t/s	Total time (512 prompt + 128 gen)
layer	195	14.2	6.9 s
row	142	11.0	9.4 s

Layer wins. Pcie bandwidth between cards isn't enough for row's all-reduce overhead.

Mixtral 8×7B Q4_K_M (~21 GB)

split-mode	Prompt eval t/s	Generation t/s	Total time
layer	158	16.4	6.8 s
row	138	14.1	7.9 s

Mixtral is MoE — each token activates only ~13B params, so per-layer compute is modest. Layer split still wins on Pascal without NVLink, but the gap is smaller because there's less per-layer work being parallelized.

Yi-34B Q4_K_M (~20 GB)

split-mode	Prompt eval t/s	Generation t/s	Total time
layer	92	9.1	11.3 s
row	76	7.5	13.6 s

Dense 34B has heavier per-layer work, so row should benefit more — and the gap closes — but layer still wins. On Pascal, layer split is the right default in all tested cases.

Where row might win — single 8B on both cards

./llama-bench -m llama-3.1-8b-q4.gguf -ngl 99 --split-mode row --tensor-split 11,11

Even here, on dual 1080 Ti without NVLink, layer beats row by 10-15%. The model is small enough that PCIe overhead dominates regardless of mode.

Why Layer Wins on Pascal/PCIe-Only

Two compounding reasons:

No FP16 tensor cores: Pascal computes FP16 and FP32 at the same speed. Row split's per-GPU matmul speedup (normally from parallelizing across tensor cores) doesn't materialize.
PCIe bandwidth is small relative to compute time: layer split sends ~4-12 KB per layer transition. Row split sends ~hundreds of KB to MB per all-reduce. Without NVLink (50 GB/s), PCIe 3.0 x8 (~8 GB/s) is the bottleneck.

Put both effects together: row split's potential benefit (parallel matmul) is small, its cost (more PCIe traffic) is large. Layer wins.

When Row Would Win (Other Hardware)

The mode matters more on different hardware:

Hardware	Row vs Layer (large dense model)
2× A100 + NVLink	Row often wins (~10-20%)
2× RTX 3090 + NVLink bridge	Row sometimes wins
2× RTX 4090 (no NVLink supported)	Layer wins
2× RTX 3090 PCIe only	Layer wins (~5-10%)
2× P40 (Pascal, PCIe x8)	Layer wins, similar to 1080 Ti
2× RTX 2080 (Turing, no NVLink)	Layer wins

General rule: row needs NVLink to be worth it. If you can't see NVLink in nvidia-smi nvlink --status, default to layer.

Tensor Split — Allocating Layers Per GPU

The --tensor-split flag controls how layers (or rows) are distributed:

--tensor-split 11,11   # equal 50/50 split
--tensor-split 12,10   # 12 GB virtual budget GPU 0, 10 GB GPU 1
--tensor-split 1,0     # all on GPU 0

Numbers are relative weights, not absolute GB. 11,11 means "split equally"; 15,7 means "give GPU 0 about twice as many layers as GPU 1."

Use this when:

One GPU is on a faster PCIe slot (give it more)
One GPU is also driving the display (give it less, leave room for X server/Wayland)
One GPU is slower (give it fewer layers)

For asymmetric 1080 Ti on x16/x8 lanes, slight asymmetry helps:

--tensor-split 12,10

About 3-5% gain in our tests for 30B-class models. Negligible for 8B.

Sanity Checks Before Benchmarking

1. Is the model actually splitting?

# during inference, in another terminal:
nvidia-smi --query-gpu=index,memory.used --format=csv

For a 21 GB model on 2× 1080 Ti, expect ~10-11 GB on each card. If one is at 11 GB and the other at 0 GB, your --ngl is too low and llama.cpp put everything on GPU 0.

2. Are you using all GPU layers?

-ngl 99   # offload up to 99 layers — effectively all

If -ngl is less than the model's total layer count, the remaining run on CPU and that dominates the timing.

3. Is the model in GPU memory before benchmarking?

The first inference after model load takes longer (warmup). Run a short prompt to warm the cache, then run llama-bench. Or use the -r 3 flag to repeat 3 times and take the median.

4. Is anything else using GPU compute?

nvidia-smi --query-compute-apps=pid,process_name --format=csv

A stray Stable Diffusion process or a desktop compositor will skew results. Kill them before benchmarking.

Practical Recommendation — Dual 1080 Ti Settings

For day-to-day use of llama.cpp on 2× GTX 1080 Ti without NVLink:

./llama-cli -m model.gguf \
  --threads 8 \
  --ctx-size 4096 \
  --batch-size 512 \
  -ngl 99 \
  --split-mode layer \
  --tensor-split 12,10 \
  --flash-attn \
  --temp 0.7 \
  -p "your prompt here"

Key flags:

--split-mode layer — proven best on this hardware
--tensor-split 12,10 — slight asymmetry compensates for x8 second slot
--flash-attn — still helps on Pascal (memory locality), even without true Flash Attention 2
--ctx-size 4096 — leaves headroom for KV cache; bump if you have memory budget

If you're running Ollama instead, see the Ollama Dual 1080 Ti guide — Ollama abstracts most of this and the trade-offs are similar.

FAQ

Q: My P40 dual setup — same conclusions? Yes — P40 is Pascal (GP102), same architecture as 1080 Ti. Same layer-wins-over-row pattern. P40's 24 GB per card opens up 70B Q4 territory, but the inter-GPU mode preference is identical.

Q: 2× RTX 2080 Ti — does Turing change the answer? Turing has tensor cores but no NVLink (consumer Turing). Row split's per-GPU compute advantage is real but PCIe overhead still wins. Layer is faster. RTX 2080 Ti users typically see 5-10% gap, smaller than Pascal.

Q: Mixed-architecture pair (e.g., 1080 Ti + 3060)? Layer split, definitely. Use --tensor-split to weight the faster card more (e.g., --tensor-split 8,11 to put more on the 11 GB 3060). Mixing architectures forces layer mode anyway — row requires symmetric compute.

Q: Does --split-mode none work? That's "no split, use only first GPU." Useful when you want to test single-card baseline.

Q: Why is --flash-attn even relevant on Pascal? llama.cpp's implementation of flash-attention isn't full Flash Attention 2 (which needs Ampere+). The Pascal-compatible version still improves memory locality and reduces KV cache reads. ~10-15% speedup on long contexts.

Q: How do I see if NVLink is active?

nvidia-smi nvlink --status

If you see actual link info, NVLink is connected. On 1080 Ti you'll see "GPU does not support NVLink" — confirming Pascal consumer cards never had it.

Q: Pipeline parallelism via --pipeline parallel — different from layer split? --pipeline-parallel-size in llama.cpp is a finer-grained version of layer split that overlaps compute across stages. On dual GPU it's effectively layer split. The flag matters more for 4+ GPU setups.

Closing — TL;DR

On 2× GTX 1080 Ti (or any Pascal/Turing consumer pair without NVLink), always use --split-mode layer for multi-GPU inference. Row split only wins on hardware with NVLink and tensor cores. Asymmetric PCIe lanes (x16/x8 typical) make layer's preference even stronger. Combine with --tensor-split 12,10 to compensate for slot asymmetry.

Related posts:

References:

llama.cpp source: https://github.com/ggerganov/llama.cpp
llama.cpp discussions on split modes: https://github.com/ggerganov/llama.cpp/discussions/4541
NVIDIA Pascal whitepaper, 2017
LocalLLaMA multi-GPU threads, 2024-2026

llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition (1080 Ti, 2080, P40)

Why This Question Keeps Coming Up

The Difference Between Layer Split and Row Split

Layer split (`--split-mode layer`, default)

Row split (`--split-mode row`)

Setup

Benchmarks — Layer vs Row Split (Dual 1080 Ti)

Llama 3.1 13B Q4_K_M (~7.5 GB)

Mixtral 8×7B Q4_K_M (~21 GB)

Yi-34B Q4_K_M (~20 GB)

Where row might win — single 8B on both cards

Why Layer Wins on Pascal/PCIe-Only

When Row Would Win (Other Hardware)

Tensor Split — Allocating Layers Per GPU

Sanity Checks Before Benchmarking

1. Is the model actually splitting?

2. Are you using all GPU layers?

3. Is the model in GPU memory before benchmarking?

4. Is anything else using GPU compute?

Practical Recommendation — Dual 1080 Ti Settings

FAQ

Closing — TL;DR

관련 글

Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti (Actual Benchmarks)

GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M (2026 Real Quality + Speed)

Running a 35B MoE (Qwen3.6-35B-A3B) on 2× GTX 1080 Ti in 2026 — Real Benchmarks, and Does the Second GPU Actually Help?

4× GTX 1080 Ti for Local LLM in 2026 — 44GB Combined VRAM Build Guide + Real Benchmarks

Why This Question Keeps Coming Up

The Difference Between Layer Split and Row Split

Layer split (--split-mode layer, default)

Row split (--split-mode row)

Setup

Benchmarks — Layer vs Row Split (Dual 1080 Ti)

Llama 3.1 13B Q4_K_M (~7.5 GB)

Mixtral 8×7B Q4_K_M (~21 GB)

Yi-34B Q4_K_M (~20 GB)

Where row might win — single 8B on both cards

Why Layer Wins on Pascal/PCIe-Only

When Row Would Win (Other Hardware)

Tensor Split — Allocating Layers Per GPU

Sanity Checks Before Benchmarking

1. Is the model actually splitting?

2. Are you using all GPU layers?

3. Is the model in GPU memory before benchmarking?

4. Is anything else using GPU compute?

Practical Recommendation — Dual 1080 Ti Settings

FAQ

Closing — TL;DR

관련 글

Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti (Actual Benchmarks)

GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M (2026 Real Quality + Speed)

Running a 35B MoE (Qwen3.6-35B-A3B) on 2× GTX 1080 Ti in 2026 — Real Benchmarks, and Does the Second GPU Actually Help?

4× GTX 1080 Ti for Local LLM in 2026 — 44GB Combined VRAM Build Guide + Real Benchmarks

Layer split (`--split-mode layer`, default)

Row split (`--split-mode row`)