
Running a 35B MoE (Qwen3.6-35B-A3B) on 2× GTX 1080 Ti in 2026 — Real Benchmarks, and Does the Second GPU Actually Help?

I benchmarked Qwen3.6-35B-A3B (IQ4_XS) on a pair of 8-year-old GTX 1080 Ti cards. It runs at ~20 tokens/sec — and the answer to 'does the second GPU help?' is yes, but only ~20% faster, not 2×. Here are the real numbers, the VRAM math, and why a 35B model fits 22 GB at all.



TL;DR (Quick Answer)

I actually ran Qwen3.6-35B-A3B — a 35B-parameter mixture-of-experts model (only 3B active per token) — on a pair of 8-year-old GTX 1080 Ti cards (22 GB combined). Real, measured numbers:

  • Generation speed: ~20 tokens/sec on 2× 1080 Ti (IQ4_XS quant), stable across runs (19.4 / 21.4 / 20.0).
  • Single GPU: ~16.8 tok/s. So the second 1080 Ti buys ~20% more throughput — not 2×. Why: the MoE expert weights stay memory-mapped in CPU RAM either way; the second GPU just lets more of the model live in fast VRAM.
  • It only "fits" because of the MoE + CPU-mmap trick. ~13 GB of the model sits on the two GPUs; ~18 GB of expert weights are mmap'd from CPU RAM, and only the active 3B runs each token.
  • Quant matters for 22 GB: the default qwen3.6:35b-a3b tag is 23.9 GB and spills to CPU. You want IQ4_XS (~17.7 GB) or smaller to keep it (mostly) on the GPUs.

Bottom line: a 35B model is genuinely usable on used, roughly $200 Pascal cards in 2026, as long as it's a sparse MoE and you pick the right quant.

The setup (and one gotcha)

  • GPUs: 2× NVIDIA GeForce GTX 1080 Ti (11 GB each, 22 GB total), Pascal, compute capability 6.1.
  • Driver: 581.57 (Windows host, used via WSL2 passthrough). This matters: recent Ollama bundles CUDA 13, which refuses drivers older than 570. On the older 560-series driver, Ollama silently fell back to CPU (total_vram=0). Updating to 581.57 fixed it.
  • Ollama: v0.30.2. Interesting detail: its cuda_v13 build skips Pascal ("compute capability not in compiled architectures", cc 6.1), so it auto-falls back to the bundled cuda_v12 build to use the 1080 Ti. Good to know if you're on old hardware.
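
If you want to check the driver before blaming Ollama, here is a minimal sketch. It assumes nvidia-smi is on your PATH and uses the 570 threshold quoted above; the helper names (driver_major, driver_ok, query_driver_versions) are made up for illustration and are not part of Ollama or any NVIDIA tool.

```python
import subprocess

# Threshold quoted in this post for Ollama's CUDA 13 build (an assumption
# taken from the text above, not read from Ollama itself).
MIN_DRIVER_MAJOR = 570


def driver_major(version: str) -> int:
    """Return the major component of a driver string such as '581.57'."""
    return int(version.strip().split(".")[0])


def driver_ok(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True if the driver's major version meets the minimum."""
    return driver_major(version) >= minimum


def query_driver_versions() -> list[str]:
    """Ask nvidia-smi for the driver version, one line per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

Calling driver_ok on each entry returned by query_driver_versions() tells you whether a silent CPU fallback is a plausible explanation. It only compares version numbers; it does not confirm which CUDA build Ollama actually loaded.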

Why a "35B" model runs on old cards at all

Qwen3.6-35B-A3B is a mixture-of-experts (MoE): 35B total parameters, but only ~3B are active for any given token. So the compute per token is small (3B-class), even though all the experts must be available in memory.

That's the whole reason this works on Pascal: the GTX 1080 Ti has no tensor cores and modest FP16, so a dense 35B would crawl. A sparse 3B-active MoE keeps the per-token math light, and the bottleneck shifts to where the weights live — which is exactly what the dual-GPU question is about.
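
To put a rough number on "light per-token math", the sketch below estimates how many gigabytes of weights are touched per generated token. It is a back-of-envelope estimate under loud assumptions: the whole 17.7 GB file is treated as weights at one uniform bit width, "35B" and "3B" are taken as exact, and GB means 10^9 bytes. Real GGUF files mix tensor precisions, so treat the output as an order of magnitude only.

```python
TOTAL_PARAMS = 35e9    # total parameters, from the model name
ACTIVE_PARAMS = 3e9    # parameters active per token (the "A3B" part)
FILE_SIZE_GB = 17.7    # IQ4_XS GGUF size quoted in this post


def bits_per_weight(file_gb: float, params: float) -> float:
    """Average bits per weight, treating the whole file as weights."""
    return file_gb * 1e9 * 8 / params


def gb_read_per_token(active_params: float, bpw: float) -> float:
    """Approximate weight data touched per generated token, in GB."""
    return active_params * bpw / 8 / 1e9


bpw = bits_per_weight(FILE_SIZE_GB, TOTAL_PARAMS)
moe_gb = gb_read_per_token(ACTIVE_PARAMS, bpw)    # sparse: only the active 3B
dense_gb = gb_read_per_token(TOTAL_PARAMS, bpw)   # dense: every weight, every token

print(f"{bpw:.2f} bits/weight")
print(f"MoE:   ~{moe_gb:.2f} GB of weights per token")
print(f"Dense: ~{dense_gb:.1f} GB of weights per token")
```

Under these assumptions the MoE touches roughly a twelfth of the weight data a dense model of the same size would, which is why where those weights live matters more than raw FLOPs.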

Quant fit on 22 GB

You can't just ollama pull qwen3.6:35b-a3b — that default is 23.9 GB and won't sit on 22 GB of VRAM. Measured GGUF sizes:

| Quant | Size | Fits 22 GB? |
| --- | --- | --- |
| Q3_K_M | ~16.6–17.1 GB | ✅ comfortable |
| IQ4_XS | ~17.7 GB | ✅ best quality that fits |
| Q4_K_S | ~21 GB | ⚠️ too tight (spills with KV cache) |
| Q4_K_M / default | 23.9 GB+ | ❌ offloads to CPU |

I used IQ4_XS.

Results: single vs dual 1080 Ti

Same model (IQ4_XS), same prompt, num_predict=256, measured via Ollama's /api/generate:

| Config | Generation | Prefill | Model on GPU | Model on CPU (mmap) |
| --- | --- | --- | --- | --- |
| 1× GTX 1080 Ti | ~16.8 tok/s | ~50 tok/s | ~3 GB | ~18 GB+ |
| 2× GTX 1080 Ti | ~20.3 tok/s | ~50 tok/s | ~13 GB (4 + 9.3) | ~18 GB |

  • The second GPU makes generation ~1.2× faster (≈ +20%), not 2×.
  • Under load, the busier card drew up to ~101 W and GPU utilization sat around 26–33%. That is telling: the cards spend much of their time waiting, because the CPU-mmap'd experts are the bottleneck, not raw GPU FLOPs.

The honest answer to "does the second 1080 Ti help?": yes, modestly. It lets roughly 10 GB more of the model live in VRAM instead of CPU RAM (~13 GB versus ~3 GB), which trims the per-token overhead. But because the bulk of the experts stay CPU-side in both configs, you don't get linear scaling. If you were hoping a second card would double your tokens/sec, it won't, at least not for an MoE that overflows your combined VRAM.
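
For anyone who wants to check the arithmetic behind the ~20% figure, this short sketch recomputes it from the three dual-GPU runs and the single-GPU number reported in this post. Nothing here is new data; the variable names are just for illustration.

```python
from statistics import mean

dual_runs_tok_s = [19.4, 21.4, 20.0]   # three dual-GPU runs from the TL;DR
single_tok_s = 16.8                    # single-GPU generation speed

dual_mean = mean(dual_runs_tok_s)
speedup = dual_mean / single_tok_s
gain_pct = (speedup - 1) * 100

print(f"dual mean: {dual_mean:.1f} tok/s")
print(f"speedup:   {speedup:.2f}x ({gain_pct:.0f}% faster)")
```

With only three runs spanning 19.4 to 21.4 tok/s, the gain is best read as "about a fifth", not a precise figure.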

Reproduce it

# (driver must be 570+ for current Ollama; check with: nvidia-smi)
ollama pull hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS

# generate + read the eval rate
curl -s http://127.0.0.1:11434/api/generate -d '{
  "model": "hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS",
  "prompt": "Explain mixture-of-experts in 150 words.",
  "stream": false,
  "options": {"num_predict": 256}
}'
# tokens/sec = eval_count / (eval_duration / 1e9)

To force a single GPU for comparison, start the server with CUDA_VISIBLE_DEVICES=0 ollama serve.
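
If you prefer not to do the division by hand, here is a minimal Python sketch of the same request and calculation. It assumes the default local endpoint used in the curl command and relies on the eval_count, eval_duration, prompt_eval_count and prompt_eval_duration fields of Ollama's /api/generate response (durations are in nanoseconds). The function names are my own.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # same endpoint as the curl call


def rates(resp: dict) -> dict:
    """Convert Ollama's nanosecond timings into tokens/sec.

    Prefill is None when the prompt timing fields are missing or zero.
    """
    gen = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    p_count = resp.get("prompt_eval_count")
    p_dur = resp.get("prompt_eval_duration")
    prefill = p_count / (p_dur / 1e9) if p_count and p_dur else None
    return {"generation_tok_s": gen, "prefill_tok_s": prefill}


def generate(model: str, prompt: str, num_predict: int = 256) -> dict:
    """POST a non-streaming generate request and return the parsed JSON."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)
```

Usage is rates(generate("hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS", "Explain mixture-of-experts in 150 words.")). Run it a few times and discard the first result, since the initial call also includes model load time in the wall-clock wait (though not in eval_duration).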

Honest Limitations

  1. One quant, one model, one box. IQ4_XS on 2× 1080 Ti; your tokens/sec will shift with quant, context length, CPU, and RAM speed.
  2. Prefill measured on a short prompt (~55 tokens) — treat ~50 tok/s as a ballpark; long-context prefill on Pascal will be slower.
  3. IQ4_XS is a ~4-bit quant — fine for chat/drafting, but it's not full-precision quality.
  4. MoE-specific. These conclusions (the modest dual-GPU gain, the CPU-mmap behavior) are about this sparse MoE. A dense model that fully fits VRAM would scale differently across two cards.
  5. A few runs, not a statistical study — numbers are representative, not p-valued.

FAQ

Q: Can a GTX 1080 Ti really run a 35B model in 2026?

A sparse MoE one, yes — Qwen3.6-35B-A3B at IQ4_XS ran ~20 tok/s on two of them. A dense 35B would not be usable. The 3B-active design is what makes it work.

Q: Will a second 1080 Ti double my speed?

No. Here it added ~20%. The MoE experts stay memory-mapped in CPU RAM in both single- and dual-GPU setups, so the second card helps but doesn't scale linearly.

Q: Why did Ollama ignore my GPU until I updated the driver?

Recent Ollama bundles CUDA 13, which requires NVIDIA driver ≥ 570. On an older driver it falls back to CPU silently. Update the driver; Ollama then uses its cuda_v12 build for Pascal cards.

Q: Which quant should I use on 22 GB?

IQ4_XS (~17.7 GB) for the best quality that stays (mostly) on the GPUs; Q3_K_M if you want more headroom for context. Avoid the 23.9 GB default — it spills to CPU.
