Running a 35B MoE (Qwen3.6-35B-A3B) on 2× GTX 1080 Ti in 2026 — Real Benchmarks, and Does the Second GPU Actually Help?
I benchmarked Qwen3.6-35B-A3B (IQ4_XS) on a pair of 8-year-old GTX 1080 Ti cards. It runs at ~20 tokens/sec — and the answer to 'does the second GPU help?' is yes, but only ~20% faster, not 2×. Here are the real numbers, the VRAM math, and why a 35B model fits 22 GB at all.
TL;DR (Quick Answer)
I actually ran Qwen3.6-35B-A3B — a 35B-parameter mixture-of-experts model (only 3B active per token) — on a pair of 8-year-old GTX 1080 Ti cards (22 GB combined). Real, measured numbers:
- Generation speed: ~20 tokens/sec on 2× 1080 Ti (IQ4_XS quant), stable across runs (19.4 / 21.4 / 20.0).
- Single GPU: ~16.8 tok/s. So the second 1080 Ti buys ~20% more throughput — not 2×. Why: the MoE expert weights stay memory-mapped in CPU RAM either way; the second GPU just lets more of the model live in fast VRAM.
- It only "fits" because of the MoE + CPU-mmap trick. ~13 GB of the model sits on the two GPUs; ~18 GB of expert weights are mmap'd from CPU RAM, and only the active 3B runs each token.
- Quant matters for 22 GB: the default qwen3.6:35b-a3b tag is 23.9 GB and spills to CPU. You want ≤ IQ4_XS (~17.7 GB) to keep it (mostly) on the GPUs.
Bottom line: a 35B model is genuinely usable on used-$200 Pascal cards in 2026 — as long as it's a sparse MoE and you pick the right quant.
The setup (and one gotcha)
- GPUs: 2× NVIDIA GeForce GTX 1080 Ti (11 GB each, 22 GB total), Pascal, compute capability 6.1.
- Driver: 581.57 (Windows host, used via WSL2 passthrough). This matters: recent Ollama bundles CUDA 13, which refuses drivers older than 570. On the older 560 driver it silently fell back to CPU (total_vram=0). Updating to 581 fixed it.
- Ollama: v0.30.2. Interesting detail: its cuda_v13 build skips Pascal ("compute capability not in compiled architectures", cc 6.1), so it automatically falls back to the bundled cuda_v12 build to use the 1080 Ti. Good to know if you're on old hardware.
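Because the CPU fallback is silent, it can help to check the driver version before benchmarking. The following is a minimal sketch, not part of the original setup: the helper names are hypothetical, the 570 threshold is the one quoted above, and it assumes nvidia-smi is on the PATH.

```python
import subprocess

MIN_DRIVER_MAJOR = 570  # threshold quoted in this post for current Ollama builds


def driver_major(version: str) -> int:
    """Return the major component of a driver string such as '581.57'."""
    return int(version.strip().split(".")[0])


def driver_is_new_enough(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True when the driver's major version meets the minimum."""
    return driver_major(version) >= minimum


def installed_driver_versions() -> list:
    """Ask nvidia-smi for the driver version, one entry per visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

Passing each string from installed_driver_versions() to driver_is_new_enough() flags a too-old driver before Ollama quietly drops to CPU.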
Why a "35B" model runs on old cards at all
Qwen3.6-35B-A3B is a mixture-of-experts (MoE): 35B total parameters, but only ~3B are active for any given token. So the compute per token is small (3B-class), even though all the experts must be available in memory.
That's the whole reason this works on Pascal: the GTX 1080 Ti has no tensor cores and modest FP16, so a dense 35B would crawl. A sparse 3B-active MoE keeps the per-token math light, and the bottleneck shifts to where the weights live — which is exactly what the dual-GPU question is about.
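To put rough numbers on "light per-token math", here is a back-of-the-envelope sketch. It assumes decimal gigabytes and pretends the file's bits are spread evenly over all weights, which real quants do not do exactly, so treat the output as an order-of-magnitude estimate only.

```python
GB = 1e9  # decimal gigabytes; an assumption about how the sizes are reported

total_params = 35e9           # total parameters (from the model name)
active_params = 3e9           # approximate parameters active per token
file_size_bytes = 17.7 * GB   # IQ4_XS GGUF size quoted in this post

# Average bits per weight if the file size were spread evenly over all weights.
bits_per_weight = file_size_bytes * 8 / total_params

# Rough weight bytes read per generated token under that even-spread assumption.
active_bytes_per_token = active_params * bits_per_weight / 8

print(f"average bits/weight: {bits_per_weight:.2f}")
print(f"weights touched per token: {active_bytes_per_token / GB:.2f} GB")
print(f"active fraction: {active_params / total_params:.1%}")
```

Under these assumptions each token touches only about a tenth of the stored weights, which is why where those weights sit matters more than raw compute.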
Quant fit on 22 GB
You can't just ollama pull qwen3.6:35b-a3b — that default is 23.9 GB and won't sit on 22 GB of VRAM. Measured GGUF sizes:
| Quant | Size | Fits 22 GB? |
|---|---|---|
| Q3_K_M | ~16.6–17.1 GB | ✅ comfortable |
| IQ4_XS | ~17.7 GB | ✅ best quality that fits |
| Q4_K_S | ~21 GB | ⚠️ too tight (spills with KV cache) |
| Q4_K_M / default | 23.9 GB+ | ❌ offloads to CPU |
I used IQ4_XS.
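The fit column above can be approximated with a simple budget check. This is a sketch under stated assumptions: the 2 GB headroom for KV cache and buffers is a hypothetical figure chosen for illustration, the sizes are the upper ends from the table, and file size stands in as a crude proxy for quality.

```python
from typing import Optional

VRAM_GB = 22.0         # 2 x 11 GB
KV_HEADROOM_GB = 2.0   # hypothetical reserve for KV cache and buffers

# Upper-end sizes from the table above, in GB.
QUANT_SIZES_GB = {
    "Q3_K_M": 17.1,
    "IQ4_XS": 17.7,
    "Q4_K_S": 21.0,
    "Q4_K_M": 23.9,
}


def fits(size_gb: float, vram_gb: float = VRAM_GB,
         headroom_gb: float = KV_HEADROOM_GB) -> bool:
    """True when the file plus the reserved headroom stays within VRAM."""
    return size_gb + headroom_gb <= vram_gb


def largest_fitting_quant(sizes: dict) -> Optional[str]:
    """Pick the biggest quant that still fits, or None if nothing does."""
    candidates = {name: size for name, size in sizes.items() if fits(size)}
    return max(candidates, key=candidates.get) if candidates else None


print(largest_fitting_quant(QUANT_SIZES_GB))
```

With a longer context the real headroom grows, so raising KV_HEADROOM_GB is the knob to turn if you plan to use large context windows.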
Results: single vs dual 1080 Ti
Same model (IQ4_XS), same prompt, num_predict=256, measured via Ollama's /api/generate:
| Config | Generation | Prefill | Model on GPU | Model on CPU (mmap) |
|---|---|---|---|---|
| 1× GTX 1080 Ti | ~16.8 tok/s | ~50 tok/s | ~3 GB | ~18 GB+ |
| 2× GTX 1080 Ti | ~20.3 tok/s | ~50 tok/s | ~13 GB (4 + 9.3) | ~18 GB |
- The dual-GPU setup is ~1.2× as fast as the single card (≈ +20%), not 2×.
- Under load, the busier card drew up to ~101 W and GPU utilization sat around 26–33%. That is telling: the cards spend much of their time waiting, because the CPU-mmap'd experts are the bottleneck, not raw GPU FLOPs.
The honest answer to "does the second 1080 Ti help?": yes, modestly. It lets ~10 GB more of the model live in VRAM instead of CPU RAM (~13 GB versus ~3 GB in the table above), which trims the per-token overhead. But because the bulk of the experts stay CPU-side in both configs, you don't get linear scaling. If you were hoping a second card would double your tokens/sec, it won't, at least not for an MoE that overflows your combined VRAM.
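For reference, the speedup and scaling efficiency follow directly from the two generation rates in the table. A minimal sketch using only those measured values:

```python
single_gpu_tps = 16.8   # 1x GTX 1080 Ti, from the table above
dual_gpu_tps = 20.3     # 2x GTX 1080 Ti, from the table above
num_gpus = 2

speedup = dual_gpu_tps / single_gpu_tps
gain_pct = (speedup - 1) * 100
# Scaling efficiency: measured speedup relative to ideal linear scaling.
efficiency = speedup / num_gpus

print(f"speedup: {speedup:.2f}x")
print(f"gain: {gain_pct:.0f}%")
print(f"scaling efficiency: {efficiency:.0%}")
```

That works out to roughly 1.21× and about 60% scaling efficiency, in line with the "~20%, not 2×" summary.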
Reproduce it
# (driver must be 570+ for current Ollama; check with: nvidia-smi)
ollama pull hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS
# generate + read the eval rate
curl -s http://127.0.0.1:11434/api/generate -d '{
"model": "hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS",
"prompt": "Explain mixture-of-experts in 150 words.",
"stream": false,
"options": {"num_predict": 256}
}'
# tokens/sec = eval_count / (eval_duration / 1e9)
To force a single GPU for comparison, start the server with CUDA_VISIBLE_DEVICES=0 ollama serve.
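If you would rather script the measurement than read the JSON by eye, the same request can be sent from Python. This is a sketch, not the exact harness used for the numbers above: the function names are hypothetical, and it relies on the eval_count, eval_duration, prompt_eval_count, and prompt_eval_duration fields of the /api/generate response, with durations in nanoseconds.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS"


def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens/sec."""
    return count / (duration_ns / 1e9)


def summarize(response: dict) -> dict:
    """Extract the generation rate, plus the prefill rate when reported."""
    rates = {
        "generation_tps": tokens_per_second(
            response["eval_count"], response["eval_duration"]),
    }
    # Only compute prefill when both fields are present in the response.
    if "prompt_eval_count" in response and "prompt_eval_duration" in response:
        rates["prefill_tps"] = tokens_per_second(
            response["prompt_eval_count"], response["prompt_eval_duration"])
    return rates


def run_once(prompt: str, num_predict: int = 256) -> dict:
    """Send one non-streaming request and return the computed rates."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as reply:
        return summarize(json.load(reply))
```

Calling run_once("Explain mixture-of-experts in 150 words.") a few times against a running server gives the per-run rates; repeat it with CUDA_VISIBLE_DEVICES=0 set on the server for the single-GPU comparison.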
Honest Limitations
- One quant, one model, one box. IQ4_XS on 2× 1080 Ti; your tokens/sec will shift with quant, context length, CPU, and RAM speed.
- Prefill measured on a short prompt (~55 tokens) — treat ~50 tok/s as a ballpark; long-context prefill on Pascal will be slower.
- IQ4_XS is a ~4-bit quant — fine for chat/drafting, but it's not full-precision quality.
- MoE-specific. These conclusions (the modest dual-GPU gain, the CPU-mmap behavior) are about this sparse MoE. A dense model that fully fits VRAM would scale differently across two cards.
- A few runs, not a statistical study — numbers are representative, not p-valued.
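To give a sense of how noisy "a few runs" is, here is the spread of the three dual-GPU runs quoted in the TL;DR. Three samples are far too few for real statistics, and the individual single-GPU runs are not listed, so this is a rough sanity check only.

```python
import statistics

dual_gpu_runs = [19.4, 21.4, 20.0]  # tok/s, the three runs quoted in the TL;DR

mean_tps = statistics.mean(dual_gpu_runs)
stdev_tps = statistics.stdev(dual_gpu_runs)   # sample standard deviation
spread_pct = stdev_tps / mean_tps * 100

print(f"mean: {mean_tps:.1f} tok/s")
print(f"sample stdev: {stdev_tps:.2f} tok/s ({spread_pct:.1f}% of mean)")
```

The run-to-run spread is about 1 tok/s, or roughly 5% of the mean, which is worth keeping in mind when reading the ~3.5 tok/s gap between the single- and dual-GPU results.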
FAQ
Q: Can a GTX 1080 Ti really run a 35B model in 2026?
A sparse MoE one, yes: Qwen3.6-35B-A3B at IQ4_XS ran at ~20 tok/s on two of them. A dense 35B would not be usable. The 3B-active design is what makes it work.
Q: Will a second 1080 Ti double my speed?
No. Here it added ~20%. The MoE experts stay memory-mapped in CPU RAM in both single- and dual-GPU setups, so the second card helps but doesn't scale linearly.
Q: Why did Ollama ignore my GPU until I updated the driver?
Recent Ollama bundles CUDA 13, which requires NVIDIA driver ≥ 570. On an older driver it falls back to CPU silently. Update the driver; Ollama then uses its cuda_v12 build for Pascal cards.
Q: Which quant should I use on 22 GB?
IQ4_XS (~17.7 GB) for the best quality that stays (mostly) on the GPUs; Q3_K_M if you want more headroom for context. Avoid the 23.9 GB default — it spills to CPU.
Resources
- Model: Qwen3.6-35B-A3B GGUF (bartowski)
- Ollama · benchmark via /api/generate (eval_count / eval_duration).