Review or Reviews

Tech, development, AI, hardware: reviews and guides based on hands-on use

Running a 35B MoE (Qwen3.6-35B-A3B) on 2× GTX 1080 Ti in 2026 — Real Benchmarks, and Does the Second GPU Actually Help?

I benchmarked Qwen3.6-35B-A3B (IQ4_XS) on a pair of 8-year-old GTX 1080 Ti cards. It runs at ~20 tokens/sec. Does the second GPU help? Yes, but it is only ~20% faster, not 2×. Here are the real numbers, the VRAM math, and why a 35B model fits in 22 GB at all.

6/3

4× GTX 1080 Ti for Local LLM in 2026 — 44GB Combined VRAM Build Guide + Real Benchmarks

Practical build guide for running four GTX 1080 Tis in a single rig — 44 GB combined VRAM at roughly half the cost of a used RTX 3090. Covers PCIe slot configurations on HEDT and Threadripper boards, 1500W+ PSU sizing, cooling (1000W heat dissipation), llama.cpp tensor-split setup, expected throughput on 70B Llama, Mixtral 8×7B, and Qwen3.6-35B-A3B, plus the honest cases where this is not the right choice.

5/27

GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M (2026 Real Quality + Speed)

Side-by-side comparison of GGUF quantization formats — Q4_K_M, Q4_K_S, IQ4_XS, Q5_K_M, Q5_K_S, Q8_0 — measured on Llama 3.1 8B and Qwen 3 14B with actual perplexity, MMLU accuracy, VRAM footprint, and tokens/sec on RTX 3090 and GTX 1080 Ti. Practical recommendations for picking the right quant for your hardware.

5/27

Ollama OLLAMA_KEEP_ALIVE — How Model Memory Persistence Actually Works (2026)

Practical deep dive into Ollama's OLLAMA_KEEP_ALIVE — the variable that controls whether your loaded model stays in VRAM or gets unloaded after each request. Covers timeout semantics, multi-model scheduling, the per-request keep_alive parameter, and how to optimize for single-user, multi-user, and shared-VRAM scenarios.

5/27

Running Qwen3.6-35B-A3B on RTX 3090 24GB — Real Use Cases for the 3B-Active MoE (2026)

Qwen3.6-35B-A3B (April 2026 release) puts a 35B-parameter MoE model on a single RTX 3090 24GB at usable speed thanks to its 3B active parameters and Apache 2.0 license. Practical use cases — agentic coding (SWE-bench 73.4), 262K context document analysis, vision-language tasks, and tool calling — with realistic VRAM math, expected throughput, and where the model genuinely outperforms 8B alternatives.

5/27

llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition (1080 Ti, 2080, P40)

When llama.cpp's --split-mode row beats layer on dual-GPU inference, when layer is faster, and why the answer is different on Pascal/Turing without NVLink than on Ampere with NVLink. Real benchmarks on 2× GTX 1080 Ti for Mixtral, Yi-34B, Llama 3.1 13B, with PCIe lane and tensor split notes.

5/23

View all posts →

Latest posts

Building a Fully-Local Research RAG on 2× GTX 1080 Ti + an RTX 3090: 3 Gotchas (CPU Embeddings, the Context Trap, and Not Merging GPUs)

Running Brand-New Gemma 4 12B on an 8-Year-Old GTX 1080 Ti: Speed, 3 Gotchas, and Why Q8 Beat Q4 on My Own Field

Running 35B–400B LLMs on a GPU-less Cluster to Mine 10,000 Papers — and the 4 Bugs That Almost Ruined the Data
