Review or Reviews
Tech, development, AI, hardware: reviews and guides based on real-world use
Latest posts
4× GTX 1080 Ti for Local LLM in 2026 — 44GB Combined VRAM Build Guide + Real Benchmarks
Practical build guide for running four GTX 1080 Tis in a single rig — 44 GB combined VRAM at roughly half the cost of a used RTX 3090. Covers PCIe slot configurations on HEDT and Threadripper boards, 1500W+ PSU sizing, cooling (1000W heat dissipation), llama.cpp tensor-split setup, expected throughput on 70B Llama, Mixtral 8×7B, and Qwen3.6-35B-A3B, plus the honest cases where this is not the right choice.
GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M (2026 Real Quality + Speed)
Side-by-side comparison of GGUF quantization formats — Q4_K_M, Q4_K_S, IQ4_XS, Q5_K_M, Q5_K_S, Q8_0 — measured on Llama 3.1 8B and Qwen 3 14B with actual perplexity, MMLU accuracy, VRAM footprint, and tokens/sec on RTX 3090 and GTX 1080 Ti. Practical recommendations for picking the right quant for your hardware.
Ollama OLLAMA_KEEP_ALIVE — How Model Memory Persistence Actually Works (2026)
Practical deep dive into Ollama's OLLAMA_KEEP_ALIVE — the variable that controls whether your loaded model stays in VRAM or gets unloaded after each request. Covers timeout semantics, multi-model scheduling, the per-request keep_alive parameter, and how to optimize for single-user, multi-user, and shared-VRAM scenarios.
More posts
Running Qwen3.6-35B-A3B on RTX 3090 24GB — Real Use Cases for the 3B-Active MoE (2026)
Qwen3.6-35B-A3B (April 2026 release, Apache 2.0 license) puts a 35B-parameter MoE model on a single RTX 3090 24GB at usable speed thanks to its 3B active parameters. Covers practical use cases: agentic coding (SWE-bench 73.4), 262K context document analysis, vision-language tasks, and tool calling, with realistic VRAM math, expected throughput, and where the model genuinely outperforms 8B alternatives.
llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition (1080 Ti, 2080, P40)
When llama.cpp's --split-mode row beats layer on dual-GPU inference, when layer is faster, and why the answer is different on Pascal/Turing without NVLink than on Ampere with NVLink. Real benchmarks on 2× GTX 1080 Ti for Mixtral, Yi-34B, Llama 3.1 13B, with PCIe lane and tensor split notes.
Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti (Actual Benchmarks)
How to make Ollama actually use both GTX 1080 Ti cards without NVLink — environment variables, tensor split configuration, and real tokens/sec benchmarks for 13B and 30B-class models. Where PCIe becomes the bottleneck, what works versus what just looks like it's working, and how the same setup compares to a single 3090.
Running Modern LLMs on GTX 1080 Ti in 2026 — What Still Works, What OOMs
A 2026 reality check for the GTX 1080 Ti: 11 GB VRAM, Pascal architecture, no tensor cores. Which modern LLMs (Llama 3.1, Qwen 3, Phi-4, Gemma 3) still load and run usefully, what hits OOM, real tokens/sec numbers from a 1080 Ti, and when it's time to retire the card.
Ollama vs LM Studio vs llama.cpp: Honest 2026 Comparison for Local LLM
Definitive comparison of the three most popular local LLM inference engines in 2026. Real performance benchmarks on RTX 3090, feature-by-feature matrix, setup walkthroughs, and a decision framework for picking the right tool for your use case.
Best Ollama Models for RTX 3090 24GB in 2026: Real Benchmarks (Qwen3 vs DeepSeek vs Llama)
Real Ollama benchmarks on RTX 3090 24GB — tokens/sec, VRAM, quality scores for 12+ models. Qwen3-30B vs DeepSeek-Coder-V3 vs Llama 4 head-to-head. Plus RTX 4090 comparison, cloud API cost analysis, and which local LLM to pick for your use case in 2026.