Review or Reviews
Tech, development, AI, hardware: hands-on reviews and guides
Latest posts
What Actually Runs Well on a GTX 1080 Ti in 2026 (Measured)
The 'GPU poor' narrative says 24GB-and-below cards are eating well now thanks to QAT and MTP. But what about an 8-year-old 11GB GTX 1080 Ti? I measured it: Gemma 4 12B QAT at ~32 tok/s, Qwen3 8B at ~46, all fully on the GPU. Here's the table and where the ceiling is.
MTP Isn't Always a Win: 1.95× on My 3090, but Speculative Decoding Is Hardware-Dependent
MTP gave Gemma 4 12B QAT a 1.95× generation speedup on my 3090. But the same model with the same MTP draft runs at 0.87× (slower) on an M1 Max. Speculative decoding is a hardware-dependent lever, not a free switch. Here are the measured numbers and why the draft-to-verify ratio decides it.
Gemma 4 QAT on a 1080 Ti: What 'Quantization-Aware' Actually Buys — and Fitting the 12B on 8 GB at 16k
QAT is the buzz around Gemma 4, so I ran it on actual old hardware. The quality claim holds up (vs naive Q4), the speed win is modest (~9%), and yes — you can run the 12B on an 8 GB card at 16k context. Here are the measured numbers and the exact recipe.
More posts
The Prefill Wall: Why MTP's 2× Barely Moves Long-Context Latency (Qwen3.6-27B, RTX 3090)
My last post doubled generation with MTP. A reader asked the question I'd skipped — what about prompt processing at long context? I measured prefill across context sizes on a 3090: a 64k prompt takes ~59s before the first token, and MTP can't touch that. Here's the math on when MTP's 2× actually matters, and when prefill swallows it.
Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → ~75 tok/s)
A commenter pointed me at a faster backend and multi-token prediction to roughly double my 3090's throughput. I measured it one lever at a time: 35.7 tok/s on Ollama → ~75 with MTP, a real ~2.1× (a community re-test corrected my first lucky 80.2 draw). Here's the exact path, with the numbers and the gotchas.
The Ollama num_ctx Trap: a Default You Never Set Can Halve Your Tokens/sec (Full Sweep on a 3090)
Ollama sizes the KV cache to your context length, and the default can quietly push a model that fits in VRAM into a CPU spill — cutting throughput. A full num_ctx sweep of Qwen3.6-27B on a single RTX 3090 shows exactly where the cliff is, and why a bigger context is not free.
Building a Fully-Local Research RAG on 2× GTX 1080 Ti + an RTX 3090: 3 Gotchas (CPU Embeddings, the Context Trap, and Not Merging GPUs)
A field report: building a private, fully-offline hybrid-retrieval RAG over my own papers across old and new GPUs — the embedder that froze the whole GPU, the context setting that halved my speed, and why pooling the cards was a trap. Plus an MCP server so an agent can cite my corpus.
Running Brand-New Gemma 4 12B on an 8-Year-Old GTX 1080 Ti: Speed, 3 Gotchas, and Why Q8 Beat Q4 on My Own Field
I pulled the just-released Gemma 4 12B and ran it on a GTX 1080 Ti. ~28 tok/s at Q4 on one card — but three things broke first, and going to Q8 (split across two cards, 30% slower) fixed both the token glitches and a domain answer the Q4 got confidently wrong.
Running 35B–400B LLMs on a GPU-less Cluster to Mine 10,000 Papers — and the 4 Bugs That Almost Ruined the Data
A field report: a CPU-only, GPU-less distributed LLM pipeline (llama.cpp + quantized MoE) mining 10,000 papers — and the 4 silent data-quality bugs that nearly ruined the results.