Blog

35 posts in total

All · Local LLM · General · Bioinformatics · Development · AI/LLM · DevOps · AI/ML · Self-Hosting · GPUs

Local LLM · June 23, 2026 · 7 min read

The Open-Model Cost Chart Everyone's Sharing Is API Prices. Here's What Self-Hosting Actually Gets You (Measured)

The intelligence-vs-cost chart making the rounds shows open models winning the value quadrant. True, but the x-axis is API token price. The cheap open winners are 100B-to-1T MoEs you can't run on a desktop GPU. Here's what you can actually self-host on an 11GB and a 24GB card, measured, and where the real ceiling is.

#local LLM#open source#self-hosting#GPU poor

Local LLM · June 19, 2026 · 10 min read

I Added a Verify Layer to My Local RAG to Catch Hallucinations. It Caught Me Being Wrong Twice About My Own Corpus

A claim-verification layer for a local RAG co-scientist, inspired by Karpathy's llm-wiki pattern. I tried to measure whether it catches hallucinations, almost shipped a false finding, and ended up with a clearer picture of what claim-checking can and can't do. It reliably catches values that are absent from the context, but it misses a real number pinned to the wrong question and misses a false premise outright. And a model can't reliably referee its own blind spots.

#local LLM#RAG#hallucination#ollama

Local LLM · June 12, 2026 · 4 min read

What Actually Runs Well on a GTX 1080 Ti in 2026 (Measured)

The 'GPU poor' narrative says 24GB-and-below cards are eating well now thanks to QAT and MTP. But what about an 8-year-old 11GB GTX 1080 Ti? I measured it: Gemma 4 12B QAT at ~32 tok/s, Qwen3 8B at ~46, all fully on the GPU. Here's the table and where the ceiling is.

#local LLM#GTX 1080 Ti#Gemma 4#Qwen3

Local LLM · June 11, 2026 · 4 min read

MTP Isn't Always a Win: 1.95× on My 3090, but Speculative Decoding Is Hardware-Dependent

MTP gave Gemma 4 12B QAT a 1.95× generation speedup on my 3090. But the same model with the same MTP draft runs at 0.87× (that is, slower) on an M1 Max. Speculative decoding is a hardware-dependent lever, not a free switch. Here are the measured numbers and why the draft-to-verify ratio decides it.

#local LLM#MTP#speculative decoding#Gemma 4

Local LLM · June 10, 2026 · 6 min read

Gemma 4 QAT on a 1080 Ti: What 'Quantization-Aware' Actually Buys — and Fitting the 12B on 8 GB at 16k

QAT is the buzz around Gemma 4, so I ran it on actual old hardware. The quality claim holds up (vs naive Q4), the speed win is modest (~9%), and yes — you can run the 12B on an 8 GB card at 16k context. Here are the measured numbers and the exact recipe.

#local LLM#Gemma 4#QAT#quantization

Local LLM · June 10, 2026 · 6 min read

The Prefill Wall: Why MTP's 2× Barely Moves Long-Context Latency (Qwen3.6-27B, RTX 3090)

My last post doubled generation with MTP. A reader asked the question I'd skipped — what about prompt processing at long context? I measured prefill across context sizes on a 3090: a 64k prompt takes ~59s before the first token, and MTP can't touch that. Here's the math on when MTP's 2× actually matters, and when prefill swallows it.

#local LLM#RTX 3090#prefill#long context
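
The trade-off described in the prefill post can be sketched with simple arithmetic. This is a minimal sketch, assuming end-to-end latency is just prefill time plus output tokens divided by generation rate. The ~59 s prefill is the 64k figure quoted in the summary above; the 35 and 70 tok/s rates are illustrative placeholders for "baseline" and "a clean 2×", not measurements, and the function names are hypothetical.

```python
def end_to_end_seconds(prefill_s: float, out_tokens: int, gen_tok_per_s: float) -> float:
    """Total latency = time to first token (prefill) + time to generate the output."""
    return prefill_s + out_tokens / gen_tok_per_s


def end_to_end_speedup(prefill_s: float, out_tokens: int,
                       base_rate: float, fast_rate: float) -> float:
    """How much faster the whole request gets when only generation speeds up."""
    base = end_to_end_seconds(prefill_s, out_tokens, base_rate)
    fast = end_to_end_seconds(prefill_s, out_tokens, fast_rate)
    return base / fast


PREFILL_64K_S = 59.0   # from the post summary: ~59 s before the first token at 64k
BASE_RATE = 35.0       # assumed baseline generation rate (tok/s), illustrative only
MTP_RATE = 70.0        # assumed rate with a clean 2x from MTP, illustrative only

if __name__ == "__main__":
    for out_tokens in (100, 500, 2000):
        gain = end_to_end_speedup(PREFILL_64K_S, out_tokens, BASE_RATE, MTP_RATE)
        print(f"{out_tokens:>5} output tokens: end-to-end speedup {gain:.2f}x")
```

Under these assumed rates, a 500-token answer after a 64k prompt gets only about 1.11× faster end to end, because the fixed prefill cost dominates. With no prefill at all, the same function returns the full 2×.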

Local LLM · June 9, 2026 · 8 min read

Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → ~75 tok/s)

A commenter pointed me at a faster backend and multi-token prediction to roughly double my 3090's throughput. I measured it one lever at a time: 35.7 tok/s on Ollama → ~75 with MTP, a real ~2.1× (a community re-test corrected my first lucky 80.2 draw). Here's the exact path, with the numbers and the gotchas.

#local LLM#RTX 3090#llama.cpp#MTP

Local LLM · June 7, 2026 · 4 min read

The Ollama num_ctx Trap: a Default You Never Set Can Halve Your Tokens/sec (Full Sweep on a 3090)

Ollama sizes the KV cache to your context length, and the default can quietly push a model that fits in VRAM into a CPU spill — cutting throughput. A full num_ctx sweep of Qwen3.6-27B on a single RTX 3090 shows exactly where the cliff is, and why a bigger context is not free.

#Ollama#local LLM#num_ctx#KV cache

Local LLM · June 6, 2026 · 6 min read

Building a Fully-Local Research RAG on 2× GTX 1080 Ti + an RTX 3090: 3 Gotchas (CPU Embeddings, the Context Trap, and Not Merging GPUs)

A field report: building a private, fully-offline hybrid-retrieval RAG over my own papers across old and new GPUs — the embedder that froze the whole GPU, the context setting that halved my speed, and why pooling the cards was a trap. Plus an MCP server so an agent can cite my corpus.

#local LLM#RAG#GTX 1080 Ti#RTX 3090

Local LLM · June 5, 2026 · 6 min read

Running Brand-New Gemma 4 12B on an 8-Year-Old GTX 1080 Ti: Speed, 3 Gotchas, and Why Q8 Beat Q4 on My Own Field

I pulled the just-released Gemma 4 12B and ran it on a GTX 1080 Ti. ~28 tok/s at Q4 on one card — but three things broke first, and going to Q8 (split across two cards, 30% slower) fixed both the token glitches and a domain answer the Q4 got confidently wrong.

#Gemma 4#GTX 1080 Ti#Ollama#quantization

Local LLM · June 3, 2026 · 11 min read

Running 35B–400B LLMs on a GPU-less Cluster to Mine 10,000 Papers — and the 4 Bugs That Almost Ruined the Data

A field report: a CPU-only, GPU-less distributed LLM pipeline (llama.cpp + quantized MoE) mining 10,000 papers — and the 4 silent data-quality bugs that nearly ruined the results.

#llama.cpp#local LLM#MoE#CPU inference

Local LLM · June 3, 2026 · 7 min read

Running a 35B MoE (Qwen3.6-35B-A3B) on 2× GTX 1080 Ti in 2026 — Real Benchmarks, and Does the Second GPU Actually Help?

I benchmarked Qwen3.6-35B-A3B (IQ4_XS) on a pair of 8-year-old GTX 1080 Ti cards. It runs at ~20 tokens/sec — and the answer to 'does the second GPU help?' is yes, but only ~20% faster, not 2×. Here are the real numbers, the VRAM math, and why a 35B model fits 22 GB at all.

#GTX 1080 Ti#Qwen3.6#MoE#Ollama
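
The "why a 35B model fits 22 GB" question comes down to bits per weight. Here is a rough sketch, assuming about 4.25 bits per weight for IQ4_XS (an approximate figure, since real GGUF files mix quantization types per tensor) and ignoring KV cache, activations, and runtime overhead. The function name and the 22 GB budget for two 11 GB cards are illustrative.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8.0


PARAMS_B = 35.0        # Qwen3.6-35B-A3B total parameters, in billions
IQ4_XS_BPW = 4.25      # assumption: approximate average bits per weight
TOTAL_VRAM_GB = 22.0   # 2 x GTX 1080 Ti at 11 GB each

if __name__ == "__main__":
    weights = weight_footprint_gb(PARAMS_B, IQ4_XS_BPW)
    headroom = TOTAL_VRAM_GB - weights
    print(f"weights at IQ4_XS: ~{weights:.1f} GB")
    print(f"left for KV cache and overhead: ~{headroom:.1f} GB")
    print(f"same model at FP16: ~{weight_footprint_gb(PARAMS_B, 16.0):.0f} GB")
```

By this estimate the quantized weights take roughly 18.6 GB, which leaves only a few GB across both cards for context. That is why the fit is tight rather than comfortable.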

General · May 27, 2026 · 16 min read

4× GTX 1080 Ti for Local LLM in 2026 — 44GB Combined VRAM Build Guide + Real Benchmarks

Practical build guide for running four GTX 1080 Tis in a single rig — 44 GB combined VRAM at roughly half the cost of a used RTX 3090. Covers PCIe slot configurations on HEDT and Threadripper boards, 1500W+ PSU sizing, cooling (1000W heat dissipation), llama.cpp tensor-split setup, expected throughput on 70B Llama, Mixtral 8×7B, and Qwen3.6-35B-A3B, plus the honest cases where this is not the right choice.

#4x 1080 Ti#multi-GPU LLM#44GB VRAM#Llama 70B local

General · May 27, 2026 · 12 min read

GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M (2026 Real Quality + Speed)

Side-by-side comparison of GGUF quantization formats — Q4_K_M, Q4_K_S, IQ4_XS, Q5_K_M, Q5_K_S, Q8_0 — measured on Llama 3.1 8B and Qwen 3 14B with actual perplexity, MMLU accuracy, VRAM footprint, and tokens/sec on RTX 3090 and GTX 1080 Ti. Practical recommendations for picking the right quant for your hardware.

#GGUF#quantization#Q4_K_M#Q4_K_S

General · May 27, 2026 · 10 min read

Ollama OLLAMA_KEEP_ALIVE — How Model Memory Persistence Actually Works (2026)

Practical deep dive into Ollama's OLLAMA_KEEP_ALIVE — the variable that controls whether your loaded model stays in VRAM or gets unloaded after each request. Covers timeout semantics, multi-model scheduling, the per-request keep_alive parameter, and how to optimize for single-user, multi-user, and shared-VRAM scenarios.

#Ollama#OLLAMA_KEEP_ALIVE#model unload#VRAM management

General · May 27, 2026 · 15 min read

Running Qwen3.6-35B-A3B on RTX 3090 24GB — Real Use Cases for the 3B-Active MoE (2026)

Qwen3.6-35B-A3B (April 2026 release) puts a 35B-parameter MoE model on a single RTX 3090 24GB at usable speed thanks to its 3B active parameters and Apache 2.0 license. Practical use cases — agentic coding (SWE-bench 73.4), 262K context document analysis, vision-language tasks, and tool calling — with realistic VRAM math, expected throughput, and where the model genuinely outperforms 8B alternatives.

#Qwen3.6#Qwen3.6-35B-A3B#RTX 3090#local LLM

General · May 23, 2026 · 9 min read

llama.cpp --split-mode row vs layer on Multi-GPU — Old GPU Edition (1080 Ti, 2080, P40)

When llama.cpp's --split-mode row beats layer on dual-GPU inference, when layer is faster, and why the answer is different on Pascal/Turing without NVLink than on Ampere with NVLink. Real benchmarks on 2× GTX 1080 Ti for Mixtral, Yi-34B, Llama 3.1 13B, with PCIe lane and tensor split notes.

#llama.cpp#split-mode#tensor split#multi-GPU

General · May 23, 2026 · 10 min read

Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti (Actual Benchmarks)

How to make Ollama actually use both GTX 1080 Ti cards without NVLink — environment variables, tensor split configuration, and real tokens/sec benchmarks for 13B and 30B-class models. Where PCIe becomes the bottleneck, what works versus what just looks like it's working, and how the same setup compares to a single 3090.

#Ollama dual GPU#GTX 1080 Ti dual#tensor split#no NVLink

General · May 23, 2026 · 11 min read

Running Modern LLMs on GTX 1080 Ti in 2026 — What Still Works, What OOMs

A 2026 reality check for the GTX 1080 Ti: 11 GB VRAM, Pascal architecture, no tensor cores. Which modern LLMs (Llama 3.1, Qwen 3, Phi-4, Gemma 3) still load and run usefully, what hits OOM, real tokens/sec numbers from a 1080 Ti, and when it's time to retire the card.

#GTX 1080 Ti#Pascal GPU#local LLM old GPU#Ollama 1080 Ti

General · May 18, 2026 · 17 min read

Ollama vs LM Studio vs llama.cpp: Honest 2026 Comparison for Local LLM

Definitive comparison of the three most popular local LLM inference engines in 2026. Real performance benchmarks on RTX 3090, feature-by-feature matrix, setup walkthroughs, and a decision framework for picking the right tool for your use case.

#Ollama#LM Studio#llama.cpp#local LLM

General · March 30, 2026 · 20 min read

Best Ollama Models for RTX 3090 (2026): Qwen3 vs DeepSeek vs Llama Benchmarks

I benchmarked 12+ Ollama models on an RTX 3090 24GB — real tokens/sec, VRAM, and quality scores. See which local LLM wins in 2026: Qwen3, DeepSeek, or Llama 4.

#RTX 3090#Ollama#local LLM#Qwen3

General · March 30, 2026 · 9 min read

Qwen3 vs DeepSeek R2 vs Llama 4 Local Performance — RTX 3090 24GB Benchmark 2026

Qwen3 vs DeepSeek R2 vs Llama 4 on an RTX 3090 24GB — real tokens/sec, VRAM, and quality scores across coding, reasoning, and writing. Which local LLM wins in 2026?

#Qwen3#DeepSeek R2#Llama 4#local LLM benchmark

General · March 30, 2026 · 10 min read

$4,000 Home AI Server Build (2026): RTX 4090 vs 3090 vs 5090, Real Costs

My $4,000 home AI server for running LLMs locally — RTX 4090 vs 3090 vs 5090, full parts list, Ollama setup, power costs, and what I'd buy differently after 18 months.

#home AI server#local LLM server#RTX 4090#RTX 3090

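Bioinformatics · March 18, 2026 · 15 min read

Building an AI Assistant for Researchers: Automating Bioinformatics Workflows with OpenClaw

A first-hand account of automating repetitive proteomics analysis with OpenClaw and improving research efficiency by more than 90%. From building a DIA-NN pipeline to developing a biomarker database, it lays out the concrete implementation process and results in detail.

#OpenClaw#bioinformatics#proteomics#AI automation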

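Development · February 26, 2026 · 7 min read

Using a Mac mini as a 24/7 Server: From Electricity Costs to Setup

What I learned running a Mac mini M4 as an always-on, 24-hour server: electricity costs, heat, network configuration, and remote-access know-how. The textbook value-for-money home server.

#Mac mini#home server#24/7 server#macOS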

AI/LLM · February 25, 2026 · 5 min read

Securing Ollama with API Key Authentication Using Caddy Reverse Proxy

Ollama has no built-in authentication. Here's how I secured my public-facing Ollama API with Caddy reverse proxy and X-API-Key header validation — complete with Windows setup, CORS handling, and Vercel integration.

#ollama#caddy#reverse proxy#API authentication

AI/LLM · February 25, 2026 · 5 min read

Ollama vs ChatGPT in 2026: Is Running AI Locally Worth It?

Honest comparison between Ollama (local LLM) and ChatGPT/Claude cloud APIs in 2026. Cost analysis, quality benchmarks, privacy, and real-world use cases from someone who uses both daily.

#ollama#chatgpt#local LLM#AI comparison

AI/LLM · February 25, 2026 · 7 min read

How to Run Qwen 3 (30B) Locally with Ollama on RTX 3090 — Complete Guide

Step-by-step guide to running Qwen 3 30B MoE model locally on an NVIDIA RTX 3090 (24GB VRAM) using Ollama. Includes performance benchmarks, port forwarding setup, and API authentication with Caddy reverse proxy.

#ollama#qwen3#local LLM#RTX 3090

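DevOps · February 23, 2026 · 9 min read

Putting R + Bioconductor Inside Docker: The Fight Against a 4GB Image

How I put R + Bioconductor into Docker so a Python backend could call R's limma, clusterProfiler, and fgsea. A trial-and-error log covering rpy2 S4 object conversion failures, ContextVar thread errors, and optimizing a 4.13GB image.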

#Docker#R#Bioconductor#rpy2

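AI/ML · February 23, 2026 · 8 min read

Replacing Claude with an RTX 3090: Building Ollama + Caddy Authentication

Claude API costs were becoming a burden, so I put Ollama on an RTX 3090 and built API authentication in front of it with a Caddy reverse proxy. I also share candidly the hallucination problems with qwen3:30b and my strategies for dealing with them.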

#Ollama#RTX 3090#Caddy#LLM

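Development · February 22, 2026 · 7 min read

Building a Local AI Server with Docker: RTX 3090 Trial and Error

The trial and error I went through building a Docker-based local AI server on an RTX 3090 GPU, and how I resolved each problem. Hands-on experience from NVIDIA Container Toolkit setup through LLM serving.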

#Docker#AI#GPU#RTX 3090

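AI/ML · February 21, 2026 · 5 min read

Local LLM Installation Guide: Running AI on Your Own Computer with Ollama (2026)

A complete guide from installing Ollama to running a local LLM. Explains step by step how to use open-source models such as Llama 3, Mistral, and Gemma on your own computer while keeping your data private.

#Ollama#local LLM#open-source AI#Llama3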

Self-Hosting · February 20, 2026 · 8 min read

Self-Hosting vs Cloud — Which Is Actually Cheaper in 2026?

Self-hosting vs cloud: a detailed cost comparison for 2026. Calculate the real costs of running your own server vs cloud services for storage, email, media, and more.

#self-hosting#cloud#cost comparison#home server

GPUs · February 20, 2026 · 8 min read

RTX 5090 vs RTX 4090 — Benchmark Comparison and Upgrade Guide

RTX 5090 vs RTX 4090 benchmark comparison. Detailed performance analysis in gaming, AI, and content creation. Is the upgrade worth it? Full specs and real-world tests.

#RTX 5090#RTX 4090#benchmark#GPU comparison

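Development · February 19, 2026 · 8 min read

Building a SaaS on the Supabase Free Tier: Limits and Workarounds

The limits I hit building a real SaaS on the Supabase free plan, and practical strategies for working around them. How to get past the database size, API request, Auth, and Storage limits.

#Supabase#SaaS#free tier#backend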