Local LLM

Building a Fully-Local Research RAG on 2× GTX 1080 Ti + an RTX 3090: 3 Gotchas (CPU Embeddings, the Context Trap, and Not Merging GPUs)

A field report: building a private, fully-offline hybrid-retrieval RAG over my own papers across old and new GPUs — the embedder that froze the whole GPU, the context setting that halved my speed, and why pooling the cards was a trap. Plus an MCP server so an agent can cite my corpus.

·6 min read
#local LLM#RAG#GTX 1080 Ti#RTX 3090#Ollama#BGE-M3#Qdrant#reranking#MCP#quantization

A local LLM workstation with consumer GPUs

I wanted to ask questions about my own papers without shipping them to a cloud API. This is the real story of building that — a private, fully-offline RAG with hybrid retrieval and reranking — across a pile of old GPUs and one newer one. Three things each cost me the better part of a day, and none of them were what I expected.

The goal: a private RAG over my own papers

I'm a researcher with a folder of PDFs I can't (and won't) upload to a hosted API. I wanted natural-language, cited answers over that corpus, running entirely on my own hardware. So I built a small tool — paper-rag, about 200 lines of Python — with the whole stack local:

PDFs → chunk → BGE-M3 dense (Ollama) ┐
               BM25 sparse (fastembed)┴→ Qdrant (embedded, on disk)
                                            │
question → dense + sparse → RRF fusion → cross-encoder rerank → top passages
                                                                     │
                                                                     ▼
                                             local LLM (Ollama) → cited answer

Dense embeddings catch meaning; BM25 sparse catches exact terms (gene names, identifiers); fusing the two and then reranking with a cross-encoder gives the LLM far better context than cosine-alone. No server, no API key, nothing leaves the box.

The hardware was just what I had lying around: 2× GTX 1080 Ti (Pascal, 2017, 11 GB each) on one machine, and later a single RTX 3090 (24 GB, Ampere) on another. That old-plus-new mix is where the lessons came from.

Gotcha 1: the embedding model kept freezing the entire GPU

On the 1080 Ti box (running under WSL2), a long ingest would make the BGE-M3 embedder hang — and not gracefully. The llama-server process dropped into uninterruptible D-state, nvidia-smi itself stopped responding, and the whole GPU was wedged. No signal could kill it; only a full wsl --shutdown brought it back.

I chased the wrong causes first:

  • "It's the batch size." It wasn't. Once the model degraded, even a single 600-character chunk timed out at 90 seconds. Batch size wasn't the trigger.
  • "A newer Ollama will fix it." I checked the changelogs — the next few patch releases only fixed an unrelated model crash. Nothing about embeddings.

The actual fix was to stop using the GPU for that job at all. BGE-M3 is tiny (~1 GB), so I pinned it to CPU and kept the LLM on the GPU:

printf 'FROM bge-m3\nPARAMETER num_gpu 0\n' > bge-m3-cpu.Modelfile
ollama create bge-m3-cpu -f bge-m3-cpu.Modelfile
RAG_EMBED=bge-m3-cpu python rag.py ingest ./papers   # embeddings on CPU, answers still on the GPU

A ~100-chunk corpus embeds in well under a minute on CPU, and the GPU never wedges again. The lesson: on old Pascal cards under WSL the embedding path is where things break — and the embedder doesn't need the GPU in the first place.

Gotcha 2: my 27B ran at half speed until I capped the context

On the 3090 I loaded qwen3.6:27b (a 27.8B model, Q4, ~17.4 GB of weights) and saw ~17 tok/s. For a 27B that fits in 24 GB, that felt wrong.

ollama ps told the story: the model was loaded at 24.6 GB, with ~4 GB spilled to CPU. But the weights are only 17.4 GB — so what ate the other 7 GB? The KV cache. This model ships a 256K-token native context, and Ollama sized the cache to match it, overflowing VRAM and forcing an offload that throttled everything.

Cap the context to what you actually use and it all fits on the card:

RAG_NUM_CTX=8192 python rag.py ask "..."   # or options.num_ctx in your API call

Result: 100% on-GPU, ~36 tok/s — 2× faster — for a setting that loses nothing on RAG-sized prompts. The lesson: a huge native context silently taxes you even when you never use it. Set num_ctx to your real working size.

Gotcha 3: I have an old pair AND a new card — don't merge them

The tempting idea: pool the 2× 1080 Ti (22 GB) and the 3090 (24 GB) into one big inference cluster. You can (llama.cpp's RPC backend, or exo). You usually shouldn't.

The two machines are joined by a 1 GbE LAN — orders of magnitude slower than NVLink or PCIe. Cross-machine tensor parallelism is bottlenecked by that link, and the slow Pascal cards drag the fast Ampere one down to their level. Merging only pays off if you must run a model too big for any single box — and even then it's slow.

What actually works is role specialization:

  • 3090 → the latency-critical work: LLM generation + reranking + query embedding.
  • 1080 Ti box → throughput/batch work: bulk corpus embedding (on CPU, per Gotcha 1) and ingestion.

A small env knob points embeddings at one box and the LLM at another:

OLLAMA_URL=http://gpu-box:11434      RAG_LLM=qwen3.6:27b  \
RAG_EMBED_URL=http://127.0.0.1:11434 RAG_EMBED=bge-m3-cpu \
python mcp_server.py

Same bge-m3 on both boxes, so the dense vectors are compatible — you can even ingest on one machine and serve from another. ollama ps on each box confirms the split: bge-m3-cpu on the old box's CPU, the 27B on the 3090. The two machines do different jobs in parallel instead of fighting over one model.

The payoff: it's actually useful now

Hybrid + rerank produces noticeably better context, which shows up as cleaner cited answers. Asking "which gene passed all three colocalization tests, and why was DRD2 demoted?" over a couple of my preprints returns: SLC12A5 (KCC2) passed all three (SMR, HEIDI, COLOC), DRD2 demoted because it didn't colocalize — each claim tagged with the source page. Just as important, when the answer genuinely isn't in the retrieved context, it says so instead of inventing one.

And because the tool speaks MCP, I can call it from an agent (Claude, Cursor, …): search_papers and ask_papers show up as tools, so the agent searches and cites my corpus — still fully local. The whole RAG becomes something you plug into your AI client.

There's a quieter reason I want this grounded and local: small quantized models are confidently wrong on niche facts. (In a separate test, a Q4 model inverted the meaning of a statistical test — fluent and completely backwards.) A RAG anchored to the actual papers is how I catch that, instead of trusting the model's memory.

Honest limitations

  • Old Pascal cards are slow, and the embedder needs the CPU workaround above.
  • Reranking runs on CPU (fastembed/ONNX) — fine for a personal library, not a giant corpus.
  • It's a personal tool, not a production search system: naive fixed-size chunking, no semantic chunking yet.
  • Answer quality is your local model's quality — verify domain-specific claims.

Try it / let's compare notes

Code is MIT-licensed: github.com/shoo99/paper-rag — ~200 lines plus a small MCP server.

A couple of questions for anyone running similar setups:

  • For offload-heavy or multi-box rigs, has anyone gotten cross-machine pooling (llama.cpp RPC / exo) to actually beat just running a smaller model on the fastest single card?
  • What's your go-to reranker for local RAG — sticking with a small cross-encoder, or has something better shown up?

관련 글