AI/LLM

How to Run Qwen 3 (30B) Locally with Ollama on RTX 3090 — Complete Guide

Step-by-step guide to running Qwen 3 30B MoE model locally on an NVIDIA RTX 3090 (24GB VRAM) using Ollama. Includes performance benchmarks, port forwarding setup, and API authentication with Caddy reverse proxy.

7 min read
#ollama · #qwen3 · #local LLM · #RTX 3090 · #self-hosted AI · #MoE · #VRAM · #GPU inference

I've been running Qwen 3 30B locally on my RTX 3090 for weeks now — serving it as an API to my web apps, using it for RAG-based biomedical research, and chatting with it daily. Here's everything I learned setting it up.


Why Run LLMs Locally?

  • Zero API costs — No per-token billing. I was spending $50+/month on Claude API
  • Privacy — My proteomics research data never leaves my network
  • No rate limits — As fast as your GPU allows
  • Full control — Custom system prompts, fine-tuning, any model you want

My Hardware Setup

| Component | Spec |
| --- | --- |
| GPU | NVIDIA RTX 3090 (24GB VRAM) |
| Host OS | Windows 11 |
| CPU | AMD Ryzen (32GB RAM) |
| Network | Port-forwarded for remote access |
| VM | Ubuntu 24.04 (VirtualBox) for Docker services |

The RTX 3090 is the sweet spot for local LLM inference in 2026 — 24GB VRAM at ~$700 used. The newer RTX 4090 is faster but costs 2x for marginal gains.

Step 1: Install Ollama on Windows

Download from ollama.com and install. It's a single executable.

# Verify installation
ollama --version
# ollama version 0.6.x

Ollama runs as a background service on port 11434 by default.

Step 2: Pull Qwen 3 30B

ollama pull qwen3:30b

This downloads the Qwen 3 30B MoE (Mixture of Experts) model — about 18GB. The key insight: while it's a 30B-parameter model, only ~3B parameters are active per token thanks to the MoE architecture (hence the variant name Qwen3-30B-A3B). This means:

  • Fits in 24GB VRAM comfortably
  • Inference speed comparable to a much smaller dense model
  • Quality much closer to a dense 30B model
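A rough back-of-envelope check on why it fits (my numbers, not from the Ollama docs): weight-only size is approximately parameters × bits-per-weight ÷ 8, and Ollama's default quantization is around 4-5 bits per weight.

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Rough weight-only size of a quantized model in GB.

    Ignores KV cache and runtime overhead; 4.5 bits/weight is an
    assumed average for 4-bit quantization schemes.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# 30B params at ~4.5 bits/weight -> ~16.9 GB of weights,
# consistent with the ~18 GB download (which includes metadata).
size = quantized_size_gb(30)
```

At fp16 (16 bits/weight) the same model would need ~60GB — far beyond 24GB VRAM, which is why quantization is what makes this setup possible at all.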

Other Models I've Tested on 24GB VRAM

| Model | Size | Speed | Quality | Fits 24GB? |
| --- | --- | --- | --- | --- |
| qwen3:30b (MoE) | 18GB | ★★★★★ | ★★★★ | ✅ Comfortable |
| deepseek-r1:32b | 19GB | ★★★ | ★★★★ | ✅ Tight |
| gemma3:27b | 16GB | ★★★★ | ★★★★ | ✅ Comfortable |
| devstral | 14GB | ★★★★★ | ★★★ | ✅ Easy |
| qwq (32b reasoning) | 19GB | ★★ | ★★★★★ | ✅ Tight |

My recommendation: Start with qwen3:30b. Best balance of speed and quality for 24GB.

Step 3: Test Locally

# Interactive chat
ollama run qwen3:30b

# API call
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:30b",
  "prompt": "Explain proteomics in one paragraph",
  "stream": false
}'

You should get a response in 2-5 seconds for short prompts.

Step 4: Remote Access via Port Forwarding

I needed to access my Ollama server from:

  • My Mac mini (different machine on same network)
  • My Vercel-deployed web app (sysofti.com)
  • Mobile when away from home

Router Port Forwarding

On your router admin page, forward:

External Port 11434 → Internal IP:Port (your Windows PC's IP:11434)

Now http://your-public-ip:11434 works from anywhere.

⚠️ WARNING: This exposes your Ollama API to the entire internet with ZERO authentication. Anyone can use your GPU. We fix this in the next step.

Step 5: Secure with Caddy + API Key Authentication

This is the part most tutorials skip. Ollama has NO built-in authentication. I solved this with Caddy as a reverse proxy.

Install Caddy on Windows

Download from caddyserver.com. Place caddy.exe somewhere permanent.

Create Caddyfile

:11435 {
    @valid_key header X-API-Key YOUR_SECRET_KEY_HERE

    handle @valid_key {
        reverse_proxy localhost:11434
    }

    handle {
        respond "Unauthorized" 401
    }
}

Start Caddy

caddy run --config Caddyfile

Update Port Forwarding

Change your router to forward:

External Port 11434 → Internal IP:11435 (Caddy port)

Now the flow is:

Internet → Router:11434 → Caddy:11435 (checks API key) → Ollama:11434
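The decision Caddy makes at that middle hop is simple enough to sketch in Python (purely illustrative — Caddy itself does this; the `gate` function is my naming):

```python
def gate(headers: dict, secret: str) -> int:
    """Mimic the Caddyfile logic: valid X-API-Key -> proxied (200), else 401."""
    if headers.get("X-API-Key") == secret:
        return 200  # @valid_key matched: reverse_proxy to Ollama on 11434
    return 401      # fallthrough handle: respond "Unauthorized" 401
```

Note that the comparison is on the exact header value — there's no hashing or expiry here, so treat the key like a password and rotate it if it leaks.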

Test Authentication

# Without key → 401 Unauthorized
curl http://your-public-ip:11434/api/tags
# Unauthorized

# With key → 200 OK
curl -H "X-API-Key: YOUR_SECRET_KEY_HERE" http://your-public-ip:11434/api/tags
# {"models":[{"name":"qwen3:30b",...}]}

Step 6: Use from Your Applications

Python (FastAPI backend)

import requests

OLLAMA_URL = "http://your-public-ip:11434"
OLLAMA_KEY = "YOUR_SECRET_KEY_HERE"

def query_llm(prompt: str, model: str = "qwen3:30b"):
    response = requests.post(
        f"{OLLAMA_URL}/api/generate",
        headers={"X-API-Key": OLLAMA_KEY},
        json={"model": model, "prompt": prompt, "stream": False}
    )
    return response.json()["response"]
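For chat-style UIs you usually want `"stream": true` instead: Ollama then returns newline-delimited JSON, one object per line, each carrying a `response` fragment, with `done` set to true on the last one. A stdlib-only sketch (the `collect_stream` helper is my naming; URL and key are placeholders as above):

```python
import json
import urllib.request

OLLAMA_URL = "http://your-public-ip:11434"
OLLAMA_KEY = "YOUR_SECRET_KEY_HERE"

def collect_stream(lines) -> str:
    """Join the 'response' fragments of Ollama's NDJSON stream into one string."""
    parts = []
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

def query_llm_stream(prompt: str, model: str = "qwen3:30b") -> str:
    """POST a streaming generate request and accumulate the full reply."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": True}).encode(),
        headers={"Content-Type": "application/json", "X-API-Key": OLLAMA_KEY},
    )
    with urllib.request.urlopen(req) as resp:
        return collect_stream(resp)  # HTTP response objects iterate line by line
```

In a real UI you'd yield each fragment as it arrives rather than joining at the end — streaming is what makes the 1-2s first-token latency feel responsive.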

JavaScript/TypeScript (Next.js)

const response = await fetch(`${OLLAMA_URL}/api/chat`, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-API-Key': process.env.OLLAMA_API_KEY!,
  },
  body: JSON.stringify({
    model: 'qwen3:30b',
    messages: [{ role: 'user', content: prompt }],
    stream: false,
  }),
})

Embeddings for RAG

# Using nomic-embed-text for vector embeddings
ollama pull nomic-embed-text

curl -H "X-API-Key: YOUR_KEY" http://your-ip:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "BRCA1 protein expression in breast cancer"
}'
# Returns 768-dimensional vector

Performance Benchmarks (RTX 3090)

Real-world numbers from my daily usage:

| Task | Model | Tokens/sec | Latency |
| --- | --- | --- | --- |
| Chat (short) | qwen3:30b | ~35 tok/s | 1-2s first token |
| Chat (long context) | qwen3:30b | ~25 tok/s | 3-5s first token |
| Code generation | devstral | ~45 tok/s | 1s first token |
| Embedding (768d) | nomic-embed-text | N/A | ~50ms per text |
| Reasoning | qwq | ~15 tok/s | 5-10s first token |

Common Issues & Solutions

"Out of memory" Error

# Check VRAM usage
nvidia-smi

# If another model is loaded, unload it first
curl http://localhost:11434/api/generate -d '{"model":"qwen3:30b","keep_alive":0}'

Slow First Response

Ollama loads the model into VRAM on first request. Set keep_alive to keep it loaded:

# Keep model loaded for 24 hours
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:30b",
  "keep_alive": "24h",
  "prompt": "hello"
}'

Caddy Not Starting on Windows

Run as Administrator, or create a Windows Service:

sc.exe create Caddy binPath= "C:\path\to\caddy.exe run --config C:\path\to\Caddyfile"
sc.exe start Caddy

Cost Comparison: Local vs Cloud

| | Local (RTX 3090) | Claude API | OpenAI API |
| --- | --- | --- | --- |
| Upfront | ~$700 (used GPU) | $0 | $0 |
| Monthly | ~$15 electricity | $50-200+ | $50-200+ |
| Per token | $0 | $0.003-0.015/1K | $0.002-0.01/1K |
| Break-even | 4-5 months | — | — |
| Privacy | Full | Data sent to cloud | Data sent to cloud |
After 5 months, local inference is essentially free forever.
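The break-even arithmetic behind that claim, as a small calculator (the $200/month input is the high end of the cloud range above, not a fixed cost — at $50/month of cloud usage the payback stretches to ~20 months):

```python
import math

def break_even_months(gpu_cost: float, cloud_monthly: float,
                      electricity_monthly: float) -> int:
    """Months until the upfront GPU cost is recovered by cloud-API savings."""
    monthly_savings = cloud_monthly - electricity_monthly
    return math.ceil(gpu_cost / monthly_savings)

# $700 GPU, $200/mo cloud spend avoided, $15/mo electricity -> 4 months
months = break_even_months(700, 200, 15)
```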

My Full Stack

For reference, here's my complete self-hosted AI setup:

  • LLM: Ollama + qwen3:30b on RTX 3090 (Windows)
  • Embeddings: nomic-embed-text (768 dims) via same Ollama
  • Auth: Caddy reverse proxy with X-API-Key
  • Backend: FastAPI + Python in Docker (Ubuntu VM)
  • Frontend: Next.js on Vercel (sysofti.com)
  • Database: PostgreSQL (Supabase local) for vectors + results
  • RAG: 1200+ biomedical embeddings for semantic search

Total monthly cost: ~$15 (electricity only).

Conclusion

Running Qwen 3 30B locally on an RTX 3090 is genuinely practical in 2026. The MoE architecture makes it fit comfortably in 24GB VRAM while delivering quality that rivals much larger models. Combined with Caddy for authentication and port forwarding for remote access, you get a production-ready LLM API for essentially free.

The setup took me about 2 hours. The monthly savings compared to cloud APIs paid for the GPU in 4 months.

Next up: I'll cover how I built a full RAG pipeline with these local models for biomedical research — semantic search over 1200+ protein/disease embeddings with zero cloud dependency.


Have questions about running LLMs locally? Feel free to reach out. I'm happy to share more details about any part of this setup.
