Ollama OLLAMA_KEEP_ALIVE — How Model Memory Persistence Actually Works (2026)
Practical deep dive into Ollama's OLLAMA_KEEP_ALIVE — the variable that controls whether your loaded model stays in VRAM or gets unloaded after each request. Covers timeout semantics, multi-model scheduling, the per-request keep_alive parameter, and how to optimize for single-user, multi-user, and shared-VRAM scenarios.
Why This Matters More Than You'd Think
Loading an 8-14B GGUF model takes 5-15 seconds, and you pay that on the first request every time the model is cold. If you're poking at Ollama interactively, you rarely notice. If you're running a script that hits Ollama every few minutes (a Slack bot, a periodic enrichment job, a custom RAG app), every cold reload is a latency penalty for users and a wasted load cycle for the GPU.
OLLAMA_KEEP_ALIVE is the variable that controls model unload behavior. Get it right and your model is always warm. Get it wrong and you either burn VRAM 24/7 or pay the cold-start penalty on every interaction.
This guide is the practical explanation of how it works in 2026 — defaults, edge cases, multi-model scheduling, and the cases where the per-request keep_alive parameter beats the environment variable.
Default Behavior (No Configuration)
When you start ollama serve with no environment variables, the default is:
- After a request completes, the model stays loaded in VRAM for 5 minutes
- Then the model is unloaded; next request reloads it (cold start)
This is conservative. Good if you're on a multi-user shared workstation where VRAM matters; bad if you're the only user and want models always ready.
Setting OLLAMA_KEEP_ALIVE
Set as an environment variable for ollama serve:
# Examples
export OLLAMA_KEEP_ALIVE=24h # Keep model loaded for 24 hours after last use
export OLLAMA_KEEP_ALIVE=-1 # Keep loaded indefinitely (until manually unloaded)
export OLLAMA_KEEP_ALIVE=0 # Unload immediately after each request
export OLLAMA_KEEP_ALIVE=30m # 30 minutes
export OLLAMA_KEEP_ALIVE=5s # 5 seconds (testing only)
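To make the value forms above concrete, here is a minimal Python sketch that normalizes a keep-alive value to seconds. It is an approximation for illustration, not Ollama's own parser: the name keep_alive_seconds is hypothetical, it assumes a bare number means seconds and any negative value means "keep loaded indefinitely", and it only understands the h, m, s, and ms units.

```python
import math
import re

# Seconds per unit for the duration suffixes this sketch understands.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|s|m|h)")
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def keep_alive_seconds(value):
    """Return how long the model stays loaded, in seconds.

    math.inf stands for "indefinitely"; 0.0 means "unload right away".
    Hypothetical helper: assumes bare numbers are seconds.
    """
    if isinstance(value, (int, float)):
        return math.inf if value < 0 else float(value)
    text = str(value).strip()
    negative = text.startswith("-")
    body = text.lstrip("-")
    if _NUMBER.fullmatch(body):
        seconds = float(body)
    else:
        parts = _PART.findall(body)
        # Reject input that is not entirely made of <number><unit> pieces.
        if not parts or "".join(n + u for n, u in parts) != body:
            raise ValueError(f"unrecognized keep-alive value: {value!r}")
        seconds = sum(float(n) * _UNIT_SECONDS[u] for n, u in parts)
    return math.inf if negative and seconds > 0 else seconds


print(keep_alive_seconds("24h"))  # 86400.0
print(keep_alive_seconds("30m"))  # 1800.0
print(keep_alive_seconds("-1"))   # inf
print(keep_alive_seconds("0"))    # 0.0
```

A helper like this is mostly useful for sanity-checking a config value before deploying it; the server remains the authority on what it accepts.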
For systemd-managed Ollama:
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"
Then:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Verify it took:
systemctl show ollama --property Environment
Per-Request keep_alive Parameter
The API also accepts a per-request keep_alive value that overrides the environment default for that one request:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Hi",
"keep_alive": "1h"
}'
import requests
r = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3.1",
"prompt": "Summarize this email",
"keep_alive": "30m", # this request asks model to stay loaded 30 min after this completes
})
Special values for per-request:
- 0 or "0": unload model immediately after this request
- -1: keep loaded indefinitely
- Duration strings: "30s", "5m", "1h", "24h"
What keep_alive Actually Counts From
A common confusion: the timer resets on every request. It doesn't count from when the model was first loaded.
Example with OLLAMA_KEEP_ALIVE=10m:
- T=0: load Llama 3.1 8B for first prompt → model in VRAM
- T=5min: another prompt → timer resets to 10 min from now
- T=12min: timer would have expired at T=15 (5 + 10) — model still loaded
- T=12min: another prompt → timer resets again
- T=23min: no requests since T=12 → model unloaded at T=22 (12 + 10)
This means active use = persistent loading. The timeout only matters during idle periods.
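The timeline above can be replayed with a small simulation. This is a sketch of the reset-on-every-request rule as described here, not Ollama's scheduler code; simulate_idle_timer is a hypothetical name, and each request is assumed to complete instantly.

```python
def simulate_idle_timer(request_minutes, keep_alive_minutes):
    """Return (cold_starts, unload_minute) for a list of request times.

    Simplification: each request is treated as completing instantly.
    """
    cold_starts = 0
    expires_at = None  # None means the model is not loaded yet
    for t in sorted(request_minutes):
        if expires_at is None or t >= expires_at:
            cold_starts += 1  # model was cold, so this request pays the load
        expires_at = t + keep_alive_minutes  # every request restarts the timer
    return cold_starts, expires_at


# Requests at T=0, 5, and 12 with a 10-minute keep-alive.
print(simulate_idle_timer([0, 5, 12], 10))      # (1, 22)
# A fourth request at T=30 arrives after the unload at T=22: second cold start.
print(simulate_idle_timer([0, 5, 12, 30], 10))  # (2, 40)
```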
Verify Model Is Loaded
Check what's currently in memory:
ollama ps
# Or via API:
curl http://localhost:11434/api/ps
Output:
NAME ID SIZE PROCESSOR UNTIL
llama3.1:8b 365c0bd3c000 5.1 GB 100% GPU 59 minutes from now
The UNTIL column shows when this specific model will unload. Useful for debugging "why is it cold again?" issues — ollama ps tells you the truth.
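For scripted checks, the same information can be read from the /api/ps response. The sketch below summarizes an already-decoded JSON payload; it assumes the response has a models list whose entries carry name, size, size_vram, and expires_at fields (verify against your Ollama version), and the sample payload is made up for illustration. summarize_ps is a hypothetical helper name.

```python
def summarize_ps(payload):
    """Return (name, gpu_percent, expires_at) for each loaded model.

    Assumes the /api/ps JSON shape described in the lead-in.
    """
    rows = []
    for model in payload.get("models", []):
        size = model.get("size", 0)
        vram = model.get("size_vram", 0)
        # Share of the model's memory footprint that lives in VRAM.
        gpu_percent = round(100 * vram / size) if size else 0
        rows.append((model.get("name"), gpu_percent, model.get("expires_at")))
    return rows


# Illustrative payload, not captured from a real server.
sample = {
    "models": [
        {
            "name": "llama3.1:8b",
            "size": 5100000000,
            "size_vram": 5100000000,
            "expires_at": "2026-01-01T12:00:00Z",
        }
    ]
}
print(summarize_ps(sample))  # [('llama3.1:8b', 100, '2026-01-01T12:00:00Z')]
```

A gpu_percent below 100 suggests part of the model spilled to system RAM, which is worth alerting on in a monitoring script.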
Multi-Model Scheduling
If you use multiple models (Llama 3.1 for chat, DeepSeek-Coder for code, Qwen 3 for Korean), the scheduling gets interesting:
Single-model behavior (OLLAMA_MAX_LOADED_MODELS=1, or too little VRAM for a second model)
- Request to model A loads A
- Request to model B unloads A, loads B
- Per-model keep_alive does NOT prevent eviction when another model is requested
This is the source of most "why is my model cold again?" frustration.
Multi-model concurrent (OLLAMA_MAX_LOADED_MODELS=N)
Set this to keep multiple models simultaneously loaded:
export OLLAMA_MAX_LOADED_MODELS=3
Then up to 3 models stay in VRAM concurrently. The first 3 stay; the 4th request evicts the least recently used (LRU). Combined with a generous OLLAMA_KEEP_ALIVE, you get persistent multi-model serving.
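The LRU eviction described above can be sketched with an ordered dictionary. This is a simplified model for intuition only: simulate_lru is a hypothetical name, and the real scheduler also has to account for available memory and keep-alive expiry, which this sketch ignores.

```python
from collections import OrderedDict


def simulate_lru(requests, max_loaded):
    """Return (loaded_models_oldest_first, evicted_models_in_order)."""
    loaded = OrderedDict()  # insertion order doubles as recency order
    evicted = []
    for model in requests:
        if model in loaded:
            loaded.move_to_end(model)  # already warm: mark as most recently used
            continue
        if len(loaded) >= max_loaded:
            victim, _ = loaded.popitem(last=False)  # drop the least recently used
            evicted.append(victim)
        loaded[model] = True
    return list(loaded), evicted


sequence = ["llama3.1", "deepseek-coder", "qwen3:14b", "llama3.1", "mixtral"]
print(simulate_lru(sequence, 3))
# (['qwen3:14b', 'llama3.1', 'mixtral'], ['deepseek-coder'])
```

Note that the second llama3.1 request refreshes its recency, so deepseek-coder is the one evicted when the fourth model arrives.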
VRAM math becomes critical:
3 × Llama 3.1 8B Q4_K_M @ 4K ctx = 3 × 5.5 GB = 16.5 GB
On a 24 GB card, that fits with room for KV cache growth and a system process. On 11 GB you can only hold one 8B at a time.
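The same arithmetic as a quick check you can adapt. This is a rough sketch: vram_budget is a hypothetical helper, the per-model sizes are the estimates used in this article, and the 1 GB headroom default is an assumption standing in for KV cache growth and other processes.

```python
def vram_budget(model_sizes_gb, vram_gb, headroom_gb=1.0):
    """Return (total_model_gb, fits) for a set of concurrently loaded models.

    headroom_gb is an assumed safety margin, not a measured value.
    """
    total = sum(model_sizes_gb)
    fits = total + headroom_gb <= vram_gb
    return total, fits


print(vram_budget([5.5, 5.5, 5.5], 24.0))  # (16.5, True)
print(vram_budget([5.5, 5.5], 11.0))       # (11.0, False)
print(vram_budget([5.5], 11.0))            # (5.5, True)
```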
Per-model TTL via per-request keep_alive
If you want different models to have different persistence policies, use the per-request keep_alive:
import requests
url = "http://localhost:11434/api/generate"
p = "Hi"  # placeholder prompt
# Coder model: only needed during specific tasks, evict aggressively
requests.post(url, json={"model": "deepseek-coder", "prompt": p, "keep_alive": "5m"})
# Chat model: keep loaded all day
requests.post(url, json={"model": "llama3.1", "prompt": p, "keep_alive": "24h"})
VRAM Math — Old vs New Hardware
For a single 11 GB GTX 1080 Ti:
- Llama 3.1 8B Q4_K_M @ 4K ctx: ~5.5 GB → 1 model at a time
- Llama 3.1 8B Q4_K_M @ 8K ctx: ~6.5 GB → still 1 model
- Two 8B models concurrent: OOM
For 2× 11 GB (22 GB combined, via OLLAMA_SCHED_SPREAD — see Ollama Dual GPU Without NVLink):
- One 14B at Q4_K_M (~9.5 GB): fits with headroom
- Two 8B Q4_K_M concurrently: ~11 GB — fits but tight
- 30B class (Yi-34B): one model only. Mixtral 8×7B at Q4_K_M (~27 GB) exceeds 22 GB, so expect partial CPU offload
For 24 GB RTX 3090 / 4090:
- Three 8B models concurrent: ~16-18 GB — comfortable
- 30B + 8B concurrent: viable
Cold Start Time Reality
Measured first-request latency (time to first token after model is cold):
| Model | Size | RTX 3090 cold | GTX 1080 Ti cold |
|---|---|---|---|
| Llama 3.1 8B Q4_K_M | 5 GB | 3-5 s | 8-12 s |
| Llama 3.1 8B Q8_0 | 9 GB | 5-7 s | 14-18 s |
| Qwen 3 14B Q4_K_M | 9 GB | 6-9 s | 18-22 s |
| Mixtral 8×7B Q4_K_M | 27 GB | 18-25 s | n/a (OOM single) |
| Llama 3.1 70B Q4_K_M | 42 GB | 30-50 s (multi-GPU) | n/a |
These add up if you're hitting cold loads multiple times per day. For an interactive Slack bot serving 100 users a day with occasional bursts and gaps: keep_alive=24h might save 20-50 cold loads × ~10s = 200-500s of cumulative latency.
When to Set 0 / Unload Immediately
The use case: a shared workstation with multiple users or competing GPU workloads (Stable Diffusion, model training, a Jupyter notebook running PyTorch).
# Always unload after each request
export OLLAMA_KEEP_ALIVE=0
Or per-request:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "...",
"keep_alive": 0
}'
VRAM is freed within 1-2 seconds of the response completing. Pay cold start on next use. Worth it if other people are losing GPU access otherwise.
When to Set -1 / Indefinite
A dedicated single-user inference box:
export OLLAMA_KEEP_ALIVE=-1
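Model stays loaded forever (until ollama stop modelname or a service restart). Zero cold starts. The cost is VRAM held around the clock plus somewhat higher idle power draw, because the GPU keeps an active context instead of dropping to its lowest power state (~10-50W with the model unloaded). The exact figure depends on the card and driver, so measure yours with nvidia-smi. It's an electricity cost vs latency cost tradeoff.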
For LocalLLaMA hobby servers running 24/7, this is the right setting. For a shared resource, do NOT use -1.
Hidden Behavior — Reloading on Driver/Sleep
Ollama's persistence is process-level. If:
- Your machine sleeps and wakes — model reloads
- Your NVIDIA driver crashes and restarts — model reloads
- Ollama service restarts — model reloads
- The Linux OOM killer terminates Ollama — model reloads (and you should fix your memory pressure)
keep_alive doesn't survive these. If long-term persistence matters, monitor with:
# Watch for unexpected Ollama restarts
journalctl -u ollama -f
API Subtlety — keep_alive with Streaming
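When using a streaming response:
import json
import requests
p = "Write a short story"  # placeholder prompt
with requests.post("http://localhost:11434/api/generate",
                   json={"model": "llama3.1", "prompt": p, "stream": True, "keep_alive": "1h"},
                   stream=True) as r:
    for line in r.iter_lines():
        if line:  # skip blank keep-alive lines
            # each line is one JSON object carrying a chunk of the response
            print(json.loads(line).get("response", ""), end="", flush=True)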
The keep_alive semantics are: from when the response completes (last token streamed), the model stays loaded for the duration. So if a long-form generation takes 30 seconds, the 1-hour timer starts after that 30-second response, not at request submission.
Debug Common Issues
"Model keeps unloading even though I set OLLAMA_KEEP_ALIVE=24h"
Check that the env var is actually set for the Ollama process:
systemctl show ollama --property Environment
# Should show: Environment=OLLAMA_KEEP_ALIVE=24h
If it's not there, your edit didn't take effect. Recheck /etc/systemd/system/ollama.service.d/override.conf exists and has the right content; restart Ollama.
"I set keep_alive on requests but the model still unloads"
Per-request keep_alive takes effect whenever the request loads or uses the model, and the most recent request's value wins. If the model was already loaded and a later request carried a shorter keep_alive, that shorter value now governs when the model unloads.
Also: if you load another model and OLLAMA_MAX_LOADED_MODELS=1, the first model gets evicted regardless of keep_alive.
"ollama ps shows 0% GPU"
Model loaded but not using GPU acceleration. Usually means:
- VRAM was too small for any layer offload (forced to CPU)
- GPU driver problem
- Wrong CUDA_VISIBLE_DEVICES (set to invalid index)
Fix by checking nvidia-smi, ensuring driver+CUDA are healthy, and verifying CUDA_VISIBLE_DEVICES values match real GPU indices.
"Model 'unloads' but VRAM doesn't free"
Some NVIDIA drivers hold VRAM cache after process unload. Usually frees within 30-60 seconds. If it sticks longer than that, you may have a different process still holding it (check per-process usage with nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv).
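To find out which process is holding VRAM from a script, one option is to parse nvidia-smi's per-process CSV output. The sketch below assumes the output of nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits (three comma-separated columns, memory in MiB); parse_compute_apps is a hypothetical helper, and the sample text and PIDs are made up for illustration.

```python
import csv
import io


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into dictionaries."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        pid, name, used = record
        if not (pid.isdigit() and used.isdigit()):
            continue  # skip rows where a value is reported as unavailable
        rows.append({"pid": int(pid), "process": name, "used_mib": int(used)})
    return rows


# Illustrative output, not captured from a real machine.
sample_output = "1234, /usr/local/bin/ollama, 5320\n5678, python, 2048\n"
for row in parse_compute_apps(sample_output):
    print(row)
```

If the Ollama process no longer appears but memory is still in use, the remaining rows point at the workload that is actually holding it.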
Recommended Configurations by Scenario
Solo hobby user, single model
[Service]
# Always loaded
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=1"
Solo hobby user, swapping between 2-3 models
[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"
# Keep up to 3 concurrent
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_NUM_PARALLEL=1"
Internal tool serving 10-50 users intermittently
[Service]
# Persistent during workday
Environment="OLLAMA_KEEP_ALIVE=2h"
# Concurrent serving
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Shared workstation with mixed GPU workloads (SD, training, LLM)
[Service]
# Aggressively free VRAM
Environment="OLLAMA_KEEP_ALIVE=2m"
Environment="OLLAMA_NUM_PARALLEL=1"
Plus consider ollama stop modelname after batch jobs to free VRAM immediately.
FAQ
Q: Does keep_alive affect CPU-only inference? Yes — same mechanism, but RAM instead of VRAM. Less critical because RAM is cheaper than VRAM, but still saves cold-load time.
Q: Can I manually unload a model without restarting Ollama? Yes:
ollama stop llama3.1
Or set keep_alive: 0 on the next request to that model.
Q: How is keep_alive different from OLLAMA_NUM_PARALLEL?
OLLAMA_KEEP_ALIVE controls how long an idle model stays in VRAM. OLLAMA_NUM_PARALLEL controls how many requests Ollama processes concurrently within one loaded model. The two settings are orthogonal.
Q: Does Ollama support model preloading on startup? Not directly via env var. Workaround: at startup, send a dummy request to each model you want loaded:
for m in llama3.1 deepseek-coder qwen3:14b; do
curl -s http://localhost:11434/api/generate -d "{\"model\": \"$m\", \"prompt\": \"hi\", \"keep_alive\": \"24h\"}" > /dev/null
done
Run this as a systemd ExecStartPost or cron @reboot.
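The same warm-up can be written in Python using only the standard library. This is a sketch under assumptions: build_warmup_request and warm_models are hypothetical names, the server is assumed to listen on localhost:11434, and the request omits the prompt on the understanding that Ollama loads the model without generating anything in that case (if your version behaves differently, add a short prompt as in the shell loop).

```python
import json
import urllib.request


def build_warmup_request(model, keep_alive="24h", base_url="http://localhost:11434"):
    """Build a POST to /api/generate that asks Ollama to load a model."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )


def warm_models(models, keep_alive="24h"):
    """Send one warm-up request per model; raises if the server is unreachable."""
    for model in models:
        request = build_warmup_request(model, keep_alive)
        with urllib.request.urlopen(request, timeout=300) as response:
            response.read()  # wait until the load has finished


# Usage once the server is up:
# warm_models(["llama3.1", "deepseek-coder", "qwen3:14b"])
```

Keeping the request construction separate from the network call makes the payload easy to inspect or log before anything is sent.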
Q: What if my model is bigger than VRAM and partially loads to CPU?
keep_alive still applies — Ollama keeps the model layers loaded across VRAM and RAM until timeout. The cold start time will be longer because more layers need to (re)load.
Q: Does keep_alive impact concurrent users? Indirectly. With keep_alive high and OLLAMA_NUM_PARALLEL=4, a loaded model can serve 4 concurrent users without reload. With keep_alive=0, each request triggers a load (terrible UX for multi-user).
Closing — The Rule
For most LocalLLaMA setups: set OLLAMA_KEEP_ALIVE=24h (or -1 if you have dedicated single-user hardware) and forget about cold starts. Override per-request only when you specifically want eviction (releasing VRAM for another workload).
The 5-minute default unloads too soon for power users and holds VRAM too long for heavily contended shared GPUs. Override it intentionally.
Related posts:
- Running Modern LLMs on GTX 1080 Ti in 2026
- Ollama Dual GPU Without NVLink — Tensor Split on 2× GTX 1080 Ti
- GGUF Quantization Showdown — Q4_K_M vs Q4_K_S vs IQ4_XS vs Q5_K_M
- Best Ollama Models for RTX 3090 24GB in 2026: Real Benchmarks
- LLM VRAM Calculator
References:
- Ollama documentation: https://github.com/ollama/ollama/blob/main/docs/faq.md
- Ollama API reference: https://github.com/ollama/ollama/blob/main/docs/api.md
- LocalLLaMA Ollama configuration threads, 2024-2026