Ollama OLLAMA_KEEP_ALIVE — How Model Memory Persistence Actually Works (2026)

Why This Matters More Than You'd Think

Loading an 8-14B GGUF model takes 5-15 seconds, and you pay that on the first request every time the model is cold. If you're poking at Ollama interactively, you barely notice. If you're running something that hits Ollama every few minutes (a Slack bot, a periodic enrichment job, a custom RAG app), every cold reload is a latency penalty for users and a wasted load cycle for the GPU.

OLLAMA_KEEP_ALIVE is the variable that controls model unload behavior. Get it right and your model is always warm. Get it wrong and you either burn VRAM 24/7 or pay the cold-start penalty on every interaction.

This guide is the practical explanation of how it works in 2026 — defaults, edge cases, multi-model scheduling, and the cases where the per-request keep_alive parameter beats the environment variable.

Default Behavior (No Configuration)

When you start ollama serve with no environment variables, the default is:

After a request completes, the model stays loaded in VRAM for 5 minutes
Then the model is unloaded; next request reloads it (cold start)

This is conservative. Good if you're on a multi-user shared workstation where VRAM matters; bad if you're the only user and want models always ready.

Setting OLLAMA_KEEP_ALIVE

Set as an environment variable for ollama serve:

# Examples
export OLLAMA_KEEP_ALIVE=24h     # Keep model loaded for 24 hours after last use
export OLLAMA_KEEP_ALIVE=-1      # Keep loaded indefinitely (until manually unloaded)
export OLLAMA_KEEP_ALIVE=0       # Unload immediately after each request
export OLLAMA_KEEP_ALIVE=30m     # 30 minutes
export OLLAMA_KEEP_ALIVE=5s      # 5 seconds (testing only)

For systemd-managed Ollama:

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"

Then:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Verify it took:

systemctl show ollama --property Environment

Per-Request keep_alive Parameter

The API also accepts a per-request keep_alive value that overrides the environment default for that one request:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Hi",
  "keep_alive": "1h"
}'

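# The same kind of request from Python
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.1",
    "prompt": "Summarize this email",
    "stream": False,       # return one JSON object instead of a stream
    "keep_alive": "30m",   # keep the model loaded for 30 min after this request completes
}, timeout=300)
print(r.json()["response"])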

Special values for per-request:

0 or "0" — unload model immediately after this request
-1 — keep loaded indefinitely
Duration strings: "30s", "5m", "1h", "24h"
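To make the accepted values concrete, here is a minimal Python sketch that classifies a keep_alive value the way the list above describes. The helper name describe_keep_alive is hypothetical, and the parsing is deliberately simplified (single-unit strings only, bare numbers treated as seconds). Ollama's own parser may accept more forms.

```python
import re

# Seconds per unit for the simple duration strings shown above.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def describe_keep_alive(value):
    """Classify a keep_alive value as 'unload', 'forever', or a number of seconds.

    Hypothetical helper: handles numbers and single-unit strings like "30m" only.
    """
    if isinstance(value, (int, float)):
        seconds = float(value)
    else:
        match = re.fullmatch(r"(-?\d+)([smh]?)", value.strip())
        if match is None:
            raise ValueError(f"unsupported keep_alive value: {value!r}")
        number, unit = match.groups()
        # Assumption for this sketch: a bare number means seconds.
        seconds = float(number) * _UNIT_SECONDS.get(unit, 1)
    if seconds == 0:
        return "unload"
    if seconds < 0:
        return "forever"
    return seconds
```

Running client-side values through a check like this before sending them can catch typos such as "30min" early.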

What `keep_alive` Actually Counts From

A common point of confusion: the timer counts from the most recent request, not from when the model was first loaded. Every request resets it.

Example with OLLAMA_KEEP_ALIVE=10m:

T=0: load Llama 3.1 8B for first prompt → model in VRAM
T=5min: another prompt → timer resets to 10 min from now
T=12min: another prompt arrives before the T=15 deadline (5 + 10), so the model is still loaded → timer resets again
T=23min: no requests since T=12 → model was unloaded at T=22 (12 + 10)

This means active use = persistent loading. The timeout only matters during idle periods.
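The timeline above can be reproduced with a few lines of Python. This is a simplified model, not Ollama's scheduler: it treats each request as instantaneous and uses a hypothetical function name, unload_time.

```python
def unload_time(request_minutes, keep_alive_minutes):
    """Return the minute at which the model first unloads.

    Each request resets the idle timer, so an unload only happens when the
    gap to the next request is longer than keep_alive_minutes.
    """
    times = sorted(request_minutes)
    for current, following in zip(times, times[1:]):
        deadline = current + keep_alive_minutes
        if following > deadline:
            return deadline  # the idle gap exceeded the timeout
    return times[-1] + keep_alive_minutes  # unloads after the last request


# Requests at T=0, T=5 and T=12 with a 10-minute keep-alive.
print(unload_time([0, 5, 12], 10))  # 22
```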

Verify Model Is Loaded

Check what's currently in memory:

ollama ps
# Or via API:
curl http://localhost:11434/api/ps

Output:

NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    365c0bd3c000    5.1 GB    100% GPU     59 minutes from now

The UNTIL column shows when this specific model will unload. Useful for debugging "why is it cold again?" issues — ollama ps tells you the truth.
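For scripts, the same information is available from the /api/ps endpoint. The sketch below assumes the response has a "models" list whose entries carry "name", "size", "size_vram" and "expires_at", which is how I read the API reference. Treat those field names as assumptions and check them against your Ollama version. The function names are hypothetical.

```python
import json
import urllib.request


def fetch_ps(base_url="http://localhost:11434"):
    """Fetch the raw /api/ps payload from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)


def summarize_ps(payload):
    """Return (name, gpu_percent, expires_at) for each loaded model."""
    rows = []
    for model in payload.get("models", []):
        size = model.get("size", 0)
        vram = model.get("size_vram", 0)
        # Share of the model held in VRAM; 0 if the size is missing.
        gpu_percent = round(100 * vram / size) if size else 0
        rows.append((model.get("name"), gpu_percent, model.get("expires_at")))
    return rows
```

With a server running, summarize_ps(fetch_ps()) gives one row per loaded model. An empty list means nothing is resident and the next request will be a cold start.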

Multi-Model Scheduling

If you use multiple models (Llama 3.1 for chat, DeepSeek-Coder for code, Qwen 3 for Korean), the scheduling gets interesting:

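Single model at a time (`OLLAMA_MAX_LOADED_MODELS=1`, or not enough VRAM for two)

How many models can be resident at once is governed by OLLAMA_MAX_LOADED_MODELS and by available memory, not by OLLAMA_NUM_PARALLEL, which only controls concurrent requests within one loaded model. When only one model is allowed, or only one fits: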

Request to model A loads A
Request to model B unloads A, loads B
Per-model keep_alive does NOT prevent eviction when another model is requested

This is the source of most "why is my model cold again?" frustration.

Multi-model concurrent (`OLLAMA_MAX_LOADED_MODELS=N`)

Set this to keep multiple models simultaneously loaded:

export OLLAMA_MAX_LOADED_MODELS=3

Then up to 3 models stay in VRAM concurrently, provided they fit in available memory. A request for a fourth model evicts the least recently used (LRU) one. Combined with a generous OLLAMA_KEEP_ALIVE, you get persistent multi-model serving.

VRAM math becomes critical:

3 × 8B-class Q4_K_M models (Llama 3.1 8B sized) @ 4K ctx ≈ 3 × 5.5 GB = 16.5 GB

On a 24 GB card, that fits with room for KV cache growth and a system process. On 11 GB you can only hold one 8B at a time.
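The same budget check is easy to script. This sketch uses the per-model figure from above (about 5.5 GB for an 8B Q4_K_M model at 4K context). The 1 GB headroom default is my own assumption for driver and desktop overhead, and vram_plan is a hypothetical name.

```python
def vram_plan(card_gb, model_sizes_gb, headroom_gb=1.0):
    """Return (total_model_gb, fits) for a set of models on one card.

    headroom_gb is an assumed reserve for the driver, desktop, and KV cache
    growth; tune it for your own system.
    """
    total = sum(model_sizes_gb)
    return total, total + headroom_gb <= card_gb


print(vram_plan(24, [5.5, 5.5, 5.5]))  # (16.5, True)
print(vram_plan(11, [5.5, 5.5]))       # (11.0, False)
```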

Per-model TTL via per-request keep_alive

If you want different models to have different persistence policies, use the per-request keep_alive:

# Coder model — only needed during specific tasks, evict aggressively
requests.post(url, json={"model": "deepseek-coder", "prompt": p, "keep_alive": "5m"})

# Chat model — keep loaded all day
requests.post(url, json={"model": "llama3.1", "prompt": p, "keep_alive": "24h"})
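If several scripts share the same models, it may be easier to keep the policy in one place. A minimal sketch, assuming a hypothetical KEEP_ALIVE_POLICY table and build_request helper (neither is part of Ollama):

```python
# Hypothetical per-model policy; values are the ones used in the example above.
KEEP_ALIVE_POLICY = {
    "deepseek-coder": "5m",   # evict aggressively
    "llama3.1": "24h",        # keep loaded all day
}


def build_request(model, prompt, default_keep_alive=None):
    """Build an /api/generate payload with the model's keep_alive policy.

    Models without a policy entry omit keep_alive unless a default is given,
    so the server-side OLLAMA_KEEP_ALIVE setting applies to them.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    keep_alive = KEEP_ALIVE_POLICY.get(model, default_keep_alive)
    if keep_alive is not None:
        payload["keep_alive"] = keep_alive
    return payload
```

The returned dict can be passed directly as the json= argument of requests.post.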

VRAM Math — Old vs New Hardware

For a single 11 GB GTX 1080 Ti:

Llama 3.1 8B Q4_K_M @ 4K ctx: ~5.5 GB → 1 model at a time
Llama 3.1 8B Q4_K_M @ 8K ctx: ~6.5 GB → still 1 model
Two 8B models concurrent: OOM

For 2× 11 GB (22 GB combined, via OLLAMA_SCHED_SPREAD — see Ollama Dual GPU Without NVLink):

One 14B at Q4_K_M (~9.5 GB): fits with headroom
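Two 8B Q4_K_M concurrently: ~11 GB total, fits (roughly one model per card)
30B class (e.g. Yi-34B): one model only, and tight; Mixtral 8×7B at ~27 GB (see the cold-start table) exceeds 22 GB and would spill to CPU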

For 24 GB RTX 3090 / 4090:

Three 8B models concurrent: ~16-18 GB — comfortable
30B + 8B concurrent: viable

Cold Start Time Reality

Measured first-request latency (time to first token after model is cold):

Model	Size	RTX 3090 cold	GTX 1080 Ti cold
Llama 3.1 8B Q4_K_M	5 GB	3-5 s	8-12 s
Llama 3.1 8B Q8_0	9 GB	5-7 s	14-18 s
Qwen 3 14B Q4_K_M	9 GB	6-9 s	18-22 s
Mixtral 8×7B Q4_K_M	27 GB	18-25 s	n/a (OOM single)
Llama 3.1 70B Q4_K_M	42 GB	30-50 s (multi-GPU)	n/a

These add up if you're hitting cold loads multiple times per day. For an interactive Slack bot serving 100 users a day with occasional bursts and gaps, keep_alive=24h might avoid 20-50 cold loads a day, which at ~10 s each is 200-500 s of cumulative user-facing latency.
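To estimate this for your own traffic, you can count how many requests in a log would have hit a cold model under a given timeout. The sketch below is a rough model: it ignores load and generation time, does not handle the "keep forever" case, and count_cold_loads is a hypothetical name.

```python
def count_cold_loads(request_minutes, keep_alive_minutes):
    """Count requests that arrive after the idle timer has expired.

    The first request is always cold. keep_alive_minutes must be >= 0.
    """
    cold = 0
    expires_at = None
    for t in sorted(request_minutes):
        if expires_at is None or t >= expires_at:
            cold += 1
        expires_at = t + keep_alive_minutes  # every request resets the timer
    return cold


requests_seen = [0, 3, 20, 22, 60]           # minutes since start (made-up data)
default = count_cold_loads(requests_seen, 5)  # 5-minute default
all_day = count_cold_loads(requests_seen, 24 * 60)
print(default, all_day, (default - all_day) * 10, "seconds saved at ~10 s per load")
```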

When to Set 0 / Unload Immediately

The use case: a shared workstation with multiple users or competing GPU workloads (Stable Diffusion, model training, Jupyter notebooks running PyTorch).

# Always unload after each request
export OLLAMA_KEEP_ALIVE=0

Or per-request:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "...",
  "keep_alive": 0
}'

VRAM is freed within 1-2 seconds of the response completing. You pay the cold start on the next use, which is worth it if other people would otherwise lose GPU access.
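A batch job can also release the model explicitly when it finishes. As I read the Ollama documentation, a request that names the model, omits the prompt, and sets keep_alive to 0 asks the server to unload it. The helper names below are hypothetical, and the sketch uses only the standard library.

```python
import json
import urllib.request


def unload_payload(model):
    """Body of a prompt-less request asking Ollama to unload the model."""
    return {"model": model, "keep_alive": 0, "stream": False}


def unload_model(model, base_url="http://localhost:11434"):
    """POST the unload request and return the server's JSON reply."""
    data = json.dumps(unload_payload(model)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Calling unload_model("llama3.1") at the end of a job has the same intent as running ollama stop llama3.1 from the shell.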

When to Set -1 / Indefinite

A dedicated single-user inference box:

export OLLAMA_KEEP_ALIVE=-1

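The model stays loaded until you run ollama stop modelname or the service restarts, so there are zero cold starts. The cost is VRAM that is never released, plus idle power that is typically somewhat higher than with the model unloaded, because the GPU may not drop to its lowest power state. The 250-350 W range is what many cards draw while generating, not while holding an idle model, so measure idle draw with nvidia-smi on your own hardware. It is still an electricity-versus-latency tradeoff.

For LocalLLaMA hobby servers running 24/7, this is the right setting. For a shared resource, do NOT use -1.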

Hidden Behavior — Reloading on Driver/Sleep

Ollama's persistence is process-level. If:

Your machine sleeps and wakes — model reloads
Your NVIDIA driver crashes and restarts — model reloads
Ollama service restarts — model reloads
The Linux OOM killer terminates Ollama — model reloads (and you should fix your memory pressure)

keep_alive doesn't survive these. If long-term persistence matters, monitor with:

# Watch for unexpected Ollama restarts
journalctl -u ollama -f

API Subtlety — keep_alive with Streaming

When using a streaming response:

import requests
with requests.post("http://localhost:11434/api/generate",
                   json={"model": "llama3.1", "prompt": p, "stream": True, "keep_alive": "1h"},
                   stream=True) as r:
    for line in r.iter_lines():
        ...

The timer starts when the response completes (last token streamed), not when the request is submitted. So if a long-form generation takes 30 seconds, the 1-hour countdown begins after those 30 seconds.
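On the client side, "response completes" corresponds to the final streamed object, which carries "done": true. A minimal sketch of consuming the stream, assuming each line is a JSON object with "response" and "done" fields (collect_stream is a hypothetical helper):

```python
import json


def collect_stream(lines):
    """Join streamed chunks and report whether the final chunk was seen.

    Returns (text, finished). The idle timer on the server side only starts
    counting once the response has finished.
    """
    parts = []
    finished = False
    for raw in lines:
        if not raw:
            continue  # skip empty lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            finished = True
            break
    return "".join(parts), finished
```

In the requests example above, this could be called as collect_stream(r.iter_lines()).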

Debug Common Issues

"Model keeps unloading even though I set OLLAMA_KEEP_ALIVE=24h"

Check that the env var is actually set for the Ollama process:

systemctl show ollama --property Environment
# Should show: Environment=OLLAMA_KEEP_ALIVE=24h

If it's not there, your edit didn't take effect. Check that /etc/systemd/system/ollama.service.d/override.conf exists and has the right content, then run sudo systemctl daemon-reload and restart Ollama.
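If you check this from a deployment script, the output of systemctl show can be parsed rather than eyeballed. A small sketch, assuming the usual single-line "Environment=KEY=value KEY2=value2" format (parse_systemd_environment is a hypothetical helper):

```python
import shlex


def parse_systemd_environment(show_output):
    """Parse the Environment= line from `systemctl show` into a dict."""
    env = {}
    for line in show_output.splitlines():
        if not line.startswith("Environment="):
            continue
        # shlex handles values that systemd prints with quotes.
        for item in shlex.split(line[len("Environment="):]):
            key, sep, value = item.partition("=")
            if sep:
                env[key] = value
    return env


sample = "Environment=OLLAMA_KEEP_ALIVE=24h OLLAMA_MAX_LOADED_MODELS=3\n"
print(parse_systemd_environment(sample).get("OLLAMA_KEEP_ALIVE"))  # 24h
```

The real output can be captured with subprocess.run(["systemctl", "show", "ollama", "--property", "Environment"], capture_output=True, text=True).stdout.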

"I set keep_alive on requests but the model still unloads"

The timer is set by the most recent request that used the model, not by the longest value ever sent. If the model was already loaded and a later request came in with a shorter keep_alive, the shorter value wins.

Also: if you load another model and OLLAMA_MAX_LOADED_MODELS=1, the first model gets evicted regardless of keep_alive.

"ollama ps shows 0% GPU"

The model is loaded but is not using GPU acceleration. This usually means:

VRAM was too small for any layer offload (forced to CPU)
GPU driver problem
Wrong CUDA_VISIBLE_DEVICES (set to invalid index)

Fix by checking nvidia-smi, ensuring driver+CUDA are healthy, and verifying CUDA_VISIBLE_DEVICES values match real GPU indices.

"Model 'unloads' but VRAM doesn't free"

Some NVIDIA drivers hold VRAM cache after process unload. It usually frees within 30-60 seconds. If it sticks longer than that, a different process may still be holding it (check with nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv).

Recommended Configurations by Scenario

Solo hobby user, single model

[Service]
# Always loaded
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=1"

Solo hobby user, swapping between 2-3 models

[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"
# Keep up to 3 models loaded concurrently
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_NUM_PARALLEL=1"

Internal tool serving 10-50 users intermittently

[Service]
# Persistent during the workday
Environment="OLLAMA_KEEP_ALIVE=2h"
# Concurrent serving
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"

Shared workstation with mixed GPU workloads (SD, training, LLM)

[Service]
# Aggressively free VRAM
Environment="OLLAMA_KEEP_ALIVE=2m"
Environment="OLLAMA_NUM_PARALLEL=1"

Plus consider ollama stop modelname after batch jobs to free VRAM immediately.
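If you manage several machines, generating the drop-in from a table avoids copy-paste drift. The sketch below renders the [Service] section from a dict. The SCENARIOS table simply restates the values recommended above, and render_override is a hypothetical helper, not an Ollama or systemd tool.

```python
# Values copied from the scenarios above.
SCENARIOS = {
    "solo_single": {"OLLAMA_KEEP_ALIVE": "-1", "OLLAMA_NUM_PARALLEL": "1"},
    "solo_multi": {
        "OLLAMA_KEEP_ALIVE": "24h",
        "OLLAMA_MAX_LOADED_MODELS": "3",
        "OLLAMA_NUM_PARALLEL": "1",
    },
    "internal_tool": {
        "OLLAMA_KEEP_ALIVE": "2h",
        "OLLAMA_NUM_PARALLEL": "4",
        "OLLAMA_MAX_LOADED_MODELS": "2",
    },
    "shared": {"OLLAMA_KEEP_ALIVE": "2m", "OLLAMA_NUM_PARALLEL": "1"},
}


def render_override(env):
    """Render a systemd drop-in [Service] section from a dict of variables."""
    lines = ["[Service]"]
    for key, value in env.items():
        lines.append(f'Environment="{key}={value}"')
    return "\n".join(lines) + "\n"


print(render_override(SCENARIOS["shared"]))
```

The printed text is what would go into the override file opened by sudo systemctl edit ollama.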

FAQ

Q: Does keep_alive affect CPU-only inference?

Yes. It is the same mechanism, but with RAM instead of VRAM. Less critical because RAM is cheaper than VRAM, but it still saves cold-load time.

Q: Can I manually unload a model without restarting Ollama?

Yes:

ollama stop llama3.1

Or set keep_alive: 0 on the next request to that model.

Q: How is keep_alive different from OLLAMA_NUM_PARALLEL?

KEEP_ALIVE = how long an idle model stays in VRAM. NUM_PARALLEL = how many requests Ollama processes concurrently within one loaded model. The two are orthogonal.

Q: Does Ollama support model preloading on startup?

Not directly via env var. Workaround: at startup, send a dummy request to each model you want loaded:

for m in llama3.1 deepseek-coder qwen3:14b; do
  curl -s http://localhost:11434/api/generate -d "{\"model\": \"$m\", \"prompt\": \"hi\", \"keep_alive\": \"24h\"}" > /dev/null
done

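Run this as a systemd ExecStartPost or cron @reboot. In either case, wait until the API is accepting connections before sending the requests, since the server may not be listening yet when the hook fires.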
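The loop above assumes the server is already accepting connections. A startup hook may run before that, so a small readiness check can help. This is a sketch using only the standard library; api_is_up and wait_until_ready are hypothetical helpers, and probing /api/ps is just one convenient way to test that the server answers.

```python
import time
import urllib.error
import urllib.request


def api_is_up(base_url="http://localhost:11434"):
    """Return True if the Ollama API answers a simple GET."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/ps", timeout=2):
            return True
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe=api_is_up, attempts=30, delay_s=1.0, sleep=time.sleep):
    """Call probe() until it succeeds or attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay_s)
    return False
```

A preload script would call wait_until_ready() first and only then send the warm-up requests.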

Q: What if my model is bigger than VRAM and partially loads to CPU?

keep_alive still applies: Ollama keeps the model layers loaded across VRAM and RAM until timeout. The cold start time will be longer because more layers need to (re)load.

Q: Does keep_alive impact concurrent users?

Indirectly. With a high keep_alive and OLLAMA_NUM_PARALLEL=4, a loaded model can serve 4 concurrent users without a reload. With keep_alive=0, the model unloads after every request, so nearly every request pays a load (terrible UX for multi-user).

Closing — The Rule

For most LocalLLaMA setups: set OLLAMA_KEEP_ALIVE=24h (or -1 if you have dedicated single-user hardware) and forget about cold starts. Override per-request only when you specifically want eviction (releasing VRAM for another workload).

The default of 5 minutes is too short for power users and too long for heavily shared GPUs. Override it intentionally.
