Ollama OLLAMA_KEEP_ALIVE — How Model Memory Persistence Actually Works (2026)

Practical deep dive into Ollama's OLLAMA_KEEP_ALIVE — the variable that controls whether your loaded model stays in VRAM or gets unloaded after each request. Covers timeout semantics, multi-model scheduling, the per-request keep_alive parameter, and how to optimize for single-user, multi-user, and shared-VRAM scenarios.

Why This Matters More Than You'd Think

Loading an 8-14B GGUF model takes 5-15 seconds, and you pay that on the first request every time the model is cold. If you're poking at Ollama interactively, you rarely notice. If you're running a script that hits Ollama every few minutes (a Slack bot, a periodic enrichment job, a custom RAG app), every cold reload is a latency penalty for your users and a needless reload cycle for the GPU.

OLLAMA_KEEP_ALIVE is the variable that controls model unload behavior. Get it right and your model is always warm. Get it wrong and you either burn VRAM 24/7 or pay the cold-start penalty on every interaction.

This guide is the practical explanation of how it works in 2026 — defaults, edge cases, multi-model scheduling, and the cases where the per-request keep_alive parameter beats the environment variable.

Default Behavior (No Configuration)

When you start ollama serve with no environment variables, the default is:

  • After a request completes, the model stays loaded in VRAM for 5 minutes
  • Then the model is unloaded; next request reloads it (cold start)

This is conservative. Good if you're on a multi-user shared workstation where VRAM matters; bad if you're the only user and want models always ready.

Setting OLLAMA_KEEP_ALIVE

Set as an environment variable for ollama serve:

# Examples
export OLLAMA_KEEP_ALIVE=24h     # Keep model loaded for 24 hours after last use
export OLLAMA_KEEP_ALIVE=-1      # Keep loaded indefinitely (until manually unloaded)
export OLLAMA_KEEP_ALIVE=0       # Unload immediately after each request
export OLLAMA_KEEP_ALIVE=30m     # 30 minutes
export OLLAMA_KEEP_ALIVE=5s      # 5 seconds (testing only)
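
To make the value formats above concrete, here is a minimal Python sketch that maps a keep-alive setting to seconds. The helper name keep_alive_seconds is hypothetical, and this is an illustration of the formats listed above rather than Ollama's own parser; it assumes bare integers are read as seconds and only handles the s, m, and h units.

```python
import re

# Hypothetical helper for illustration; not Ollama's own parser.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def keep_alive_seconds(value: str) -> float:
    """Map a keep-alive setting to seconds; a negative result means 'never unload'."""
    value = value.strip()
    # Assumption: bare integers such as "0" or "-1" are read as seconds.
    if re.fullmatch(r"-?\d+", value):
        return float(int(value))
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)([smh])", value)
    if match is None:
        raise ValueError(f"unsupported keep-alive value: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_SECONDS[unit]


for setting in ("24h", "-1", "0", "30m", "5s"):
    print(setting, "->", keep_alive_seconds(setting))
```

A negative result corresponds to "keep loaded indefinitely" and zero to "unload right after the request".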

For systemd-managed Ollama:

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"

Then:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Verify it took:

systemctl show ollama --property Environment

Per-Request keep_alive Parameter

The API also accepts a per-request keep_alive value that overrides the environment default for that one request:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Hi",
  "keep_alive": "1h"
}'

Or from Python:

import requests
r = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.1",
    "prompt": "Summarize this email",
    "keep_alive": "30m",   # this request asks model to stay loaded 30 min after this completes
})

Special values for per-request:

  • 0 or "0" — unload model immediately after this request
  • -1 — keep loaded indefinitely
  • Duration strings: "30s", "5m", "1h", "24h" (a bare positive number is read as seconds)

What keep_alive Actually Counts From

A common confusion: the timer resets on every request. It doesn't count from when the model was first loaded.

Example with OLLAMA_KEEP_ALIVE=10m:

  • T=0: load Llama 3.1 8B for first prompt → model in VRAM, unload due at T=10
  • T=5min: another prompt → timer resets, unload now due at T=15 (5 + 10)
  • T=12min: another prompt arrives before T=15, so the model is still loaded → timer resets again, unload due at T=22 (12 + 10)
  • T=22min: no requests since T=12 → model unloaded

This means active use = persistent loading. The timeout only matters during idle periods.
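
The reset behavior can be sketched as a tiny simulation. This is a simplified model under stated assumptions: requests are treated as instantaneous, and simulate is a hypothetical helper, not anything Ollama ships.

```python
def simulate(request_minutes, keep_alive_minutes):
    """Return (cold_starts, unload_minute) for a list of request times.

    Every request pushes the unload deadline to request time + keep_alive.
    A request that arrives at or after the current deadline is a cold start.
    """
    cold_starts = 0
    expiry = None
    for t in sorted(request_minutes):
        if expiry is None or t >= expiry:
            cold_starts += 1  # model was not loaded, so it must be (re)loaded
        expiry = t + keep_alive_minutes  # timer resets on every request
    return cold_starts, expiry


# The example above: requests at T=0, 5, and 12 with a 10-minute keep-alive.
print(simulate([0, 5, 12], 10))      # one cold start, unload due at minute 22
print(simulate([0, 5, 12, 30], 10))  # the T=30 request lands after the unload
```

Plugging in your own traffic pattern gives a rough count of how many cold starts a given keep-alive value would cause.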

Verify Model Is Loaded

Check what's currently in memory:

ollama ps
# Or via API:
curl http://localhost:11434/api/ps

Output:

NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    365c0bd3c000    5.1 GB    100% GPU     59 minutes from now

The UNTIL column shows when this specific model will unload. Useful for debugging "why is it cold again?" issues — ollama ps tells you the truth.
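
For scripted checks, the same information can be read from the API. The sketch below assumes the /api/ps response carries a models list whose entries include name, size_vram, and expires_at; verify the field names against your Ollama version. The function names are hypothetical.

```python
import json
import urllib.request


def fetch_loaded_models(base_url="http://localhost:11434"):
    """Call /api/ps and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)


def summarize(ps_body):
    """Return (name, VRAM in GB, expiry timestamp) for each loaded model."""
    rows = []
    for model in ps_body.get("models", []):
        vram_gb = round(model.get("size_vram", 0) / 1e9, 1)
        rows.append((model.get("name"), vram_gb, model.get("expires_at")))
    return rows
```

Calling summarize(fetch_loaded_models()) against a running server returns one tuple per loaded model; an empty list means nothing is resident.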

Multi-Model Scheduling

If you use multiple models (Llama 3.1 for chat, DeepSeek-Coder for code, Qwen 3 for Korean), the scheduling gets interesting:

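Single model at a time (OLLAMA_MAX_LOADED_MODELS=1)

  • Request to model A loads A
  • Request to model B unloads A, loads B
  • Per-model keep_alive does NOT prevent eviction when another model is requested

Recent Ollama releases allow more than one loaded model by default when memory permits, but the same eviction happens whenever the limit is 1 or the second model does not fit in VRAM alongside the first.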

This is the source of most "why is my model cold again?" frustration.

Multi-model concurrent (OLLAMA_MAX_LOADED_MODELS=N)

Set this to keep multiple models simultaneously loaded:

export OLLAMA_MAX_LOADED_MODELS=3

Then up to 3 models stay in VRAM concurrently, provided they fit. A request for a 4th model evicts the least recently used (LRU) one. Combined with a generous OLLAMA_KEEP_ALIVE, you get persistent multi-model serving.
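
A minimal sketch of the LRU behavior described above, using a hypothetical LoadedModels class. It only models the slot limit; real scheduling also depends on available VRAM and keep-alive expiry.

```python
from collections import OrderedDict


class LoadedModels:
    """Toy model of a slot-limited, least-recently-used model pool."""

    def __init__(self, max_loaded):
        self.max_loaded = max_loaded
        self._models = OrderedDict()  # least recently used first

    def request(self, name):
        """Record a request; return the evicted model name, or None."""
        evicted = None
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
        else:
            if len(self._models) >= self.max_loaded:
                evicted, _ = self._models.popitem(last=False)  # drop the LRU entry
            self._models[name] = True
        return evicted

    def loaded(self):
        return list(self._models)


pool = LoadedModels(max_loaded=3)
for name in ("llama3.1", "deepseek-coder", "qwen3:14b", "llama3.1"):
    pool.request(name)
# llama3.1 was just used again, so deepseek-coder is now the oldest entry.
print(pool.request("mistral"), pool.loaded())
```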

VRAM math becomes critical:

3 × Llama 3.1 8B Q4_K_M @ 4K ctx = 3 × 5.5 GB = 16.5 GB

On a 24 GB card, that fits with room for KV cache growth and a system process. On 11 GB you can only hold one 8B at a time.
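
The same arithmetic as a quick check you can rerun for your own card. The fits helper and its 1.5 GB headroom default are assumptions for illustration; actual usage varies with context length and driver overhead.

```python
def fits(model_sizes_gb, vram_gb, headroom_gb=1.5):
    """Return (fits, total_gb) for a set of models on one card.

    headroom_gb is an assumed reserve for KV cache growth and other processes.
    """
    total = sum(model_sizes_gb)
    return total + headroom_gb <= vram_gb, total


print(fits([5.5, 5.5, 5.5], vram_gb=24))  # three 8B models on a 24 GB card
print(fits([5.5, 5.5], vram_gb=11))       # two 8B models on an 11 GB card
```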

Per-model TTL via per-request keep_alive

If you want different models to have different persistence policies, use the per-request keep_alive:

# Coder model — only needed during specific tasks, evict aggressively
requests.post(url, json={"model": "deepseek-coder", "prompt": p, "keep_alive": "5m"})

# Chat model — keep loaded all day
requests.post(url, json={"model": "llama3.1", "prompt": p, "keep_alive": "24h"})

VRAM Math — Old vs New Hardware

For a single 11 GB GTX 1080 Ti:

  • Llama 3.1 8B Q4_K_M @ 4K ctx: ~5.5 GB → 1 model at a time
  • Llama 3.1 8B Q4_K_M @ 8K ctx: ~6.5 GB → still 1 model
  • Two 8B models concurrent: OOM

For 2× 11 GB (22 GB combined, via OLLAMA_SCHED_SPREAD — see Ollama Dual GPU Without NVLink):

  • One 14B at Q4_K_M (~9.5 GB): fits with headroom
  • Two 8B Q4_K_M concurrently: ~11 GB — fits but tight
  • 30B class (Mixtral, Yi-34B): one model only

For 24 GB RTX 3090 / 4090:

  • Three 8B models concurrent: ~16-18 GB — comfortable
  • 30B + 8B concurrent: viable

Cold Start Time Reality

Measured first-request latency (time to first token after model is cold):

Model                 | Size  | RTX 3090 cold       | GTX 1080 Ti cold
Llama 3.1 8B Q4_K_M   | 5 GB  | 3-5 s               | 8-12 s
Llama 3.1 8B Q8_0     | 9 GB  | 5-7 s               | 14-18 s
Qwen 3 14B Q4_K_M     | 9 GB  | 6-9 s               | 18-22 s
Mixtral 8×7B Q4_K_M   | 27 GB | 18-25 s             | n/a (OOM single)
Llama 3.1 70B Q4_K_M  | 42 GB | 30-50 s (multi-GPU) | n/a

These add up if you're hitting cold loads multiple times per day. For an interactive Slack bot serving 100 users a day with occasional bursts and gaps: keep_alive=24h might save 20-50 cold loads × ~10s = 200-500s of cumulative latency.
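
To see whether a given request actually paid a cold load, you can inspect the timing fields in the final /api/generate response. The sketch assumes load_duration is present and reported in nanoseconds; the 1-second threshold and the function names are hypothetical choices for illustration.

```python
def load_seconds(final_response):
    """Convert load_duration (assumed nanoseconds) to seconds; None if absent."""
    nanoseconds = final_response.get("load_duration")
    return None if nanoseconds is None else nanoseconds / 1e9


def looks_cold(final_response, threshold_s=1.0):
    """Heuristic: treat a request as a cold start if loading took >= threshold_s."""
    seconds = load_seconds(final_response)
    return seconds is not None and seconds >= threshold_s


# Hand-written sample values, not measurements.
print(looks_cold({"load_duration": 8_500_000_000}))  # 8.5 s spent loading
print(looks_cold({"load_duration": 20_000_000}))     # 0.02 s, model was warm
```

Logging this per request gives you a real cold-load count to compare against the estimate above.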

When to Set 0 / Unload Immediately

The use case: shared workstation with multiple users or competing GPU workloads (Stable Diffusion, model training, jupyter notebook with PyTorch).

# Always unload after each request
export OLLAMA_KEEP_ALIVE=0

Or per-request:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "...",
  "keep_alive": 0
}'

VRAM is freed within 1-2 seconds of the response completing. Pay cold start on next use. Worth it if other people are losing GPU access otherwise.

When to Set -1 / Indefinite

A dedicated single-user inference box:

export OLLAMA_KEEP_ALIVE=-1

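Model stays loaded forever (until ollama stop modelname or service restart). Zero cold starts. The cost is VRAM that is never released, plus idle power: a card with a resident model can sit in a higher power state than one with nothing loaded, though well below the ~250-350W it draws under load. Check your own card with nvidia-smi. It's an electricity cost vs latency cost tradeoff.

For LocalLLaMA hobby servers running 24/7, this is the right setting. For a shared resource, do NOT use -1.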

Hidden Behavior — Reloading on Driver/Sleep

Ollama's persistence is process-level. If:

  • Your machine sleeps and wakes — model reloads
  • Your NVIDIA driver crashes and restarts — model reloads
  • Ollama service restarts — model reloads
  • The Linux OOM killer terminates Ollama — model reloads (and you should fix your memory pressure)

keep_alive doesn't survive these. If long-term persistence matters, monitor with:

# Watch for unexpected Ollama restarts
journalctl -u ollama -f

API Subtlety — keep_alive with Streaming

When using streaming response:

import requests
with requests.post("http://localhost:11434/api/generate",
                   json={"model": "llama3.1", "prompt": p, "stream": True, "keep_alive": "1h"},
                   stream=True) as r:
    for line in r.iter_lines():
        ...

The keep_alive semantics are: from when the response completes (last token streamed), the model stays loaded for the duration. So if a long-form generation takes 30 seconds, the 1-hour timer starts after that 30-second response, not at request submission.

Debug Common Issues

"Model keeps unloading even though I set OLLAMA_KEEP_ALIVE=24h"

Check that the env var is actually set for the Ollama process:

systemctl show ollama --property Environment
# Should show: Environment=OLLAMA_KEEP_ALIVE=24h

If it's not there, your edit didn't take effect. Recheck /etc/systemd/system/ollama.service.d/override.conf exists and has the right content; restart Ollama.

"I set keep_alive on requests but the model still unloads"

Per-request keep_alive only applies when the request actually loads or uses the model, and the timer always follows the most recent request. If the model was already loaded and a later request came in with a shorter keep_alive, the shorter value replaced the longer one.

Also: if you load another model and OLLAMA_MAX_LOADED_MODELS=1, the first model gets evicted regardless of keep_alive.
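
The "last request wins" rule can be sketched as follows. The model assumes that a request without its own keep_alive falls back to the server default, and expiry_after is a hypothetical helper, so treat it as a reasoning aid rather than a description of Ollama internals.

```python
def expiry_after(request_log, default_minutes):
    """Return the unload minute after a series of requests to one model.

    request_log: list of (minute, keep_alive_minutes or None).
    None means the request did not set keep_alive (assumed to use the default).
    """
    expiry = None
    for minute, keep_alive in sorted(request_log, key=lambda entry: entry[0]):
        ttl = default_minutes if keep_alive is None else keep_alive
        expiry = minute + ttl  # each request overwrites the previous deadline
    return expiry


# A 60-minute request at T=0 followed by a 5-minute request at T=10:
# the later, shorter value decides, so the model unloads at T=15.
print(expiry_after([(0, 60), (10, 5)], default_minutes=1440))
```

If one client in a mixed fleet sends a short keep_alive, it can shorten the lifetime that other clients asked for.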

"ollama ps shows 0% GPU"

Model loaded but not using GPU acceleration. Usually means:

  • VRAM was too small for any layer offload (forced to CPU)
  • GPU driver problem
  • Wrong CUDA_VISIBLE_DEVICES (set to invalid index)

Fix by checking nvidia-smi, ensuring driver+CUDA are healthy, and verifying CUDA_VISIBLE_DEVICES values match real GPU indices.

"Model 'unloads' but VRAM doesn't free"

Some NVIDIA drivers hold VRAM cache after process unload. Usually frees within 30-60 seconds. If it sticks longer than that, a different process may still be holding it (check per-process usage with nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv).

Solo hobby user, single model

[Service]
# Always loaded (systemd only treats # as a comment at the start of a line)
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=1"

Solo hobby user, swapping between 2-3 models

[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"
# Keep up to 3 concurrent
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_NUM_PARALLEL=1"

Internal tool serving 10-50 users intermittently

[Service]
# Persistent during workday
Environment="OLLAMA_KEEP_ALIVE=2h"
# Concurrent serving
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"

Shared workstation with mixed GPU workloads (SD, training, LLM)

[Service]
# Aggressively free VRAM
Environment="OLLAMA_KEEP_ALIVE=2m"
Environment="OLLAMA_NUM_PARALLEL=1"

Plus consider ollama stop modelname after batch jobs to free VRAM immediately.

FAQ

Q: Does keep_alive affect CPU-only inference? Yes — same mechanism, but RAM instead of VRAM. Less critical because RAM is cheaper than VRAM, but still saves cold-load time.

Q: Can I manually unload a model without restarting Ollama? Yes:

ollama stop llama3.1

Or set keep_alive: 0 on the next request to that model.

Q: How is keep_alive different from OLLAMA_NUM_PARALLEL? KEEP_ALIVE = how long an idle model stays in VRAM. NUM_PARALLEL = how many requests Ollama processes concurrently within one loaded model. The two are orthogonal.

Q: Does Ollama support model preloading on startup? Not directly via env var. Workaround: at startup, send a dummy request to each model you want loaded:

for m in llama3.1 deepseek-coder qwen3:14b; do
  curl -s http://localhost:11434/api/generate -d "{\"model\": \"$m\", \"prompt\": \"hi\", \"keep_alive\": \"24h\"}" > /dev/null
done

Run this from a systemd ExecStartPost= hook or a cron @reboot entry, once the API has started accepting connections.
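
A Python variant of the warm-up loop that waits for the server before sending requests. This is a sketch under assumptions: the function names, retry counts, and timeouts are illustrative, and it uses /api/ps purely as a readiness probe.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:11434"


def wait_until_ready(base_url=BASE_URL, attempts=30, delay_s=1.0):
    """Poll /api/ps until the server answers; return False if it never does."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"{base_url}/api/ps", timeout=2):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(delay_s)
    return False


def preload_payload(model, keep_alive="24h"):
    """Build the JSON body for a warm-up request (same fields as the curl loop)."""
    body = {"model": model, "prompt": "hi", "keep_alive": keep_alive, "stream": False}
    return json.dumps(body).encode()


def preload(models, base_url=BASE_URL):
    """Send one warm-up request per model once the API is reachable."""
    if not wait_until_ready(base_url):
        raise RuntimeError("Ollama API did not come up")
    for model in models:
        request = urllib.request.Request(
            f"{base_url}/api/generate",
            data=preload_payload(model),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=300) as resp:
            resp.read()  # discard the reply; the goal is only to load the model
```

Calling preload(["llama3.1", "deepseek-coder", "qwen3:14b"]) mirrors the shell loop, with the readiness wait covering the window where the service has started but is not yet listening.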

Q: What if my model is bigger than VRAM and partially loads to CPU? keep_alive still applies — Ollama keeps the model layers loaded across VRAM and RAM until timeout. The cold start time will be longer because more layers need to (re)load.

Q: Does keep_alive impact concurrent users? Indirectly. With keep_alive high and OLLAMA_NUM_PARALLEL=4, a loaded model can serve 4 concurrent users without reload. With keep_alive=0, each request triggers a load (terrible UX for multi-user).

Closing — The Rule

For most LocalLLaMA setups: set OLLAMA_KEEP_ALIVE=24h (or -1 if you have dedicated single-user hardware) and forget about cold starts. Override per-request only when you specifically want eviction (releasing VRAM for another workload).

The default of 5 minutes is too short for power users and too long for shared resources. Override it intentionally.

