Configuration

Precedence

oLLM resolves runtime, generation, and server settings through a single explicit precedence chain, from highest to lowest priority:

  1. CLI flags
  2. OLLM_* environment variables
  3. TOML config file values
  4. built-in defaults
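The chain can be sketched as a simple first-match lookup. The helper and source names below are illustrative only, not oLLM's internals:

```python
def resolve(key, cli_flags, env_vars, toml_values, defaults):
    """Return the first value found, scanning highest- to lowest-priority source."""
    for source in (cli_flags, env_vars, toml_values, defaults):
        if key in source and source[key] is not None:
            return source[key]
    raise KeyError(f"no value or default for {key!r}")

# A CLI flag wins over everything; a TOML value wins over a built-in default.
cli = {"device": "cuda:0"}
env = {"device": "cpu", "port": 8080}
toml = {"port": 9000, "log_level": "debug"}
defaults = {"device": "cpu", "port": 8000, "log_level": "info"}

print(resolve("device", cli, env, toml, defaults))     # cuda:0 (CLI flag)
print(resolve("port", cli, env, toml, defaults))       # 8080 (environment)
print(resolve("log_level", cli, env, toml, defaults))  # debug (TOML)
```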

By default, oLLM loads ./ollm.toml when it is present. You can point to a different config file with OLLM_CONFIG_FILE=/path/to/ollm.toml.

An example config file that covers runtime, generation, and server settings lives at examples/ollm.toml in the repository root.
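A hedged illustration of the file layout, using section and field names from the lists below; the values are invented, and examples/ollm.toml remains the authoritative reference:

```toml
# Illustrative layout only — see examples/ollm.toml for the authoritative file.
[runtime]
model_reference = "my-org/my-model"
models_dir = "~/.ollm/models"
device = "cuda:0"
strategy_selector_profile = "balanced"

[generation]
max_new_tokens = 256
temperature = 0.7
stream = true

[server]
host = "127.0.0.1"
port = 8000
log_level = "info"
```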

Runtime configuration

The CLI and the Python library both build on RuntimeConfig and GenerationConfig.

Key runtime configuration fields:

  • model_reference — the model to resolve
  • models_dir — root for local materialized models
  • device — torch device such as cpu or cuda:0
  • backend — optional backend override
  • multimodal — enable multimodal planning when non-text inputs are expected
  • use_specialization — whether optimized-native specialization is allowed
  • cache_dir / use_cache — KV cache controls
  • strategy_selector_profile — selector profile (balanced, latency, capacity, or bounded-window)
  • kv_cache_strategy — optional explicit KV cache strategy override (resident, chunked, paged, streamed-segmented, log-structured-journal, sliding-window-ring-buffer, quantized-cold-tier, or tiered-write-back)
  • kv_cache_window_tokens — bounded recent-context token budget for sliding-window-ring-buffer; the field is invalid for full-history strategies
  • dense_projection_chunk_rows — optional explicit row budget for dense optimized-native MLP chunking; when omitted, the dense Llama, Gemma3, and Voxtral paths derive smaller chunks only when accelerator headroom is tight
  • kv_cache_lifecycle — whether KV artifacts are runtime-scoped or explicitly persistent; resident requires runtime-scoped
  • kv_cache_adaptation_mode — whether adaptation telemetry is disabled, observe-only, or automatic (live switching is still not enabled)
  • offload_cpu_layers — native CPU offload layer budget when supported
  • offload_cpu_policy — CPU offload placement policy (auto, prefix, suffix, or middle-band)
  • offload_gpu_layers — GPU offload layer budget for mixed-placement-capable runtimes

Current constraints:

  • offload_cpu_layers requires an accelerator runtime device
  • offload_cpu_layers cannot be combined with offload_gpu_layers in the current implementation
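The two constraints above can be expressed as a small up-front check. This is an illustrative sketch, not oLLM's actual validation code:

```python
def validate_offload(device, offload_cpu_layers=None, offload_gpu_layers=None):
    """Mirror the documented offload constraints (illustrative, not oLLM internals)."""
    if offload_cpu_layers is not None:
        # CPU offload only makes sense when the runtime device is an accelerator.
        if device.startswith("cpu"):
            raise ValueError("offload_cpu_layers requires an accelerator runtime device")
        # The current implementation treats the two layer budgets as mutually exclusive.
        if offload_gpu_layers is not None:
            raise ValueError("offload_cpu_layers cannot be combined with offload_gpu_layers")

validate_offload("cuda:0", offload_cpu_layers=8)  # accepted
```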

Current selector behavior:

  • the selector's default path is deterministic and table-driven
  • paged, resident, and quantized-cold-tier are the current selector-default candidates
  • sliding-window-ring-buffer remains an explicit bounded-history opt-in only
  • streamed-segmented, log-structured-journal, and tiered-write-back remain explicit overrides for now
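A deterministic, table-driven selector with an explicit-override escape hatch can be sketched as below. The profile-to-strategy mapping here is hypothetical; only the shape (a fixed table plus override-wins logic) reflects the documented behavior:

```python
from typing import Optional

# Hypothetical mapping from selector profile to default strategy. The real
# table lives inside oLLM's selector and may assign profiles differently.
SELECTOR_TABLE = {
    "latency": "resident",
    "balanced": "paged",
    "capacity": "quantized-cold-tier",
}

def select_strategy(profile: str, kv_cache_strategy: Optional[str] = None) -> str:
    # An explicit override always wins; bounded-history strategies such as
    # sliding-window-ring-buffer are only reachable through this path.
    if kv_cache_strategy is not None:
        return kv_cache_strategy
    # Deterministic default path: same inputs always yield the same answer.
    return SELECTOR_TABLE[profile]

print(select_strategy("balanced"))                               # paged
print(select_strategy("latency", "sliding-window-ring-buffer"))  # explicit opt-in
```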

Generation configuration fields:

  • max_new_tokens
  • temperature
  • top_p
  • top_k
  • seed
  • stream

Server configuration fields:

  • host
  • port
  • reload
  • log_level

See API Reference: Runtime Config.

Environment variables

Nested configuration keys use a double-underscore separator:

  • OLLM_RUNTIME__MODEL_REFERENCE
  • OLLM_RUNTIME__MODELS_DIR
  • OLLM_RUNTIME__DEVICE
  • OLLM_RUNTIME__STRATEGY_SELECTOR_PROFILE
  • OLLM_RUNTIME__KV_CACHE_STRATEGY
  • OLLM_RUNTIME__KV_CACHE_WINDOW_TOKENS
  • OLLM_RUNTIME__DENSE_PROJECTION_CHUNK_ROWS
  • OLLM_RUNTIME__KV_CACHE_LIFECYCLE
  • OLLM_RUNTIME__KV_CACHE_ADAPTATION_MODE
  • OLLM_RUNTIME__OFFLOAD_CPU_LAYERS
  • OLLM_RUNTIME__OFFLOAD_CPU_POLICY
  • OLLM_RUNTIME__OFFLOAD_GPU_LAYERS
  • OLLM_GENERATION__MAX_NEW_TOKENS
  • OLLM_GENERATION__STREAM
  • OLLM_SERVER__HOST
  • OLLM_SERVER__PORT
  • OLLM_SERVER__LOG_LEVEL
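The double-underscore convention groups variables into nested sections. A minimal sketch of that grouping, assuming a simple prefix-and-split scheme rather than oLLM's actual settings loader:

```python
import os

def env_to_nested(prefix="OLLM_"):
    """Group OLLM_* variables into nested config sections by splitting on '__'."""
    config = {}
    for name, value in os.environ.items():
        if not name.startswith(prefix) or "__" not in name:
            continue
        section, _, field = name[len(prefix):].partition("__")
        config.setdefault(section.lower(), {})[field.lower()] = value
    return config

os.environ["OLLM_RUNTIME__DEVICE"] = "cuda:0"
os.environ["OLLM_SERVER__PORT"] = "8080"
print(env_to_nested())  # nests under 'runtime' and 'server' sections
```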

Backend override

--backend lets you force one of:

  • optimized-native
  • transformers-generic

The override is validated against the resolved model reference. If the backend is incompatible, oLLM fails with a structured error instead of silently switching to something else.

Specialization toggle

--no-specialization disables optimized-native specialization selection and forces the generic path when one exists.

Important constraint:

  • --backend optimized-native cannot be combined with --no-specialization
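The flag combination above is contradictory: forcing the optimized-native backend while disabling the specialization selection it depends on leaves nothing to run. A minimal sketch of such a check, not oLLM's actual CLI code:

```python
def validate_cli(backend=None, no_specialization=False):
    """Illustrative check for the documented flag constraint (not oLLM's code)."""
    if backend == "optimized-native" and no_specialization:
        raise ValueError(
            "--backend optimized-native cannot be combined with --no-specialization"
        )

# Forcing the generic backend alongside --no-specialization is fine.
validate_cli(backend="transformers-generic", no_specialization=True)
```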