# Configuration

## Precedence
oLLM resolves runtime, generation, and server defaults through the same explicit precedence chain:
- CLI flags
- `OLLM_*` environment variables
- TOML config file values
- built-in defaults
By default, oLLM loads `./ollm.toml` when it is present. You can point to a
different config file with `OLLM_CONFIG_FILE=/path/to/ollm.toml`.
An example config file that covers runtime, generation, and server settings
lives at `examples/ollm.toml` in the repository root.
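The shape of such a file follows the `runtime`, `generation`, and `server` groups described below. The sketch here is illustrative only; the values are placeholder assumptions, not recommended defaults, and `examples/ollm.toml` remains the authoritative template.

```toml
# Illustrative sketch only -- see examples/ollm.toml for the real template.
[runtime]
model_reference = "my-org/my-model"      # hypothetical model reference
device = "cuda:0"
strategy_selector_profile = "balanced"

[generation]
max_new_tokens = 256
stream = true

[server]
host = "127.0.0.1"
port = 8000
log_level = "info"
```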
## Runtime configuration
The CLI and the Python library both build on `RuntimeConfig` and `GenerationConfig`.
Key runtime configuration fields:
- `model_reference` — the model to resolve
- `models_dir` — root for local materialized models
- `device` — torch device such as `cpu` or `cuda:0`
- `backend` — optional backend override
- `multimodal` — enable multimodal planning when non-text inputs are expected
- `use_specialization` — whether optimized-native specialization is allowed
- `cache_dir` / `use_cache` — KV cache controls
- `strategy_selector_profile` — selector profile (`balanced`, `latency`, `capacity`, or `bounded-window`)
- `kv_cache_strategy` — optional explicit KV cache strategy override (`resident`, `chunked`, `paged`, `streamed-segmented`, `log-structured-journal`, `sliding-window-ring-buffer`, `quantized-cold-tier`, or `tiered-write-back`)
- `kv_cache_window_tokens` — bounded recent-context token budget for `sliding-window-ring-buffer`; the field is invalid for full-history strategies
- `dense_projection_chunk_rows` — optional explicit row budget for dense optimized-native MLP chunking; when omitted, the dense Llama, Gemma3, and Voxtral paths derive smaller chunks only when accelerator headroom is tight
- `kv_cache_lifecycle` — whether KV artifacts are `runtime-scoped` or explicitly `persistent`; `resident` requires `runtime-scoped`
- `kv_cache_adaptation_mode` — whether adaptation telemetry is `disabled`, `observe-only`, or `automatic` (live switching is still not enabled)
- `offload_cpu_layers` — native CPU offload layer budget when supported
- `offload_cpu_policy` — CPU offload placement policy (`auto`, `prefix`, `suffix`, or `middle-band`)
- `offload_gpu_layers` — GPU offload layer budget for mixed-placement-capable runtimes
Current constraints:

- `offload_cpu_layers` requires an accelerator runtime device
- `offload_cpu_layers` cannot be combined with `offload_gpu_layers` in the current implementation
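The two offload constraints can be pictured with a small validation sketch. This is a hypothetical mirror of the documented rules only; `OffloadSettings` and its `validate` method are illustrative names, not oLLM's actual `RuntimeConfig` implementation.

```python
# Hypothetical sketch of the documented offload constraints; oLLM's real
# RuntimeConfig validation may be structured differently.
from dataclasses import dataclass
from typing import Optional


@dataclass
class OffloadSettings:
    device: str
    offload_cpu_layers: Optional[int] = None
    offload_gpu_layers: Optional[int] = None

    def validate(self) -> None:
        if self.offload_cpu_layers is not None and self.device == "cpu":
            # CPU offload is only meaningful relative to an accelerator device.
            raise ValueError(
                "offload_cpu_layers requires an accelerator runtime device"
            )
        if self.offload_cpu_layers is not None and self.offload_gpu_layers is not None:
            raise ValueError(
                "offload_cpu_layers cannot be combined with offload_gpu_layers"
            )


# A CUDA device with a CPU offload budget passes; the two rejected shapes raise.
OffloadSettings(device="cuda:0", offload_cpu_layers=8).validate()
```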
Current selector behavior:

- the selector default path is deterministic and table-driven
- `paged`, `resident`, and `quantized-cold-tier` are the current selector-default candidates
- `sliding-window-ring-buffer` remains an explicit, bounded-history opt-in only
- `streamed-segmented`, `log-structured-journal`, and `tiered-write-back` remain explicit overrides for now
Generation configuration fields:
- `max_new_tokens`
- `temperature`
- `top_p`
- `top_k`
- `seed`
- `stream`
Server configuration fields:
- `host`
- `port`
- `reload`
- `log_level`
See API Reference: Runtime Config.
## Environment variables
Nested configuration keys use a double-underscore separator:
- `OLLM_RUNTIME__MODEL_REFERENCE`
- `OLLM_RUNTIME__MODELS_DIR`
- `OLLM_RUNTIME__DEVICE`
- `OLLM_RUNTIME__STRATEGY_SELECTOR_PROFILE`
- `OLLM_RUNTIME__KV_CACHE_STRATEGY`
- `OLLM_RUNTIME__KV_CACHE_WINDOW_TOKENS`
- `OLLM_RUNTIME__DENSE_PROJECTION_CHUNK_ROWS`
- `OLLM_RUNTIME__KV_CACHE_LIFECYCLE`
- `OLLM_RUNTIME__KV_CACHE_ADAPTATION_MODE`
- `OLLM_RUNTIME__OFFLOAD_CPU_LAYERS`
- `OLLM_RUNTIME__OFFLOAD_CPU_POLICY`
- `OLLM_RUNTIME__OFFLOAD_GPU_LAYERS`
- `OLLM_GENERATION__MAX_NEW_TOKENS`
- `OLLM_GENERATION__STREAM`
- `OLLM_SERVER__HOST`
- `OLLM_SERVER__PORT`
- `OLLM_SERVER__LOG_LEVEL`
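The double-underscore convention can be illustrated with a small parser sketch. This is not oLLM's actual loader — `env_to_nested` is a hypothetical helper — but it shows how `OLLM_RUNTIME__DEVICE` maps onto the nested `runtime.device` key.

```python
# Hypothetical sketch of double-underscore env-var nesting; not oLLM's loader.
def env_to_nested(environ: dict[str, str], prefix: str = "OLLM_") -> dict:
    config: dict = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        # "RUNTIME__DEVICE" -> ["runtime", "device"]
        path = key[len(prefix):].lower().split("__")
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return config


print(env_to_nested({"OLLM_RUNTIME__DEVICE": "cuda:0",
                     "OLLM_SERVER__PORT": "8080"}))
# -> {'runtime': {'device': 'cuda:0'}, 'server': {'port': '8080'}}
```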
## Backend override
`--backend` lets you force one of:

- `optimized-native`
- `transformers-generic`
The override is validated against the resolved model reference. If the backend is incompatible, oLLM fails with a structured error instead of silently falling back to another backend.
## Specialization toggle
`--no-specialization` disables optimized-native specialization selection and forces the generic path when one exists.
Important constraint:

- `--backend optimized-native` cannot be combined with `--no-specialization`