Terminal Interface¶
Core commands¶
```shell
ollm                         # interactive terminal chat
ollm chat                    # explicit alias for interactive chat
ollm prompt "List planets"   # one-shot prompt
ollm doctor --json           # environment and runtime diagnostics
ollm models list             # built-in and discovered local model references
ollm serve                   # local-only REST API server
```
Use `ollm` or `ollm chat` only from an interactive terminal. For scripts, pipes, and automation, use `ollm prompt`.
Runtime controls¶
- `--backend` forces a valid local backend for the resolved model reference
- `--no-specialization` disables optimized-native specialization and prefers the generic path when available
- `--plan-json` prints the resolved runtime plan and exits without running generation
`ollm prompt`, `ollm chat`, `ollm doctor`, and `ollm models info` all honor these controls.
Configuration sources¶
oLLM now resolves runtime, generation, and server defaults through an explicit layered settings contract:
- CLI flags
- `OLLM_*` environment variables
- TOML config file values
- built-in defaults
By default, oLLM checks ./ollm.toml when it is present. You can point to a
different config file with OLLM_CONFIG_FILE=/path/to/ollm.toml.
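That layering can be sketched with a small helper. This is a hypothetical illustration, not oLLM's internal API, and it assumes the list above is ordered from highest to lowest precedence:

```python
# Hypothetical sketch of the layered settings contract: check each source
# in precedence order and return the first value found.
def resolve(key, cli_flags, env_vars, toml_values, defaults):
    """Return the highest-precedence value for a settings key."""
    for source in (cli_flags, env_vars, toml_values, defaults):
        if key in source:
            return source[key]
    raise KeyError(key)

# An environment variable beats the TOML file when no CLI flag is passed.
value = resolve(
    "max_new_tokens",
    cli_flags={},                      # not set on the command line
    env_vars={"max_new_tokens": 512},  # e.g. OLLM_GENERATION__MAX_NEW_TOKENS=512
    toml_values={"max_new_tokens": 256},
    defaults={"max_new_tokens": 128},
)
print(value)  # 512
```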
Nested environment variables use a double-underscore separator:
```
OLLM_RUNTIME__MODEL_REFERENCE
OLLM_RUNTIME__MODELS_DIR
OLLM_RUNTIME__DEVICE
OLLM_RUNTIME__DENSE_PROJECTION_CHUNK_ROWS
OLLM_GENERATION__MAX_NEW_TOKENS
OLLM_GENERATION__STREAM
OLLM_SERVER__PORT
```
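The double-underscore convention maps `OLLM_<SECTION>__<FIELD>` onto a nested settings section. The parser below is a hypothetical sketch of that mapping, not oLLM's actual loader:

```python
# Hypothetical sketch: turn OLLM_<SECTION>__<FIELD> environment variables
# into a nested overrides mapping. Names without "__" (e.g. OLLM_CONFIG_FILE)
# are not nested settings keys and are skipped here.
def env_overrides(environ):
    settings = {}
    for name, value in environ.items():
        if not name.startswith("OLLM_"):
            continue
        section, sep, field = name[len("OLLM_"):].partition("__")
        if not sep:
            continue
        settings.setdefault(section.lower(), {})[field.lower()] = value
    return settings

overrides = env_overrides({
    "OLLM_RUNTIME__DEVICE": "cpu",
    "OLLM_GENERATION__MAX_NEW_TOKENS": "256",
    "OLLM_SERVER__PORT": "8000",
    "OLLM_CONFIG_FILE": "/path/to/ollm.toml",  # not nested, skipped
})
print(overrides["runtime"]["device"])  # cpu
```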
Example config file:
```toml
[runtime]
model_reference = "llama3-8B-chat"
models_dir = "models"
device = "cpu"
# dense_projection_chunk_rows = 8192

[generation]
max_new_tokens = 256
temperature = 0.0
stream = true

[server]
host = "127.0.0.1"
port = 8000
response_store_backend = "none"
```
The current settings surface covers runtime defaults, generation defaults, and server defaults. Prompt-specific values outside that schema, such as the system prompt text, remain explicit CLI options today.
Local server¶
Install the optional server transport stack first:
```shell
uv sync --extra server
```
Then start the local-only server:
```shell
ollm serve
```
ollm serve resolves host, port, reload, log_level, and response-store
settings through the
same settings-precedence contract as the rest of the CLI. The default bind is
127.0.0.1, and the server also publishes:
- `/openapi.json`
- `/docs`
- `/redoc`
The current REST surface is:
- `GET /v1/health`
- `GET /v1/models`
- `GET /v1/models/{model_id}`
- `POST /v1/chat/completions`
- `POST /v1/responses`
- `GET /v1/responses/{response_id}`
- `DELETE /v1/responses/{response_id}`
- `GET /v1/ollm/models`
- `GET /v1/ollm/models/{model_reference}`
- `POST /v1/plan`
- `POST /v1/prompt`
- `POST /v1/prompt/stream`
- `POST /v1/sessions`
- `GET /v1/sessions/{session_id}`
- `POST /v1/sessions/{session_id}/prompt`
- `POST /v1/sessions/{session_id}/prompt/stream`
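As a minimal sketch, a `POST /v1/chat/completions` request can be built with only the standard library. This assumes the usual OpenAI-compatible request shape; the model name is illustrative, and the actual send is commented out so the sketch runs without a server:

```python
import json
import urllib.request

# OpenAI-compatible chat completions payload (model name is illustrative).
payload = {
    "model": "llama3-8B-chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "List planets"},
    ],
}

request = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With `ollm serve` running locally:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```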
The streaming transport is SSE-based and the current server-side sessions are
in-memory only. The OpenAI-compatible /v1/responses surface supports custom
function tools, tool_choice, function_call_output chaining, and typed
function-call SSE events, plus delete/retrieval when a response-store backend is
enabled. See Local Server API for the complete HTTP
surface and CLI ollm serve for command-specific usage.
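Consuming the SSE transport reduces to reading `data:` lines separated by blank lines. The sketch below shows that framing with illustrative event payloads, not oLLM's actual event schema:

```python
import json

# Illustrative SSE stream: each event is a "data: <json>" line
# followed by a blank line.
raw_stream = (
    'data: {"delta": "Mercury"}\n'
    '\n'
    'data: {"delta": ", Venus"}\n'
    '\n'
)

chunks = []
for line in raw_stream.splitlines():
    if line.startswith("data: "):
        chunks.append(json.loads(line[len("data: "):])["delta"])

print("".join(chunks))  # Mercury, Venus
```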
Model references¶
--model accepts opaque model references. Supported forms include:
- built-in aliases such as `llama3-1B-chat` and `gemma3-12B`
- Hugging Face repo IDs such as `Qwen/Qwen2.5-7B-Instruct`
- local model directories
Provider-prefixed references are rejected so execution stays inside oLLM's local runtime boundary.
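A hypothetical validator for that boundary rule might look like the following. The `provider:` colon syntax and the helper itself are assumptions for illustration, not oLLM's actual check:

```python
# Hypothetical sketch: reject provider-prefixed references so execution
# stays inside the local runtime boundary. A colon before any path
# separator is treated as a provider prefix (a deliberate simplification).
def check_model_reference(reference: str) -> str:
    prefix, sep, _ = reference.partition(":")
    if sep and "/" not in prefix and "\\" not in prefix:
        raise ValueError(
            f"provider-prefixed reference rejected: {reference!r}; "
            "use a built-in alias, Hugging Face repo ID, or local directory"
        )
    return reference

check_model_reference("llama3-1B-chat")            # built-in alias: accepted
check_model_reference("Qwen/Qwen2.5-7B-Instruct")  # repo ID: accepted
```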
Support levels¶
oLLM reports one of three active support levels for a resolved model reference:
- `optimized`: a native specialization provider can run the reference
- `generic`: the Transformers-backed generic runtime can run the reference
- `unsupported`: the reference resolves, but the current runtime cannot execute it
Discovery and availability terms¶
ollm models list is a discovery view. It combines:
- `built-in`: entries shipped by oLLM
- `discovered-local`: entries found under `--models-dir`
Availability for local references uses:
- `materialized`
- `not-materialized`
ollm models list --installed filters to materialized local entries only.
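Combining the two views can be sketched with hypothetical entry records (the field names here are illustrative, not oLLM's output schema):

```python
# Illustrative discovery entries combining built-in and discovered-local
# origins, each carrying an availability state.
entries = [
    {"reference": "llama3-1B-chat", "origin": "built-in",
     "availability": "not-materialized"},
    {"reference": "my-model", "origin": "discovered-local",
     "availability": "materialized"},
    {"reference": "other-model", "origin": "discovered-local",
     "availability": "not-materialized"},
]

# Simplified --installed filter: keep only materialized entries.
installed = [e for e in entries if e["availability"] == "materialized"]
print([e["reference"] for e in installed])  # ['my-model']
```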
Generic and optimized execution¶
The generic execution path covers compatible local or materialized Transformers-backed:
- causal language models
- encoder-decoder text generation models
- image-text conditional generation models that expose a processor-backed
vision_config
When the resolved model matches a native family specialization (`llama`, `gemma3`, `qwen3-next`, `gpt-oss`, or `voxtral`), oLLM records and selects an optimized-native specialization provider through the runtime plan instead of hard-coding model-family branches in `Inference.load_model()`.
Specialization visibility and fallback¶
Planning-only surfaces such as ollm doctor and ollm models info --json expose the resolved backend, specialization state, and planned specialization pass ids without loading a runtime.
Execution surfaces follow the finalized runtime plan. Prompt response metadata includes:
- execution backend
- specialization state
- execution device type for optimized-native runs
- specialization device profile for optimized-native runs
- applied specialization pass ids
- any recorded fallback reason
If an optimized specialization cannot satisfy its planned pass contract and a compatible generic path exists, oLLM falls back safely to transformers-generic instead of pretending the optimized path succeeded.
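The fallback rule can be sketched as a small decision function. The pass ids and the function itself are illustrative assumptions; only the backend names come from this page:

```python
# Hypothetical sketch of the fallback rule: run optimized-native only when
# every planned specialization pass is satisfiable; otherwise fall back to
# transformers-generic and record the reason.
def select_backend(planned_passes, satisfiable_passes, generic_available):
    unmet = [p for p in planned_passes if p not in satisfiable_passes]
    if not unmet:
        return "optimized-native", None
    if generic_available:
        return "transformers-generic", f"unmet specialization passes: {unmet}"
    raise RuntimeError(f"no runnable backend: unmet passes {unmet}")

# Illustrative pass ids: one planned pass cannot be satisfied.
backend, fallback_reason = select_backend(
    planned_passes=["fuse-qkv", "quantize-kv-cache"],
    satisfiable_passes=["fuse-qkv"],
    generic_available=True,
)
print(backend)  # transformers-generic
```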