Terminal Interface

Core commands

ollm                         # interactive terminal chat
ollm chat                    # explicit alias for interactive chat
ollm prompt "List planets"   # one-shot prompt
ollm doctor --json           # environment and runtime diagnostics
ollm models list             # built-in and discovered local model references
ollm serve                   # local-only REST API server

Use ollm or ollm chat only from an interactive terminal. For scripts, pipes, and automation, use ollm prompt instead.
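For automation, a script can wrap the one-shot command. The following is a minimal Python sketch and not part of oLLM: the helper names build_prompt_command and run_prompt are hypothetical, and it assumes that --model may precede the prompt text and that the reply is written to stdout.

```python
import shutil
import subprocess
from typing import List, Optional


def build_prompt_command(prompt: str, model: Optional[str] = None) -> List[str]:
    """Build the argv for a one-shot, non-interactive `ollm prompt` call."""
    argv = ["ollm", "prompt"]
    if model is not None:
        # --model is documented under "Model references"; placing it before
        # the prompt text is an assumption of this sketch.
        argv += ["--model", model]
    argv.append(prompt)
    return argv


def run_prompt(prompt: str, model: Optional[str] = None) -> str:
    """Run `ollm prompt` and return stdout (assumes the reply is printed there)."""
    if shutil.which("ollm") is None:
        raise RuntimeError("ollm executable not found on PATH")
    result = subprocess.run(
        build_prompt_command(prompt, model),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling run_prompt("List planets") mirrors the ollm prompt "List planets" line in the command list above.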

Runtime controls

  • --backend forces a valid local backend for the resolved model reference
  • --no-specialization disables optimized-native specialization and prefers the generic path when available
  • --plan-json prints the resolved runtime plan and exits without running generation

ollm prompt, ollm chat, ollm doctor, and ollm models info all honor these controls.
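As an illustration, the controls can be combined to inspect a plan from a script without running generation. This is a hedged sketch: build_plan_command and resolve_plan are hypothetical helpers, the flag ordering is an assumption, and the shape of the printed plan is not documented here, so the result is returned as parsed JSON without interpreting it.

```python
import json
import subprocess
from typing import Any, List, Optional


def build_plan_command(
    prompt: str,
    model: str,
    backend: Optional[str] = None,
    no_specialization: bool = False,
) -> List[str]:
    """Build an `ollm prompt` argv that only prints the runtime plan."""
    argv = ["ollm", "prompt", "--model", model, "--plan-json"]
    if backend is not None:
        argv += ["--backend", backend]
    if no_specialization:
        argv.append("--no-specialization")
    argv.append(prompt)  # assumption: the prompt is still passed positionally
    return argv


def resolve_plan(prompt: str, model: str, **controls: Any) -> Any:
    """Run the command and parse stdout, assuming it is a single JSON document."""
    result = subprocess.run(
        build_plan_command(prompt, model, **controls),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```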

Configuration sources

oLLM resolves runtime, generation, and server defaults through an explicit layered settings contract. The sources are listed from highest to lowest precedence:

  1. CLI flags
  2. OLLM_* environment variables
  3. TOML config file values
  4. built-in defaults

By default, oLLM checks ./ollm.toml when it is present. You can point to a different config file with OLLM_CONFIG_FILE=/path/to/ollm.toml.
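The layering can be pictured as a nested-dictionary merge in which earlier sources win. The sketch below is illustrative only and is not oLLM's implementation: merge_layers is a hypothetical helper, the default values are made up, and it assumes the numbered list above runs from highest to lowest precedence.

```python
from typing import Any, Dict, Mapping


def _deep_update(target: Dict[str, Any], source: Mapping[str, Any]) -> None:
    """Copy `source` into `target`, descending into nested sections."""
    for key, value in source.items():
        if isinstance(value, Mapping):
            child = target.get(key)
            if not isinstance(child, dict):
                child = {}
                target[key] = child
            _deep_update(child, value)
        else:
            target[key] = value


def merge_layers(*layers: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge settings layers; layers passed first take precedence."""
    merged: Dict[str, Any] = {}
    for layer in reversed(layers):  # apply lowest precedence first
        _deep_update(merged, layer)
    return merged


# Illustrative values only; the real built-in defaults are not listed here.
defaults = {
    "runtime": {"device": "cpu"},
    "generation": {"max_new_tokens": 128, "stream": False},
    "server": {"port": 8000},
}
toml_file = {"generation": {"max_new_tokens": 256, "stream": True}}
env = {"server": {"port": 9000}}
cli = {"generation": {"max_new_tokens": 64}}

settings = merge_layers(cli, env, toml_file, defaults)
```

Here the CLI value for max_new_tokens overrides the TOML value, the environment overrides the default port, and untouched keys fall through to the defaults.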

Nested environment variables use a double-underscore separator:

  • OLLM_RUNTIME__MODEL_REFERENCE
  • OLLM_RUNTIME__MODELS_DIR
  • OLLM_RUNTIME__DEVICE
  • OLLM_RUNTIME__DENSE_PROJECTION_CHUNK_ROWS
  • OLLM_GENERATION__MAX_NEW_TOKENS
  • OLLM_GENERATION__STREAM
  • OLLM_SERVER__PORT
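To make the naming rule concrete, the following sketch maps OLLM_* variables onto nested section/key pairs. It is an assumption-laden illustration rather than oLLM's parser: env_to_nested is a hypothetical name, values are left as strings (no type coercion), and variables without a double underscore, such as OLLM_CONFIG_FILE, are skipped.

```python
from typing import Any, Dict, Mapping


def env_to_nested(environ: Mapping[str, str], prefix: str = "OLLM_") -> Dict[str, Any]:
    """Turn OLLM_SECTION__KEY variables into {"section": {"key": value}}."""
    nested: Dict[str, Any] = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        parts = name[len(prefix):].lower().split("__")
        if len(parts) < 2:
            continue  # no section separator, e.g. OLLM_CONFIG_FILE
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value  # kept as a string in this sketch
    return nested


example = env_to_nested(
    {
        "OLLM_RUNTIME__MODEL_REFERENCE": "llama3-8B-chat",
        "OLLM_SERVER__PORT": "8000",
        "OLLM_CONFIG_FILE": "/path/to/ollm.toml",
        "HOME": "/home/user",
    }
)
```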

Example config file:

[runtime]
model_reference = "llama3-8B-chat"
models_dir = "models"
device = "cpu"
# dense_projection_chunk_rows = 8192

[generation]
max_new_tokens = 256
temperature = 0.0
stream = true

[server]
host = "127.0.0.1"
port = 8000
response_store_backend = "none"

The current settings surface covers runtime defaults, generation defaults, and server defaults. Prompt-specific values outside that schema, such as the system prompt text, remain explicit CLI options.

Local server

Install the optional server transport stack first:

uv sync --extra server

Then start the local-only server:

ollm serve

ollm serve resolves host, port, reload, log_level, and response-store settings through the same settings-precedence contract as the rest of the CLI. The default bind address is 127.0.0.1. The server also publishes its API schema and documentation pages at:

  • /openapi.json
  • /docs
  • /redoc

The current REST surface is:

  • GET /v1/health
  • GET /v1/models
  • GET /v1/models/{model_id}
  • POST /v1/chat/completions
  • POST /v1/responses
  • GET /v1/responses/{response_id}
  • DELETE /v1/responses/{response_id}
  • GET /v1/ollm/models
  • GET /v1/ollm/models/{model_reference}
  • POST /v1/plan
  • POST /v1/prompt
  • POST /v1/prompt/stream
  • POST /v1/sessions
  • GET /v1/sessions/{session_id}
  • POST /v1/sessions/{session_id}/prompt
  • POST /v1/sessions/{session_id}/prompt/stream
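As a quick way to exercise the read-only endpoints from a script, the sketch below uses only the Python standard library. The helper names api_url and get_json are hypothetical, port 8000 is taken from the example config rather than from a documented default, and the responses are assumed to be JSON; request bodies for the POST endpoints are not shown because their schemas are not described here.

```python
import json
import urllib.request
from typing import Any


def api_url(path: str, host: str = "127.0.0.1", port: int = 8000) -> str:
    """Build a URL for the local server; host and port are assumptions."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def get_json(path: str, host: str = "127.0.0.1", port: int = 8000) -> Any:
    """Send a GET request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(api_url(path, host, port), timeout=10) as response:
        return json.load(response)
```

With a server running, get_json("/v1/health") and get_json("/v1/models") would query the first two endpoints in the list.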

The streaming transport is SSE-based, and server-side sessions are currently in-memory only. The OpenAI-compatible /v1/responses surface supports custom function tools, tool_choice, function_call_output chaining, and typed function-call SSE events. Response retrieval and deletion are available when a response-store backend is enabled.

See Local Server API for the complete HTTP surface and CLI ollm serve for command-specific usage.
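Because the streaming endpoints use SSE, a client needs to split the response into events. The following sketch parses the generic Server-Sent Events wire format (event and data fields, blank-line terminators, colon-prefixed comments). It does not assume anything about oLLM's event names or payloads; the sample stream is invented for illustration.

```python
from typing import Dict, Iterable, Iterator, List


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {"event", "data"} dict per SSE event in a stream of text lines."""
    event = "message"  # default event type in the SSE format
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is stripped
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


# Invented sample stream; real event names and payloads may differ.
sample = ["event: token\n", "data: Hel\n", "\n", ": keep-alive\n", "data: lo\n", "\n"]
events = list(iter_sse_events(sample))
```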

Model references

--model accepts opaque model references. Supported forms include:

  • built-in aliases such as llama3-1B-chat and gemma3-12B
  • Hugging Face repo IDs such as Qwen/Qwen2.5-7B-Instruct
  • local model directories

Provider-prefixed references are rejected so execution stays inside oLLM's local runtime boundary.
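The three accepted forms and the rejection rule can be sketched as a small classifier. This is a rough illustration under stated assumptions, not oLLM's resolver: classify_reference is hypothetical, the provider-prefix syntax is assumed to look like "provider:name", and any string that is not a directory or a repo ID is treated as a candidate alias without checking a catalog.

```python
import re
from pathlib import Path

# Assumption: a provider prefix is a word of two or more characters followed
# by a colon. Requiring two characters avoids matching drive letters like "C:".
_PROVIDER_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9_-]+:")
_HF_REPO_ID = re.compile(r"[\w.-]+/[\w.-]+")


def classify_reference(reference: str) -> str:
    """Return a rough category for a model reference, or raise if rejected."""
    if _PROVIDER_PREFIX.match(reference):
        raise ValueError(f"provider-prefixed reference rejected: {reference!r}")
    if Path(reference).is_dir():
        return "local-directory"
    if _HF_REPO_ID.fullmatch(reference):
        return "huggingface-repo"
    return "alias"  # a real resolver would verify this against built-in aliases
```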

Support levels

oLLM reports one of three active support levels for a resolved model reference:

  • optimized — a native specialization provider can run the reference
  • generic — the Transformers-backed generic runtime can run the reference
  • unsupported — the reference resolves, but the current runtime cannot execute it
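One way to read the three levels is as an ordered preference: a native provider first, then the generic runtime, otherwise unsupported. The sketch below encodes that reading. The enum values come from the list above, while SupportLevel, support_level, and the two boolean inputs are hypothetical names for illustration.

```python
from enum import Enum


class SupportLevel(Enum):
    OPTIMIZED = "optimized"
    GENERIC = "generic"
    UNSUPPORTED = "unsupported"


def support_level(has_native_provider: bool, generic_compatible: bool) -> SupportLevel:
    """Pick the level for a reference that has already been resolved."""
    if has_native_provider:
        return SupportLevel.OPTIMIZED
    if generic_compatible:
        return SupportLevel.GENERIC
    return SupportLevel.UNSUPPORTED
```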

Discovery and availability terms

ollm models list is a discovery view. It combines:

  • built-in entries shipped by oLLM
  • discovered-local entries found under --models-dir

Availability for local references uses:

  • materialized
  • not-materialized

ollm models list --installed filters to materialized local entries only.
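The discovery terms can be modeled as two independent attributes on each entry. The sketch below is a hypothetical data model, not oLLM's: ModelEntry and installed_only are invented names, and the filter assumes that --installed keeps every entry whose availability is materialized.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ModelEntry:
    reference: str
    source: str        # "built-in" or "discovered-local"
    availability: str  # "materialized" or "not-materialized"


def installed_only(entries: Iterable[ModelEntry]) -> List[ModelEntry]:
    """Keep entries whose weights are present locally (materialized)."""
    return [entry for entry in entries if entry.availability == "materialized"]


# Illustrative catalog; the second reference is a made-up local directory name.
catalog = [
    ModelEntry("llama3-1B-chat", "built-in", "not-materialized"),
    ModelEntry("my-local-model", "discovered-local", "materialized"),
]
```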

Generic and optimized execution

The generic execution path covers compatible Transformers-backed models, whether local or materialized, of the following kinds:

  • causal language models
  • encoder-decoder text generation models
  • image-text conditional generation models that expose a processor-backed vision_config

When the resolved model matches a native family specialization (llama, gemma3, qwen3-next, gpt-oss, or voxtral), oLLM records an optimized-native specialization provider in the runtime plan and selects it from there, rather than hard-coding model-family branches in Inference.load_model().
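The difference between plan-driven selection and hard-coded branches can be shown with a lookup table. This is a simplified sketch and not oLLM's code: the provider identifiers, the plan dictionary keys, and the build_plan function are hypothetical, and how a model family is detected is outside its scope. Only the family names and the transformers-generic backend name come from this page.

```python
from typing import Any, Dict

# Hypothetical provider identifiers keyed by the documented native families.
SPECIALIZATION_PROVIDERS: Dict[str, str] = {
    "llama": "llama-native",
    "gemma3": "gemma3-native",
    "qwen3-next": "qwen3-next-native",
    "gpt-oss": "gpt-oss-native",
    "voxtral": "voxtral-native",
}


def build_plan(model_family: str, allow_specialization: bool = True) -> Dict[str, Any]:
    """Record the chosen provider in a plan instead of branching per family."""
    provider = SPECIALIZATION_PROVIDERS.get(model_family) if allow_specialization else None
    if provider is not None:
        return {"backend": "optimized-native", "specialization_provider": provider}
    return {"backend": "transformers-generic", "specialization_provider": None}
```

Adding a family then means adding a table entry, and a loader only has to read the plan; passing allow_specialization=False models the effect of --no-specialization.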

Specialization visibility and fallback

Planning-only surfaces such as ollm doctor and ollm models info --json expose the resolved backend, specialization state, and planned specialization pass ids without loading a runtime.

Execution surfaces follow the finalized runtime plan. Prompt response metadata includes:

  • execution backend
  • specialization state
  • execution device type for optimized-native runs
  • specialization device profile for optimized-native runs
  • applied specialization pass ids
  • any recorded fallback reason

If an optimized specialization cannot satisfy its planned pass contract and a compatible generic path exists, oLLM falls back to transformers-generic and records the fallback reason rather than reporting the optimized path as successful.
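The fallback rule can be sketched as a check of planned against applied passes. The code below is a hypothetical model of that rule: RunMetadata, finalize_run, the specialization state strings, and the pass ids in the usage example are all invented for illustration, and only the transformers-generic backend name is taken from this page.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RunMetadata:
    execution_backend: str
    specialization_state: str
    applied_pass_ids: List[str] = field(default_factory=list)
    fallback_reason: Optional[str] = None


def finalize_run(
    planned_pass_ids: List[str],
    applied_pass_ids: List[str],
    generic_available: bool,
) -> RunMetadata:
    """Keep the optimized result only if every planned pass was applied."""
    missing = [p for p in planned_pass_ids if p not in applied_pass_ids]
    if not missing:
        return RunMetadata("optimized-native", "applied", list(applied_pass_ids))
    reason = "planned passes not applied: " + ", ".join(missing)
    if generic_available:
        # Fall back and record why, instead of claiming the optimized path ran.
        return RunMetadata("transformers-generic", "fallback", [], reason)
    raise RuntimeError(reason)


ok = finalize_run(["pass-a", "pass-b"], ["pass-a", "pass-b"], generic_available=True)
fell_back = finalize_run(["pass-a", "pass-b"], ["pass-a"], generic_available=True)
```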