Local Server API
The optional server extra exposes a local-only FastAPI application on top of the same planning and execution stack used by the CLI.
Start the server
uv sync --extra server
ollm serve
The default bind is 127.0.0.1:8000.
Request flow
flowchart LR
A[HTTP client] --> B[FastAPI route]
B --> C[ApplicationService]
C --> D[RuntimeLoader plan/load]
D --> E[RuntimeExecutor]
B --> F[Session store]
B --> G[OpenAI response store]
The server is deliberately thin. It reuses the same planning and execution stack as the CLI, then layers HTTP transport, session state, and optional Responses storage on top.
OpenAPI surfaces
The local server publishes:
- /openapi.json for machine-readable schema access
- /docs for Swagger UI
- /redoc for ReDoc
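The schema endpoint can also be read programmatically. The following is a minimal sketch, assuming the server is running on the default bind and that /openapi.json has the standard OpenAPI `paths` layout that FastAPI generates; `list_routes` and `fetch_schema` are illustrative helper names, not part of oLLM.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # default bind described on this page

# Keys under an OpenAPI path item that represent HTTP operations.
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_routes(schema: dict) -> list[str]:
    """Return sorted 'METHOD /path' strings from an OpenAPI schema dict."""
    routes = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items may also carry non-operation keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                routes.append(f"{method.upper()} {path}")
    return sorted(routes)


def fetch_schema(base_url: str = BASE_URL) -> dict:
    """Download the machine-readable schema from a running local server."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)
```

With the server running, `print("\n".join(list_routes(fetch_schema())))` should list the routes described in the tables on this page.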
Route groups
OpenAI-compatible routes
| Route | Purpose |
|---|---|
| GET /v1/models | List model IDs in OpenAI-compatible shape. |
| GET /v1/models/{model_id} | Inspect one model in OpenAI-compatible shape. |
| POST /v1/chat/completions | Run text chat completions. |
| POST /v1/responses | Run the OpenAI-compatible Responses surface. |
| GET /v1/responses/{response_id} | Retrieve a stored response when response storage is enabled. |
| DELETE /v1/responses/{response_id} | Delete a stored response when response storage is enabled. |
Native oLLM routes
| Route | Purpose |
|---|---|
| GET /v1/health | Basic service metadata and mode health. |
| GET /v1/ollm/models | Native model discovery and filtering. |
| GET /v1/ollm/models/{model_reference} | Native model inspection with runtime details. |
| POST /v1/plan | Runtime planning without prompt execution. |
| POST /v1/prompt | One-shot native prompt execution. |
| POST /v1/prompt/stream | Native SSE prompt streaming. |
| POST /v1/sessions | Create an in-memory session. |
| GET /v1/sessions/{session_id} | Inspect a session transcript. |
| POST /v1/sessions/{session_id}/prompt | Prompt within a session. |
| POST /v1/sessions/{session_id}/prompt/stream | Stream a session prompt. |
Behavior notes
- The bind is local-only by default.
- The OpenAI-compatible surface currently covers model discovery, text chat completions, and text responses.
- Chat-completions requests currently support plain string content and structured text-part arrays only.
- Responses requests support plain string input, message arrays with text/image/audio and file-reference content parts, function_call_output tool-result items, and custom type=function tools with tool_choice.
- OpenAI-compatible chat streaming uses text/event-stream with chat-completion chunks and a final data: [DONE] marker.
- Responses streaming uses typed SSE events such as response.created, response.in_progress, response.output_item.added, response.content_part.added, response.output_text.delta, response.output_text.done, response.content_part.done, response.function_call_arguments.delta, response.function_call_arguments.done, response.output_item.done, response.completed, and response.failed.
- Native prompt streaming continues to use the oLLM-specific SSE event shape.
- Server-side sessions are in-memory only in the current slice.
- Responses storage is disabled by default. Configure a response-store backend if you want GET /v1/responses/{response_id}, DELETE /v1/responses/{response_id}, or previous_response_id chaining.
- Runtime and generation defaults still follow the standard config layering contract for native endpoints.
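The chat streaming format noted above can be consumed line by line. The following is a minimal sketch, assuming each chunk follows the usual OpenAI chat-completion chunk layout (`choices[].delta.content`); that field layout is an assumption not confirmed by this page, and `iter_chat_deltas` is an illustrative name.

```python
import json
from typing import Iterable, Iterator


def iter_chat_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from chat-completion SSE lines until data: [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank event separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # final marker: the stream is complete
        chunk = json.loads(payload)
        # Assumed OpenAI-style chunk shape: choices[].delta.content
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Any iterable of decoded lines works as input, for example the decoded lines of a streaming HTTP response body.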
Example requests
curl http://127.0.0.1:8000/v1/health
curl http://127.0.0.1:8000/v1/models
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "content-type: application/json" \
-d '{
"model": "llama3-1B-chat",
"messages": [{"role": "user", "content": "List three planets"}]
}'
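The same chat-completions request can be issued from Python with only the standard library. This is a sketch rather than an official client: `build_chat_request` and `send_chat` are illustrative names, and reading the reply from `choices[0].message.content` assumes the server mirrors the OpenAI chat-completion response shape.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:8000"  # default bind described on this page


def build_chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> Request:
    """Build the POST request equivalent to the curl example above."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: Request, timeout: float = 120.0) -> str:
    """Send the request and return the assistant text (needs a running server)."""
    with urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # Assumed OpenAI-compatible shape: choices[0].message.content
    return reply["choices"][0]["message"]["content"]
```

With the server running, `send_chat(build_chat_request("llama3-1B-chat", "List three planets"))` should return the generated text.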
curl -N -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "content-type: application/json" \
-d '{
"model": "llama3-1B-chat",
"stream": true,
"messages": [{"role": "user", "content": "List three planets"}]
}'
curl -X POST http://127.0.0.1:8000/v1/responses \
-H "content-type: application/json" \
-d '{
"model": "llama3-1B-chat",
"instructions": "Be brief.",
"input": "List three planets"
}'
curl -N -X POST http://127.0.0.1:8000/v1/responses \
-H "content-type: application/json" \
-d '{
"model": "llama3-1B-chat",
"stream": true,
"input": "List three planets"
}'
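A streamed Responses reply arrives as typed SSE events rather than chat chunks. The sketch below accumulates the output text from such a stream, assuming each `data:` payload is a JSON object whose `type` field carries the event name and whose `delta` field carries the text for response.output_text.delta events, as in the OpenAI Responses API; both field names are assumptions here, and `collect_response_text` is an illustrative name.

```python
import json
from typing import Iterable


def collect_response_text(lines: Iterable[str]) -> str:
    """Concatenate output-text deltas from a Responses SSE stream."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        event = json.loads(line[len("data:"):].strip())
        kind = event.get("type")  # assumed: event name repeated in the payload
        if kind == "response.output_text.delta":
            parts.append(event.get("delta", ""))
        elif kind == "response.failed":
            raise RuntimeError(f"response failed: {event}")
        elif kind == "response.completed":
            break  # terminal event: nothing further to read
    return "".join(parts)
```

Other event types, such as response.created or response.output_item.added, are ignored by this sketch; a fuller client would likely track them to rebuild the complete output item list.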
curl -X POST http://127.0.0.1:8000/v1/responses \
-H "content-type: application/json" \
-d '{
"model": "llama3-1B-chat",
"input": "What is the weather in Paris?",
"tools": [{
"type": "function",
"name": "get_weather",
"description": "Look up current weather",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}]
}'
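When the model decides to call the tool, the client is expected to run it and send the result back as a function_call_output item. The sketch below shows one way to do that conversion, assuming the response's output items follow the OpenAI Responses layout (`type`, `call_id`, `name`, and a JSON-encoded `arguments` string); those field names are assumptions, and both `run_tool_calls` and the `get_weather` stub are hypothetical.

```python
import json
from typing import Callable


def get_weather(city: str) -> dict:
    """Hypothetical stand-in for a real weather lookup."""
    return {"city": city, "forecast": "unknown"}


def run_tool_calls(output_items: list[dict], handlers: dict[str, Callable]) -> list[dict]:
    """Turn function_call output items into function_call_output input items."""
    results = []
    for item in output_items:
        if item.get("type") != "function_call":
            continue  # skip plain message items
        handler = handlers[item["name"]]
        arguments = json.loads(item["arguments"])  # arguments arrive as a JSON string
        results.append({
            "type": "function_call_output",
            "call_id": item["call_id"],  # ties the result to the original call
            "output": json.dumps(handler(**arguments)),
        })
    return results
```

The returned items would go into the input array of a follow-up POST /v1/responses request. Linking that request to the first one with previous_response_id requires response storage to be enabled, as described in the behavior notes.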
curl -X POST http://127.0.0.1:8000/v1/plan \
-H "content-type: application/json" \
-d '{"runtime":{"model_reference":"llama3-1B-chat"}}'
Use examples/ollm.toml from the repository root as a starting point when you
want shared CLI and server defaults.