oLLM

oLLM is a Python library and terminal interface for local LLM inference. It combines:

  • optimized-native runtimes for built-in aliases when a specialization matches
  • a generic Transformers-backed path for compatible local or materialized models
  • runtime inspection so you can see which backend will run, why it was selected, and what the current support level is

Audience

  • operators and end users who want to run prompts and inspect local models
  • Python developers who want to embed oLLM through RuntimeClient or the low-level optimized-native helpers
  • contributors who need architecture, verification, and docs-build guidance

Documentation map

  • Getting Started
  • User Guide
  • CLI Reference
  • Library and API
  • Architecture and Contributing

Core concepts

Model references

--model accepts opaque model references, not just a fixed built-in list. Supported forms include:

  • a built-in alias such as llama3-1B-chat
  • a Hugging Face repository ID such as Qwen/Qwen2.5-7B-Instruct
  • a local model directory

Support levels

oLLM reports one of three support levels for any model reference:

  • optimized — a built-in specialization matches the reference
  • generic — the Transformers-backed path can handle a compatible local or materialized model
  • unsupported — the reference cannot be run, and oLLM reports an explicit reason
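The resolution from model reference to support level can be pictured with a small sketch. Everything here is illustrative: the alias names, the helper `support_level`, and the repo-ID pattern are assumptions, not oLLM's actual selection logic.

```python
import re
from pathlib import Path

# Hypothetical alias table; oLLM's real set of built-in aliases may differ.
BUILTIN_ALIASES = {"llama3-1B-chat", "llama3-8B-chat", "gemma3-12B"}

def support_level(ref: str) -> str:
    """Map a model reference to a support level (illustrative only)."""
    if ref in BUILTIN_ALIASES:
        return "optimized"    # a built-in specialization matches the alias
    if Path(ref).is_dir():
        return "generic"      # local model directory
    if re.fullmatch(r"[\w.-]+/[\w.-]+", ref):
        return "generic"      # shaped like a Hugging Face repo ID
    return "unsupported"      # nothing can run this reference

print(support_level("llama3-1B-chat"))            # optimized
print(support_level("Qwen/Qwen2.5-7B-Instruct"))  # generic
```

The point of the sketch is the ordering: a specialization wins when it matches, the generic path catches compatible references, and everything else is reported as unsupported rather than guessed at.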

Safety model

oLLM intentionally stays conservative in several places:

  • the generic runtime only loads local or materialized weights from safetensors
  • unsupported references fail with an explicit reason instead of silently leaving the local runtime boundary
  • planning and execution report specialization and fallback state explicitly
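The first two safety points can be sketched as a pre-load guard. The function name `check_loadable` is hypothetical; this only illustrates the policy of failing with an explicit reason instead of silently loading something else.

```python
from pathlib import Path

def check_loadable(model_dir: str) -> None:
    """Refuse anything that is not local safetensors weights (illustrative).

    Mirrors the conservative policy above: raise with an explicit reason
    rather than silently leaving the local runtime boundary.
    """
    path = Path(model_dir)
    if not path.is_dir():
        raise ValueError(f"unsupported: {model_dir!r} is not a local directory")
    if not any(path.glob("*.safetensors")):
        raise ValueError(f"unsupported: no safetensors weights in {model_dir!r}")
```

A caller that hits either `ValueError` gets the reason in the message, which is the behavior the bullets above describe for unsupported references.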

Quick examples

ollm prompt --model llama3-8B-chat "Summarize this file"
ollm prompt --model gemma3-12B --multimodal --image ./diagram.png "Describe this image"
ollm doctor --json
ollm models list

Examples

  • examples/example.py
  • examples/example_image.py
  • examples/example_audio.py