# oLLM
oLLM is a Python library and terminal interface for local LLM inference. It combines:
- optimized-native runtimes for built-in aliases when a specialization matches
- a generic Transformers-backed path for compatible local or materialized models
- runtime inspection so you can see which backend will run, why it was selected, and what the current support level is
## Audience
- operators and end users who want to run prompts and inspect local models
- Python developers who want to embed oLLM through `RuntimeClient` or the low-level optimized-native helpers
- contributors who need architecture, verification, and docs-build guidance
## Documentation map
### Getting Started
### User Guide
- Terminal Interface
- Local Server API
- Model References
- Multimodal Workflows
- Runtime Planning and Inspection
- Model Discovery
- Optimization Guide
- Troubleshooting
- Benchmarking
### CLI Reference
### Library and API
- High-level Client
- Runtime Configuration
- Optimized-native Helpers
- API Reference
- Local Server API Reference
### Architecture and Contributing
## Core concepts
### Model references

`--model` accepts opaque model references, not just a fixed built-in list. Supported forms include:
- a built-in alias such as `llama3-1B-chat`
- a Hugging Face repository ID such as `Qwen/Qwen2.5-7B-Instruct`
- a local model directory
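As an illustration of how these three forms can be told apart, here is a minimal sketch in plain Python. The function name `classify_model_ref`, the alias set, and the repo-ID pattern are assumptions made for this example, not oLLM's actual internals.

```python
import os
import re

# Hypothetical alias table for illustration only; oLLM ships its own
# built-in alias list.
BUILT_IN_ALIASES = {"llama3-1B-chat", "llama3-8B-chat", "gemma3-12B"}

# Hugging Face repository IDs look like "<org>/<name>" with a
# restricted character set.
HF_REPO_RE = re.compile(r"^[\w.-]+/[\w.-]+$")

def classify_model_ref(ref: str) -> str:
    """Return one of: 'alias', 'local_dir', 'hf_repo', 'unknown'."""
    if ref in BUILT_IN_ALIASES:
        return "alias"
    if os.path.isdir(ref):
        return "local_dir"
    if HF_REPO_RE.match(ref):
        return "hf_repo"
    return "unknown"
```

Checking the forms in this order (alias, then directory, then repo ID) keeps a local path that happens to look like `org/name` from being mistaken for a remote repository.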
### Support levels

oLLM reports one of three active support levels:
- `optimized`: a built-in alias matched an optimized-native specialization
- `generic`: the reference runs on the Transformers-backed fallback path
- `unsupported`: the reference cannot run and fails with an explicit reason
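The selection between these levels can be sketched as a simple precedence check. `support_level`, its parameters, and the alias table below are hypothetical names for illustration, not part of oLLM's API.

```python
# Hypothetical table of aliases that have a native specialization.
OPTIMIZED_ALIASES = {"llama3-1B-chat", "llama3-8B-chat", "gemma3-12B"}

def support_level(ref: str, has_local_weights: bool) -> str:
    """Return 'optimized', 'generic', or 'unsupported' for a model reference."""
    if ref in OPTIMIZED_ALIASES:
        # A built-in alias with a matching optimized-native specialization.
        return "optimized"
    if has_local_weights:
        # Compatible local or materialized weights run on the
        # Transformers-backed generic path.
        return "generic"
    # Anything else fails with an explicit reason rather than silently
    # leaving the local runtime boundary.
    return "unsupported"
```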
### Safety model

oLLM intentionally stays conservative in several places:
- the generic runtime only loads local or materialized weights from `safetensors`
- unsupported references fail with an explicit reason instead of silently leaving the local runtime boundary
- planning and execution report specialization and fallback state explicitly
## Quick examples

```shell
ollm prompt --model llama3-8B-chat "Summarize this file"
ollm prompt --model gemma3-12B --multimodal --image ./diagram.png "Describe this image"
ollm doctor --json
ollm models list
```
## Examples

- examples/example.py
- examples/example_image.py
- examples/example_audio.py