Architecture Overview

oLLM is organized around a resolver-and-plan model instead of a fixed allowlist gate. The same internal flow powers the CLI, the Python library, and the local server surface.

High-level flow

flowchart LR
    A[User input<br/>CLI / Python / Server] --> B[ModelReference.parse]
    B --> C[ModelResolver]
    C --> D[Capability discovery]
    D --> E[BackendSelector]
    E --> F[RuntimeLoader.plan]
    F --> G[Strategy selector]
    G --> H{Inspect only?}
    H -->|Yes| I[RuntimePlan / JSON inspection]
    H -->|No| J[RuntimeLoader.load]
    J --> K[ApplicationService / RuntimeClient]
    K --> L[RuntimeExecutor]

1. Parse a user-facing model reference.
2. Resolve it into a normalized `ResolvedModel`.
3. Discover capability and support metadata.
4. Select a backend and support level.
5. Refine the runtime plan through backend-specific probes.
6. Apply deterministic pre-run strategy selection for KV behavior.
7. Either expose the plan as inspection output or load the runtime.
8. Execute through the shared application/runtime stack.
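
The steps above can be sketched as a small staged pipeline that builds a plan first and only loads when asked to. This is a minimal illustration, not oLLM's actual API: every class, function, field, alias, support level, and KV strategy name below is a hypothetical stand-in, and the selection rules are invented for the example. It also compresses capability discovery, backend selection, and strategy selection into a single `plan` function.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field

# Invented alias table mapping an alias to an architecture (assumption).
KNOWN_ALIASES = {"tiny-demo": "llama"}


@dataclass(frozen=True)
class ToyReference:
    """Hypothetical stand-in for a parsed model reference."""
    raw: str
    kind: str  # "alias", "local-path", or "hub-id" (labels assumed here)


@dataclass(frozen=True)
class ToyResolvedModel:
    """Hypothetical normalized model with minimal capability metadata."""
    reference: ToyReference
    architecture: str | None


@dataclass
class ToyPlan:
    """Hypothetical plan: what would run and why, without loading weights."""
    model: str
    backend: str
    support_level: str
    kv_strategy: str
    reasons: list[str] = field(default_factory=list)


def parse_reference(raw: str) -> ToyReference:
    # Step 1: classify the opaque user input.
    if raw in KNOWN_ALIASES:
        return ToyReference(raw, "alias")
    if raw.startswith(("/", "./", "~")):
        return ToyReference(raw, "local-path")
    return ToyReference(raw, "hub-id")


def resolve(ref: ToyReference) -> ToyResolvedModel:
    # Steps 2-3: attach whatever metadata is known up front.
    return ToyResolvedModel(ref, KNOWN_ALIASES.get(ref.raw))


def plan(resolved: ToyResolvedModel) -> ToyPlan:
    # Steps 4-6: pick a backend, a support level, and a KV strategy
    # deterministically from the resolved metadata.
    if resolved.architecture is not None:
        backend, level, why = "optimized-native", "optimized", "architecture is known"
    else:
        backend, level, why = "generic", "generic", "no built-in metadata"
    kv_strategy = "paged" if level == "optimized" else "default"
    return ToyPlan(resolved.reference.raw, backend, level, kv_strategy, [why])


def load_and_execute(p: ToyPlan) -> str:
    # Steps 7-8 (load branch): placeholder for loading and running.
    return f"loaded {p.model} on {p.backend}"


def run(raw: str, inspect_only: bool) -> str:
    p = plan(resolve(parse_reference(raw)))
    if inspect_only:
        # Step 7 (inspect branch): expose the plan as JSON, load nothing.
        return json.dumps(asdict(p), sort_keys=True)
    return load_and_execute(p)
```

Calling `run("tiny-demo", inspect_only=True)` returns the plan as a JSON string without ever reaching `load_and_execute`, which is the property the inspect-only branch in the diagram relies on.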

Main subsystems

| Subsystem | Responsibility |
| --- | --- |
| `ModelReference` | Parses opaque model input such as aliases, Hugging Face IDs, and local paths. |
| `ModelResolver` | Normalizes a reference into a `ResolvedModel` with capability and source metadata. |
| Capability discovery | Inspects local model artifacts when built-in metadata is not enough. |
| `BackendSelector` | Chooses the most truthful backend and support level for the current runtime config. |
| `RuntimePlan` | Carries the resolved backend choice, specialization state, and inspection details. |
| `RuntimeLoader` | Materializes models when needed, plans execution, and loads the backend runtime. |
| `ApplicationService` | Provides the shared control-plane surface used by CLI and server transports. |
| `RuntimeExecutor` | Executes prompt requests once a runtime has been loaded. |
| Specialization registry | Matches and applies optimized-native providers and passes. |
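
One way to picture the specialization registry row is as an ordered list of matcher and provider pairs in which the first match wins. This is a sketch under assumptions: `ToyRegistry`, `register`, `match`, the first-match rule, and the architecture and provider strings are all hypothetical and do not reflect oLLM's real interfaces.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyRegistry:
    """Hypothetical registry of (matcher, provider name) pairs."""
    _entries: list[tuple[Callable[[str], bool], str]] = field(default_factory=list)

    def register(self, matcher: Callable[[str], bool], provider: str) -> None:
        # Registration order doubles as priority order in this sketch.
        self._entries.append((matcher, provider))

    def match(self, architecture: str) -> str | None:
        # Return the first provider whose matcher accepts the architecture,
        # or None so the caller can fall back to a generic backend.
        for matcher, provider in self._entries:
            if matcher(architecture):
                return provider
        return None


registry = ToyRegistry()
registry.register(lambda arch: arch.startswith("llama"), "llama-native")
registry.register(lambda arch: arch == "gemma", "gemma-native")

print(registry.match("llama3"))   # llama-native
print(registry.match("unknown"))  # None
```

Returning `None` rather than raising keeps a failed match in the planning stage, where it can be reported as a reason for falling back to a generic backend.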

Why this shape exists

This layering keeps model resolution, planning, loading, and execution separate. That lets oLLM answer "what would run and why?" before it loads weights, which is why `ollm prompt --plan-json`, `ollm models info`, and the server planning surfaces can stay truthful without forcing a full runtime load.