Optimized-native Helpers
Inference and AutoInference remain available for direct optimized-native control.
When to use them
Use these helpers when you specifically want:
- direct control over the optimized-native path
- explicit CPU/GPU offload operations through the specialization provider
- direct access to the native model/tokenizer/processor objects
When not to use them
For new high-level application code, prefer RuntimeClient, because it uses the same resolver, backend selection, and plan inspection model as the CLI.
Typical Inference example
```python
from ollm import Inference, TextStreamer

o = Inference("llama3-1B-chat", device="cuda:0", logging=True)
o.ini_model(models_dir="./models/", force_download=False)
o.offload_layers_to_cpu(layers_num=2, policy="middle-band")

past_key_values = o.DiskCache(
    cache_dir="./kv_cache/",
    cache_strategy="streamed-segmented",
)
text_streamer = TextStreamer(o.tokenizer, skip_prompt=True, skip_special_tokens=False)
```
When the selected runtime uses kv_cache_strategy="resident", it keeps
full-history KV entirely in memory and does not initialize any disk-KV path.
When disk-backed strategies are selected, the default path uses an explicit
disk-KV store under cache_dir/kv_cache_chunked, with explicit
dtype/shape/sequence metadata and raw payloads instead of pickle-backed torch
artifacts.
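The "explicit dtype/shape/sequence metadata plus raw payloads" layout can be sketched roughly as below. This is a minimal illustration of the idea, not the library's actual on-disk format; the file names and helper functions are assumptions.

```python
import json
from pathlib import Path

import numpy as np


def write_kv_entry(cache_dir, name, tensor, seq_len):
    """Persist one KV entry as a raw byte payload plus a JSON metadata
    sidecar (dtype/shape/sequence length) instead of a pickled artifact."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / f"{name}.bin").write_bytes(tensor.tobytes())
    meta = {"dtype": str(tensor.dtype), "shape": list(tensor.shape), "seq_len": seq_len}
    (cache_dir / f"{name}.json").write_text(json.dumps(meta))


def read_kv_entry(cache_dir, name):
    """Rebuild the tensor from the raw payload using the explicit metadata."""
    cache_dir = Path(cache_dir)
    meta = json.loads((cache_dir / f"{name}.json").read_text())
    raw = (cache_dir / f"{name}.bin").read_bytes()
    arr = np.frombuffer(raw, dtype=meta["dtype"]).reshape(meta["shape"])
    return arr, meta["seq_len"]
```

Because the dtype and shape live in plain metadata rather than inside a pickle, the payload can be validated and loaded without executing arbitrary deserialization code.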
- kv_cache_strategy="paged" writes to cache_dir/kv_cache_paged and uses an explicit page table.
- kv_cache_strategy="streamed-segmented" writes to cache_dir/kv_cache_streamed_segmented.
- kv_cache_strategy="log-structured-journal" writes to cache_dir/kv_cache_log_structured_journal and compacts journal metadata deterministically once the configured entry threshold is reached.
- kv_cache_strategy="sliding-window-ring-buffer" writes to cache_dir/kv_cache_sliding_window_ring_buffer and retains only the most recent cache_window_tokens cached tokens; older history is evicted under a drop-oldest policy, so this mode deliberately changes runtime semantics.
- kv_cache_strategy="quantized-cold-tier" writes to cache_dir/kv_cache_quantized_cold_tier and persists colder KV entries in the explicit int8-symmetric-per-tensor representation while dequantizing back to the runtime dtype on load.
- kv_cache_strategy="tiered-write-back" writes the cold tier to cache_dir/kv_cache_tiered_write_back through a journal-backed append store while keeping a bounded hot tail in memory.
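The sliding-window-ring-buffer semantics (bounded window, drop-oldest eviction) can be sketched with a plain bounded deque. Only the option name cache_window_tokens comes from the docs; the class and its methods are illustrative.

```python
from collections import deque


class SlidingWindowKV:
    """Keeps only the most recent `window_tokens` cached entries;
    appending past capacity silently evicts the oldest (drop-oldest)."""

    def __init__(self, window_tokens):
        self.buf = deque(maxlen=window_tokens)

    def append(self, token_kv):
        # deque with maxlen drops the oldest item once the buffer is full
        self.buf.append(token_kv)

    def tokens(self):
        return list(self.buf)


cache = SlidingWindowKV(window_tokens=4)
for t in range(6):
    cache.append(t)
# entries 0 and 1 have been evicted; only the last 4 remain
```

This is why the mode "deliberately changes runtime semantics": history outside the window is simply gone, unlike the other strategies, which spill rather than discard.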
The active runtime then applies a platform/resource-aware buffering or spill
policy on top of the selected store. Inference.DiskCache() accepts the same
switch through cache_strategy=....
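The int8-symmetric-per-tensor representation mentioned for the cold tier amounts to one scale per tensor with a zero-point fixed at zero. A minimal sketch of that scheme (the function names are illustrative, not the library's API):

```python
import numpy as np


def quantize_int8_symmetric(x):
    """Per-tensor symmetric int8: a single scale for the whole tensor,
    derived from its absolute maximum; zero-point is implicitly 0."""
    amax = float(np.abs(x).max())
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8_symmetric(q, scale, dtype=np.float32):
    """Dequantize back to the runtime dtype on load."""
    return (q.astype(np.float32) * scale).astype(dtype)
```

Symmetric per-tensor quantization keeps the stored cold tier at one byte per element; the rounding error on load is bounded by half a quantization step (scale / 2).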
Inference.offload_layers_to_cpu() now also accepts an explicit
policy=... argument. The currently supported CPU placement policies are
auto, prefix, suffix, and middle-band, where auto resolves to
middle-band. The runtime currently rejects simultaneous CPU and GPU offload
requests instead of pretending mixed policy placement is supported.
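The source names the policies but does not spell out the index math, so the following is a sketch under stated assumptions: prefix takes the first layers, suffix the last, and middle-band a contiguous band centered in the stack. The helper itself is hypothetical.

```python
def cpu_offload_indices(total_layers, layers_num, policy="auto"):
    """Pick which decoder layers to place on CPU.
    Assumption: prefix = first N, suffix = last N,
    middle-band = contiguous centered band of N layers."""
    if policy == "auto":
        policy = "middle-band"  # per the docs, auto resolves to middle-band
    if policy == "prefix":
        return list(range(layers_num))
    if policy == "suffix":
        return list(range(total_layers - layers_num, total_layers))
    if policy == "middle-band":
        start = (total_layers - layers_num) // 2
        return list(range(start, start + layers_num))
    raise ValueError(f"unknown policy: {policy}")
```

Under this reading, offload_layers_to_cpu(layers_num=2, policy="middle-band") on a 12-layer model would move the two layers nearest the middle of the stack.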
Typical AutoInference example
```python
from ollm import AutoInference

o = AutoInference(
    "./models/gemma3-12B",
    adapter_dir="./myadapter/checkpoint-20",
    device="cuda:0",
    multimodality=False,
    logging=True,
)
```