
Optimized-native Helpers

Inference and AutoInference remain available for direct optimized-native control.

When to use them

Use these helpers when you specifically want:

  • direct control over the optimized-native path
  • explicit CPU/GPU offload operations through the specialization provider
  • direct access to the native model/tokenizer/processor objects

When not to use them

For new high-level application code, prefer RuntimeClient, because it uses the same resolver, backend selection, and plan inspection model as the CLI.

Typical Inference example

from ollm import Inference, TextStreamer

o = Inference("llama3-1B-chat", device="cuda:0", logging=True)
o.ini_model(models_dir="./models/", force_download=False)  # load weights, downloading only if missing
o.offload_layers_to_cpu(layers_num=2, policy="middle-band")  # place two layers on CPU
past_key_values = o.DiskCache(
    cache_dir="./kv_cache/",
    cache_strategy="streamed-segmented",  # one of the strategies listed below
)
text_streamer = TextStreamer(o.tokenizer, skip_prompt=True, skip_special_tokens=False)

When the selected runtime uses kv_cache_strategy="resident", it keeps full-history KV entirely in memory and does not initialize any disk-KV path. When disk-backed strategies are selected, the default path uses an explicit disk-KV store under cache_dir/kv_cache_chunked, with explicit dtype/shape/sequence metadata and raw payloads instead of pickle-backed torch artifacts.
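The explicit-metadata layout can be illustrated with a minimal stdlib-only sketch. The file names (`meta.json`, `payload.bin`) and metadata field names here are illustrative assumptions, not the store's actual on-disk schema; the point is the idea stated above: dtype/shape/sequence metadata next to raw bytes, with no pickle involved.

```python
import json
import struct
from pathlib import Path

def write_kv_entry(cache_dir, name, values, dtype="f", shape=(1,), seq_len=0):
    """Persist one KV entry as explicit metadata (JSON) plus a raw payload.

    Illustrative only: field and file names are assumptions, but the layout
    mirrors the text above -- explicit dtype/shape/sequence metadata and raw
    bytes instead of pickle-backed torch artifacts.
    """
    entry = Path(cache_dir) / name
    entry.mkdir(parents=True, exist_ok=True)
    meta = {"dtype": dtype, "shape": list(shape), "seq_len": seq_len}
    (entry / "meta.json").write_text(json.dumps(meta))
    (entry / "payload.bin").write_bytes(struct.pack(f"{len(values)}{dtype}", *values))

def read_kv_entry(cache_dir, name):
    """Load an entry back: parse the metadata, then unpack the raw payload."""
    entry = Path(cache_dir) / name
    meta = json.loads((entry / "meta.json").read_text())
    raw = (entry / "payload.bin").read_bytes()
    count = len(raw) // struct.calcsize(meta["dtype"])
    return meta, list(struct.unpack(f"{count}{meta['dtype']}", raw))
```

Because the payload is raw bytes described by its own metadata, an entry can be validated or migrated without importing torch or trusting a pickle stream.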

  • kv_cache_strategy="paged" writes to cache_dir/kv_cache_paged and uses an explicit page table.
  • kv_cache_strategy="streamed-segmented" writes to cache_dir/kv_cache_streamed_segmented.
  • kv_cache_strategy="log-structured-journal" writes to cache_dir/kv_cache_log_structured_journal and compacts journal metadata deterministically once the configured entry threshold is reached.
  • kv_cache_strategy="sliding-window-ring-buffer" writes to cache_dir/kv_cache_sliding_window_ring_buffer and retains only the most recent cache_window_tokens cached tokens; older history is evicted under a drop-oldest policy, so this mode deliberately changes runtime semantics.
  • kv_cache_strategy="quantized-cold-tier" writes to cache_dir/kv_cache_quantized_cold_tier and persists colder KV entries in the explicit int8-symmetric-per-tensor representation while dequantizing back to the runtime dtype on load.
  • kv_cache_strategy="tiered-write-back" writes the cold tier to cache_dir/kv_cache_tiered_write_back through a journal-backed append store while keeping a bounded hot tail in memory.
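The sliding-window semantics called out above can be sketched as a drop-oldest ring buffer. This is a hand-rolled illustration, not the store's implementation; cache_window_tokens is the only name taken from the text.

```python
from collections import deque

class SlidingWindowKV:
    """Keep only the most recent `cache_window_tokens` cached tokens.

    Illustrative drop-oldest eviction: appending past capacity silently
    discards the oldest entries, which is why this mode deliberately
    changes runtime semantics -- evicted history is gone, not spilled.
    """
    def __init__(self, cache_window_tokens):
        self._buf = deque(maxlen=cache_window_tokens)

    def append(self, token_kv):
        self._buf.append(token_kv)  # deque evicts the oldest item when full

    def tokens(self):
        return list(self._buf)

win = SlidingWindowKV(cache_window_tokens=4)
for t in range(6):
    win.append(t)
# only the 4 most recent tokens (2..5) remain; 0 and 1 were dropped oldest-first
```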

The active runtime then applies a platform/resource-aware buffering or spill policy on top of the selected store. Inference.DiskCache() accepts the same switch through cache_strategy=....
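The int8-symmetric-per-tensor representation described for the quantized cold tier can be sketched numerically as follows. This is a from-scratch illustration of the scheme (one scale per tensor, symmetric around zero, dequantized on load), not the runtime's code.

```python
def quantize_int8_symmetric(values):
    """Quantize a flat list of floats to int8 codes with one scale per tensor."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8_symmetric(q, scale):
    """Map int8 codes back to floats (the runtime dtype) on load."""
    return [c * scale for c in q]

q, scale = quantize_int8_symmetric([0.0, 0.5, -1.0])
restored = dequantize_int8_symmetric(q, scale)
```

The symmetric variant needs no zero point, so each cold entry carries just its codes plus a single float scale; the cost is a small rounding error on reload, which is why only colder KV entries would be stored this way.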

Inference.offload_layers_to_cpu() now also accepts an explicit policy=... argument. The supported CPU placement policies are auto, prefix, suffix, and middle-band, where auto resolves to middle-band. The runtime currently rejects simultaneous CPU and GPU offload requests rather than silently accepting a mixed placement it does not support.
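A sketch of how these placement policies could resolve to layer indices. The policy names and the auto-to-middle-band resolution come from the text; the exact index arithmetic is an illustrative assumption.

```python
def select_cpu_layers(num_layers, layers_num, policy="auto"):
    """Pick which transformer layer indices a CPU placement policy targets.

    Illustrative only: "auto" resolving to "middle-band" matches the text
    above; the selection math itself is an assumption.
    """
    if policy == "auto":
        policy = "middle-band"
    if policy == "prefix":
        return list(range(layers_num))  # first N layers
    if policy == "suffix":
        return list(range(num_layers - layers_num, num_layers))  # last N layers
    if policy == "middle-band":
        start = (num_layers - layers_num) // 2  # centered band of N layers
        return list(range(start, start + layers_num))
    raise ValueError(f"unknown policy: {policy}")
```

For a 12-layer model with layers_num=2, prefix would offload layers 0-1, suffix layers 10-11, and middle-band (or auto) a band in the middle of the stack.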

Typical AutoInference example

from ollm import AutoInference

o = AutoInference(
    "./models/gemma3-12B",
    adapter_dir="./myadapter/checkpoint-20",
    device="cuda:0",
    multimodality=False,
    logging=True,
)