
Optimized-native Helpers

Inference and AutoInference remain available for direct optimized-native control.

When to use them

Use these helpers when you specifically want:

  • direct control over the optimized-native path
  • explicit CPU/GPU offload operations through the specialization provider
  • direct access to the native model/tokenizer/processor objects

When not to use them

For new high-level application code, prefer RuntimeClient, because it uses the same resolver, backend selection, and plan inspection model as the CLI.

Typical Inference example

from ollm import Inference, TextStreamer

o = Inference("llama3-1B-chat", device="cuda:0", logging=True)
o.ini_model(models_dir="./models/", force_download=False)  # load weights, downloading only if missing
o.offload_layers_to_cpu(layers_num=2, policy="middle-band")  # place two layers on CPU
past_key_values = o.DiskCache(
    cache_dir="./kv_cache/",
    cache_strategy="streamed-segmented",  # one of the strategies listed below
)
text_streamer = TextStreamer(o.tokenizer, skip_prompt=True, skip_special_tokens=False)

When the selected runtime uses kv_cache_strategy="resident", it keeps full-history KV entirely in memory and does not initialize any disk-KV path. When disk-backed strategies are selected, the default path uses an explicit disk-KV store under cache_dir/kv_cache_chunked, with explicit dtype/shape/sequence metadata and raw payloads instead of pickle-backed torch artifacts.
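The explicit-metadata layout can be illustrated with a minimal stdlib-only sketch. The file names (`meta.json`, `payload.bin`) and metadata field names here are illustrative assumptions, not the store's actual on-disk schema; the point is the idea stated above: dtype/shape/sequence metadata next to raw bytes, with no pickle involved.

```python
import json
import struct
from pathlib import Path

def write_kv_entry(cache_dir, name, values, dtype="f", shape=(1,), seq_len=0):
    """Persist one KV entry as explicit metadata (JSON) plus a raw payload.

    Illustrative only: field and file names are assumptions, but the layout
    mirrors the text above -- explicit dtype/shape/sequence metadata and raw
    bytes instead of pickle-backed torch artifacts.
    """
    entry = Path(cache_dir) / name
    entry.mkdir(parents=True, exist_ok=True)
    meta = {"dtype": dtype, "shape": list(shape), "seq_len": seq_len}
    (entry / "meta.json").write_text(json.dumps(meta))
    (entry / "payload.bin").write_bytes(struct.pack(f"{len(values)}{dtype}", *values))

def read_kv_entry(cache_dir, name):
    """Load an entry back: parse the metadata, then unpack the raw payload."""
    entry = Path(cache_dir) / name
    meta = json.loads((entry / "meta.json").read_text())
    raw = (entry / "payload.bin").read_bytes()
    count = len(raw) // struct.calcsize(meta["dtype"])
    return meta, list(struct.unpack(f"{count}{meta['dtype']}", raw))
```

Because the payload is raw bytes described by its own metadata, an entry can be validated or migrated without importing torch or trusting a pickle stream.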

  • kv_cache_strategy="paged" writes to cache_dir/kv_cache_paged and uses an explicit page table.
  • kv_cache_strategy="streamed-segmented" writes to cache_dir/kv_cache_streamed_segmented.
  • kv_cache_strategy="log-structured-journal" writes to cache_dir/kv_cache_log_structured_journal and compacts journal metadata deterministically once the configured entry threshold is reached.
  • kv_cache_strategy="sliding-window-ring-buffer" writes to cache_dir/kv_cache_sliding_window_ring_buffer and retains only the most recent cache_window_tokens cached tokens; older history is evicted under a drop-oldest policy, so this mode deliberately changes runtime semantics.
  • kv_cache_strategy="quantized-cold-tier" writes to cache_dir/kv_cache_quantized_cold_tier and persists colder KV entries in the explicit int8-symmetric-per-tensor representation while dequantizing back to the runtime dtype on load.
  • kv_cache_strategy="tiered-write-back" writes the cold tier to cache_dir/kv_cache_tiered_write_back through a journal-backed append store while keeping a bounded hot tail in memory.
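The sliding-window semantics called out above can be sketched as a drop-oldest ring buffer. This is a hand-rolled illustration, not the store's implementation; cache_window_tokens is the only name taken from the text.

```python
from collections import deque

class SlidingWindowKV:
    """Keep only the most recent `cache_window_tokens` cached tokens.

    Illustrative drop-oldest eviction: appending past capacity silently
    discards the oldest entries, which is why this mode deliberately
    changes runtime semantics -- evicted history is gone, not spilled.
    """
    def __init__(self, cache_window_tokens):
        self._buf = deque(maxlen=cache_window_tokens)

    def append(self, token_kv):
        self._buf.append(token_kv)  # deque evicts the oldest item when full

    def tokens(self):
        return list(self._buf)

win = SlidingWindowKV(cache_window_tokens=4)
for t in range(6):
    win.append(t)
# only the 4 most recent tokens (2..5) remain; 0 and 1 were dropped oldest-first
```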

The active runtime then applies a platform/resource-aware buffering or spill policy on top of the selected store. Inference.DiskCache() accepts the same switch through cache_strategy=....
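The int8-symmetric-per-tensor representation described for the quantized cold tier can be sketched numerically as follows. This is a from-scratch illustration of the scheme (one scale per tensor, symmetric around zero, dequantized on load), not the runtime's code.

```python
def quantize_int8_symmetric(values):
    """Quantize a flat list of floats to int8 codes with one scale per tensor."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8_symmetric(q, scale):
    """Map int8 codes back to floats (the runtime dtype) on load."""
    return [c * scale for c in q]

q, scale = quantize_int8_symmetric([0.0, 0.5, -1.0])
restored = dequantize_int8_symmetric(q, scale)
```

The symmetric variant needs no zero point, so each cold entry carries just its codes plus a single float scale; the cost is a small rounding error on reload, which is why only colder KV entries would be stored this way.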

Inference.offload_layers_to_cpu() now also accepts an explicit policy=... argument. The supported CPU placement policies are auto, prefix, suffix, and middle-band, where auto resolves to middle-band. The runtime currently rejects simultaneous CPU and GPU offload requests rather than silently accepting a mixed placement it does not support.
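A sketch of how these placement policies could resolve to layer indices. The policy names and the auto-to-middle-band resolution come from the text; the exact index arithmetic is an illustrative assumption.

```python
def select_cpu_layers(num_layers, layers_num, policy="auto"):
    """Pick which transformer layer indices a CPU placement policy targets.

    Illustrative only: "auto" resolving to "middle-band" matches the text
    above; the selection math itself is an assumption.
    """
    if policy == "auto":
        policy = "middle-band"
    if policy == "prefix":
        return list(range(layers_num))  # first N layers
    if policy == "suffix":
        return list(range(num_layers - layers_num, num_layers))  # last N layers
    if policy == "middle-band":
        start = (num_layers - layers_num) // 2  # centered band of N layers
        return list(range(start, start + layers_num))
    raise ValueError(f"unknown policy: {policy}")
```

For a 12-layer model with layers_num=2, prefix would offload layers 0-1, suffix layers 10-11, and middle-band (or auto) a band in the middle of the stack.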

Typical AutoInference example

from ollm import AutoInference

o = AutoInference(
    "./models/gemma3-12B",
    adapter_dir="./myadapter/checkpoint-20",
    device="cuda:0",
    multimodality=False,
    logging=True,
)