Optimization Guide¶
This guide explains how oLLM's optimized-native path differs from the generic runtime and how to interpret specialization behavior.
Runtime tiers¶
Optimized-native¶
Used when a built-in alias or compatible native-family local model matches a specialization provider.
Current native families:
- llama
- gemma3
- qwen3-next
- gpt-oss
- voxtral
Optimized-native decoder-only text prompts now use bounded chunked prefill for long prompt ingestion before the final decode step. This is a memory-control path, not a blanket latency optimization, so prompt-scaling benchmarks are the truthful way to evaluate whether the chunking tradeoff helps on a given host.
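As a rough illustration of the idea (not oLLM's actual implementation), bounded chunked prefill feeds the prompt through the model in fixed-size slices, so peak activation memory is governed by the chunk size rather than the total prompt length. The model call shape and the chunk size below are assumptions for illustration only.

```python
# Illustrative sketch only: processes a long prompt in bounded chunks so
# activation memory scales with chunk_tokens, not total prompt length.
# The model interface and chunk_tokens=2048 are assumptions, not oLLM API.
import torch

def chunked_prefill(model, input_ids: torch.Tensor, chunk_tokens: int = 2048):
    past_key_values = None
    for start in range(0, input_ids.shape[1], chunk_tokens):
        chunk = input_ids[:, start:start + chunk_tokens]
        with torch.no_grad():
            out = model(chunk, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values  # KV grows, activations stay bounded
    return past_key_values  # handed to the final decode step
```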
Transformers-generic¶
Used for compatible local or materialized models that can run through the generic Transformers-backed path.
Specialization passes¶
Optimized-native planning can record reusable passes such as:
- disk-cache
- cpu-offload
- gpu-offload
- mlp-chunking
- moe-routing
- attention-replacement
- multimodal-shell
- gds-export-weights
These passes are validated against the assembled optimized runtime before execution proceeds.
mlp-chunking is currently the dense-sub-layer feasibility guard on the
optimized-native Llama, Gemma3, and Voxtral paths. The runtime keeps the
existing 16384-row ceiling as the normal fast path, derives smaller chunks
when accelerator headroom is tight, and accepts an explicit
dense_projection_chunk_rows override through RuntimeConfig, ollm.toml, or
OLLM_RUNTIME__DENSE_PROJECTION_CHUNK_ROWS. Smaller chunks reduce peak
activation pressure but can shift latency differently across devices, so this
path should be treated as a fallback-first memory control rather than a general
latency optimization.
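A minimal sketch of supplying the override, assuming the RuntimeConfig constructor accepts a keyword of the same name; the import path and the ollm.toml section name are assumptions, while the setting name and environment variable are the ones documented above.

```python
# Illustrative: the keyword argument mirrors the documented setting name; the
# RuntimeConfig constructor signature and import path are assumptions.
from ollm import RuntimeConfig  # assumed import path

config = RuntimeConfig(dense_projection_chunk_rows=4096)

# Equivalent ollm.toml entry (section name assumed):
#   [runtime]
#   dense_projection_chunk_rows = 4096
#
# Equivalent environment variable (documented above):
#   OLLM_RUNTIME__DENSE_PROJECTION_CHUNK_ROWS=4096
```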
Fallback behavior¶
If an optimized specialization cannot satisfy its planned pass contract and a compatible generic Transformers path exists, oLLM falls back safely to transformers-generic.
This is intentional. The runtime does not pretend the optimized path succeeded.
Offload and cache controls¶
When supported by the selected backend, oLLM can expose:
- disk KV cache
- CPU layer offload
- mixed GPU / CPU layer placement
These controls are backend-dependent. The generic path does not expose the same low-level layer-placement controls as optimized-native runtimes.
Current CPU offload policies are:
- auto
- prefix
- suffix
- middle-band
auto currently resolves to middle-band. Simultaneous CPU and GPU offload is
intentionally rejected in this slice because the mixed-placement path is still
prefix-shaped and would be misleading if reported as policy-driven.
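The policy names suggest which contiguous band of layers is kept on CPU; the sketch below is a reading of those names for illustration, and the band sizing and layer-selection logic are assumptions, not the runtime's code.

```python
# Illustrative reading of the policy names: each policy selects a contiguous
# band of layer indices to offload to CPU. Band sizing is an assumption.
def offload_band(policy: str, num_layers: int, band: int) -> range:
    if policy == "auto":
        policy = "middle-band"          # documented: auto resolves to middle-band
    if policy == "prefix":
        return range(0, band)
    if policy == "suffix":
        return range(num_layers - band, num_layers)
    if policy == "middle-band":
        start = (num_layers - band) // 2
        return range(start, start + band)
    raise ValueError(f"unknown cpu offload policy: {policy}")
```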
The optimized-native KV cache surface now exposes eight explicit presets:
- chunked
- paged
- streamed-segmented
- log-structured-journal
- sliding-window-ring-buffer
- quantized-cold-tier
- tiered-write-back
- resident
- chunked persists a manifest-backed chunk store under cache_dir/kv_cache_chunked.
- paged persists a fixed-capacity page table under cache_dir/kv_cache_paged, so movement and reconstruction stay aligned to deterministic page boundaries instead of variable chunk sizes.
- streamed-segmented persists a sequential segment-backed store under cache_dir/kv_cache_streamed_segmented.
- log-structured-journal persists a single append-oriented journal per layer under cache_dir/kv_cache_log_structured_journal and compacts entry metadata deterministically once the journal crosses its configured entry threshold.
- sliding-window-ring-buffer persists only the bounded recent tail under cache_dir/kv_cache_sliding_window_ring_buffer; once the configured kv_cache_window_tokens limit is exceeded, the oldest cached tokens are evicted under a drop-oldest policy. This mode changes runtime semantics and should be used only when a bounded recent context is the intended contract. Current local proof keeps it as an explicit opt-in mode rather than a general default strategy.
- quantized-cold-tier persists the same full-history journal shape under cache_dir/kv_cache_quantized_cold_tier, but stores colder entries in the explicit int8-symmetric-per-tensor representation and dequantizes back to the runtime dtype on load.
- tiered-write-back persists only the colder KV prefix under cache_dir/kv_cache_tiered_write_back while keeping a bounded hot region in memory; its cold tier now uses a journal-backed append store. That preset is still not the full future GPU/CPU/SSD tiered architecture.
- resident does not initialize any disk-KV root at all; it keeps full-history KV entirely in memory and exists as the explicit low-overhead baseline when the active runtime can afford that footprint. It is intentionally not aligned with oLLM's large-model spill/offload goal.
All seven disk-backed presets use typed raw tensor payloads plus explicit
metadata, and none uses opaque pickle-backed .pt cache blobs.
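As a hedged example of choosing one of these presets: kv_cache_strategy, cache_dir, and kv_cache_window_tokens are the names that appear in this guide, but passing them through a RuntimeConfig constructor like this is an assumption about the configuration surface.

```python
# Illustrative only: the setting names are documented, but this constructor
# form and import path are assumptions.
from ollm import RuntimeConfig  # assumed import path

config = RuntimeConfig(
    kv_cache_strategy="paged",        # one of the eight presets above
    cache_dir="/data/ollm-cache",     # paged store lands under kv_cache_paged
)

# For sliding-window-ring-buffer, the bounded tail is governed by the
# documented kv_cache_window_tokens limit (value here is illustrative):
window_config = RuntimeConfig(
    kv_cache_strategy="sliding-window-ring-buffer",
    kv_cache_window_tokens=8192,
    cache_dir="/data/ollm-cache",
)
```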
The runtime also applies a platform/resource-aware buffering or spill policy on top of the selected strategy so the cache can trade write amplification against memory headroom instead of flushing every delta identically on every machine.
Runtime strategy selector¶
The runtime now has a deterministic pre-run selector above the explicit KV presets.
Current selector profiles are:
- balanced
- latency
- capacity
- bounded-window
Current selector-default candidates are intentionally narrow:
- paged
- resident
- quantized-cold-tier
The other presets stay explicit opt-in or pinned overrides for now:
- sliding-window-ring-buffer
- streamed-segmented
- log-structured-journal
- tiered-write-back
The selector is not the same thing as the live KV adaptation surface.
kv_cache_adaptation_mode still describes post-start observe-only or automatic
adaptation behavior, while the selector chooses the initial strategy before the
runtime starts.
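A minimal sketch of what a deterministic pre-run selector over the narrow default candidates could look like; the thresholds and the profile-to-strategy mapping below are assumptions used to illustrate the idea, not the runtime's actual rules.

```python
# Illustrative decision sketch: maps a selector profile plus coarse host facts
# to one of the narrow default candidates. Thresholds and mapping are
# assumptions; only the profile and preset names come from this guide.
def select_initial_strategy(profile: str, free_memory_gb: float,
                            expected_kv_gb: float) -> str:
    # Only balanced / latency / capacity are sketched; the bounded-window
    # profile's mapping is not spelled out in this guide.
    if profile == "latency" and expected_kv_gb < free_memory_gb * 0.5:
        return "resident"             # full-history KV stays in memory
    if profile == "capacity":
        return "quantized-cold-tier"  # colder entries shrink to int8 on disk
    return "paged"                    # balanced default: page-aligned disk spill
```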
Within one loaded runtime, the cache layer now also keeps a resident in-process
snapshot of the reconstructed per-layer KV state so repeated updates do not
need to reread and rebuild the same on-disk history every token. For the
streamed-segmented store specifically, readback now coalesces extents by
segment file instead of replaying a separate file-range read for every extent.
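A sketch of the coalescing idea for streamed-segmented readback, assuming extents are (segment_file, offset, length) tuples; that data shape is an assumption, not the store's actual layout.

```python
# Illustrative: group extents by segment file and merge adjacent or overlapping
# ranges so each segment file gets fewer, larger reads instead of one read per
# extent. The (path, offset, length) extent shape is an assumption.
from collections import defaultdict

def coalesce_extents(extents):
    by_file = defaultdict(list)
    for path, offset, length in extents:
        by_file[path].append((offset, length))
    coalesced = {}
    for path, ranges in by_file.items():
        ranges.sort()
        merged = [list(ranges[0])]
        for offset, length in ranges[1:]:
            last = merged[-1]
            if offset <= last[0] + last[1]:   # adjacent or overlapping range
                last[1] = max(last[1], offset + length - last[0])
            else:
                merged.append([offset, length])
        coalesced[path] = [tuple(r) for r in merged]
    return coalesced
```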
GPT-OSS gds_export requirement¶
The optimized GPT-OSS path is intentionally strict:
- a validated gds_export/ tree must be present beside the model
- the export manifest must remain inside that export directory
- unsafe torch-serialized or pickle-backed artifacts are rejected
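A hedged pre-flight check along these lines; the manifest filename pattern and the suffixes treated as pickle-backed are assumptions for illustration, not the runtime's validation code.

```python
# Illustrative pre-flight check mirroring the documented requirements.
# The manifest glob and the suffixes treated as pickle-backed are assumptions.
from pathlib import Path

def check_gds_export(model_dir: Path) -> Path:
    export_dir = model_dir / "gds_export"
    if not export_dir.is_dir():
        raise FileNotFoundError("expected a gds_export/ tree beside the model")
    manifests = list(export_dir.glob("*manifest*"))   # exact name is an assumption
    if not manifests:
        raise FileNotFoundError("export manifest must live inside gds_export/")
    unsafe = [p for p in export_dir.rglob("*") if p.suffix in {".pt", ".bin", ".pkl"}]
    if unsafe:
        raise ValueError(f"pickle-backed artifacts are rejected: {unsafe[:3]}")
    return export_dir
```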
Hardware expectations¶
The benchmark harness is designed to stay truthful on limited-RAM hosts:
- planner overhead and no-specialization cost do not require large model loads
- runtime comparisons only load weights when the target actually exists locally
- the requested primary target now reports cold-start and warm-runtime behavior separately, including TTFT, inter-token latency, prompt/output throughput, peak memory, cache footprint, and supported utilization / allocator-gap metrics
- optimized-native benchmark requests can also expose native loader and KV IO timing summaries plus the storage paths used by the request
- prompt-length scaling, output-length scaling, and repeated-turn session growth are measured only for the requested primary target, not for every built-in family alias
- unavailable optimized comparisons are reported as unavailable instead of being fabricated
When present, the native runtime profile is the most truthful place to inspect whether an optimized request actually used GDS, standard safetensor IO, CPU-offloaded artifacts, or disk KV cache IO.
For disk KV specifically, kvload and kvsave now represent reads and writes
against the selected explicit disk-KV store rather than whole-layer torch
artifacts. Benchmark/runtime output also reports kv_cache_strategy so the
active backend is visible during A/B runs, and cache_state exposes the
hot/cold split, persisted artifact count, compaction count, and cold-store
format for tier-aware strategies.
The reported disk-cache footprint reflects the persisted chunk store only. A
selected KV policy may keep a bounded tail in memory until its spill or flush
threshold is reached.
When a request is satisfied from the resident in-process KV snapshot rather
than from disk, kvload can legitimately be absent for that step even though
disk KV remains the selected strategy.
Those native event totals are operation-level timings. On runtimes that submit multiple storage reads before waiting for completion, the summed native IO event totals can exceed the enclosing request wall-clock time because the reads can overlap.
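As a concrete illustration of the overlap effect (numbers are illustrative): if a runtime submits two 40 ms reads that overlap inside a single 50 ms request, the summed native IO event total is 80 ms even though the request's wall-clock time is only 50 ms.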