KV Strategy Matrix¶
oLLM's current KV cache presets are still selected as single strings such as
resident, chunked, paged, streamed-segmented, log-structured-journal,
sliding-window-ring-buffer, quantized-cold-tier, and
tiered-write-back.
That is still the public control surface today, but the system is now being scaffolded around a more explicit internal matrix so future strategies do not turn into one giant preset enum with hidden semantics.
Current matrix axes¶
The current scaffold separates these concerns:
- persistence format
- residency mode
- window policy
- cold-tier encoding
- cache lifecycle
- adaptation mode
The first four describe the current preset itself. The last two describe how that preset is owned and how telemetry may influence future runtime choices.
Presets versus axes¶
Current presets are now understood as bundles of axis values:
| Preset | Persistence Format | Residency Mode | Window Policy | Cold-Tier Encoding |
|---|---|---|---|---|
resident |
resident-only |
fully-resident |
full-history |
full-precision |
chunked |
chunked-manifest |
buffered-tail |
full-history |
full-precision |
paged |
paged-manifest |
buffered-tail |
full-history |
full-precision |
streamed-segmented |
streamed-segmented |
buffered-tail |
full-history |
full-precision |
log-structured-journal |
log-structured-journal |
buffered-tail |
full-history |
full-precision |
sliding-window-ring-buffer |
sliding-window-ring-buffer |
buffered-tail |
sliding-window |
full-precision |
quantized-cold-tier |
log-structured-journal |
buffered-tail |
full-history |
quantized |
tiered-write-back |
log-structured-journal |
tiered-write-back |
full-history |
full-precision |
The current tiered-write-back preset should not be confused with a future
broader multi-tier GPU/CPU/SSD architecture.
This does not erase the semantic differences between presets. For example,
sliding-window-ring-buffer deliberately preserves only a bounded recent
history under a drop-oldest eviction policy; it is not a transparent
substitute for the full-history strategies. The matrix still keeps the shape
explicit so future presets and richer bounded window variants can compose
cleanly.
Cache lifecycle¶
Cache lifecycle is intentionally treated as a separate axis:
runtime-scopedpersistent
Current default behavior is still runtime-scoped, but persistent is now an
explicit implemented mode.
That means oLLM can now distinguish between:
- within-runtime reuse
- explicit persistent reuse across later runs under a lifecycle-aware, model/backend-scoped cache root
This distinction matters because cross-run persistence adds separate concerns:
- cache identity and invalidation
- schema compatibility
- partial-write recovery
- retention / GC
- multi-process coordination
- benchmark truth for cold versus warm starts
Adaptation mode¶
Adaptation mode is also scaffolded explicitly:
disabledobserve-onlyautomatic
Current behavior now supports observe-only recommendation rules. The runtime can emit a truthful recommendation from current KV state, but it does not yet switch KV strategies live.
The intended progression is:
- report signals
- make observe-only recommendations
- prove the recommendation loop against benchmarks and repeated-session runs
- only then enable real automatic policy changes
Strategy selector¶
The repo now also has a deterministic pre-run strategy selector above the explicit presets.
Current selector profiles are:
balancedlatencycapacitybounded-window
Current selector-default candidates are intentionally conservative:
pagedresidentquantized-cold-tier
The following presets remain explicit opt-in or pinned overrides:
sliding-window-ring-bufferstreamed-segmentedlog-structured-journaltiered-write-back
This selector is separate from kv_cache_adaptation_mode.
- selector: chooses the initial strategy before runtime execution
- adaptation mode: emits observe-only recommendations or future live changes from current KV state
Resident-state observability¶
The in-process resident KV snapshot is now treated as a first-class observable part of cache state, not just an implementation detail.
Current reporting now has room to distinguish:
- persisted state
- resident in-process state
- hot/pending tails
- bounded-window limits and eviction totals
- cold-tier representation when the persisted encoding is quantized
That matters because resident reuse can improve request behavior without showing up as persisted on-disk KV activity.
Near-term intent¶
This scaffold is meant to support the next wave of work:
- paged persistence
- quantized cold tiers
- a future page-aware GPU / CPU / SSD tiered architecture
- observe-only adaptation recommendations
- stronger persistent lifecycle reuse and retention policy
The important rule is: add new behavior by filling in these axes, not by letting one preset label absorb unrelated semantics forever.