Optimized-native Inference API¶
These helpers expose the low-level optimized-native path directly. For new
high-level application code, prefer RuntimeClient.
Inference¶

Direct optimized-native helper for built-in aliases.

This class is the low-level optimized-native entry point. It is best suited
for scripts that need direct model control and intentionally bypass the
higher-level RuntimeClient surface.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | Built-in optimized alias to load. | *required* |
| `device` | `str` | Target device string such as `'cuda:0'`. | `'cuda:0'` |
| `logging` | `bool` | Whether to collect runtime stats. | `True` |
| `multimodality` | `bool` | Whether the runtime should plan for multimodal execution. | `False` |
| `specialization_registry` | `SpecializationRegistry \| None` | Optional specialization registry override. | `None` |
| `resolver` | `ModelResolver \| None` | Optional model resolver override. | `None` |
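A minimal construction sketch; the import path and alias name below are placeholders, since this section does not name the module or any concrete alias:

```python
# Placeholder import path; substitute your package's actual module.
from your_package import Inference

# Bind a built-in optimized alias to the first CUDA device.
engine = Inference(
    model_id="some-builtin-alias",  # placeholder alias name
    device="cuda:0",
    logging=True,          # collect runtime stats
    multimodality=False,   # text-only execution plan
)
```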
hf_download¶

hf_download(
    model_dir: str, force_download: bool = False
) -> None
Download the built-in optimized alias into a local directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_dir` | `str` | Target local model directory. | *required* |
| `force_download` | `bool` | Whether to force a fresh snapshot download. | `False` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Raised when the current optimized alias is invalid. |
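A download-only sketch, continuing with the `engine` constructed above (the directory path is illustrative):

```python
# Fetch the alias snapshot without loading the runtime;
# force_download=True would refresh an existing local snapshot.
engine.hf_download(model_dir="./models/some-builtin-alias")
```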
ini_model¶

ini_model(
    models_dir: str = "./models/",
    force_download: bool = False,
) -> None
Download if needed and then load the optimized-native runtime.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `models_dir` | `str` | Parent directory that will contain the managed model directory. | `'./models/'` |
| `force_download` | `bool` | Whether to force a fresh snapshot download. | `False` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Raised when the current optimized alias is invalid or the local model directory cannot be prepared. |
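Typical one-call setup, again assuming the `engine` from the construction sketch:

```python
# Downloads into ./models/ only when the snapshot is missing,
# then loads the optimized-native runtime.
engine.ini_model(models_dir="./models/")
```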
load_model¶

load_model(model_dir: str) -> None
Load an optimized-native runtime from a local directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_dir` | `str` | Local model directory for the optimized alias. | *required* |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Raised when the path does not exist or the current optimized alias is invalid. |
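If the model directory is already on disk, it can be loaded directly; the path below is illustrative:

```python
try:
    engine.load_model("./models/some-builtin-alias")
except ValueError as exc:
    # A missing path and an invalid alias both surface as ValueError.
    print(f"load failed: {exc}")
```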
offload_layers_to_cpu¶

offload_layers_to_cpu(
    layers_num: int, policy: str = "prefix"
) -> None
Apply CPU layer offload through the selected specialization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `layers_num` | `int` | Number of layers to place on CPU. | *required* |
| `policy` | `str` | Placement policy such as `'prefix'`. | `'prefix'` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Raised when the model is not loaded or the selected specialization does not expose CPU offload support. |
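For example, pinning an initial run of layers to CPU after the model is loaded (the layer count is illustrative):

```python
# Place the first 8 layers on CPU using the default 'prefix' policy.
engine.offload_layers_to_cpu(layers_num=8, policy="prefix")
```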
offload_layers_to_gpu_cpu¶

offload_layers_to_gpu_cpu(
    gpu_layers_num: int = 0, cpu_layers_num: int = 0
) -> None
Apply mixed GPU/CPU layer placement when the specialization exposes it.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `gpu_layers_num` | `int` | Number of layers to keep on the accelerator. | `0` |
| `cpu_layers_num` | `int` | Number of layers to move to CPU. | `0` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Raised when the specialization does not expose mixed placement support. |
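A guarded mixed-placement sketch; because not every specialization exposes mixed placement, the `ValueError` path matters (the layer split is illustrative):

```python
try:
    # Keep 24 layers on the accelerator and move 8 to CPU.
    engine.offload_layers_to_gpu_cpu(gpu_layers_num=24, cpu_layers_num=8)
except ValueError:
    # Specialization without mixed placement: fall back to CPU-only offload.
    engine.offload_layers_to_cpu(layers_num=8)
```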
DiskCache¶

DiskCache(
    cache_dir: str = "./kvcache",
    cache_strategy: str | None = None,
    cache_lifecycle: str | None = None,
    cache_window_tokens: int | None = None,
) -> object | None
Create the specialization-backed KV cache when supported.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cache_dir` | `str` | Base cache directory for the cache instance. | `'./kvcache'` |
| `cache_strategy` | `str \| None` | Optional explicit KV strategy override. | `None` |
| `cache_lifecycle` | `str \| None` | Optional cache lifecycle override. | `None` |
| `cache_window_tokens` | `int \| None` | Optional sliding-window token budget override. | `None` |
Returns:

| Type | Description |
|---|---|
| `object \| None` | Specialization-backed cache object, or `None` when the loaded specialization does not expose a cache factory. |
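A cache-creation sketch; the token budget is illustrative, and valid strategy strings are specialization-specific, so the override is left at its default here:

```python
cache = engine.DiskCache(
    cache_dir="./kvcache",
    cache_strategy=None,       # let the specialization pick its default
    cache_window_tokens=4096,  # illustrative sliding-window budget
)
if cache is None:
    # The loaded specialization does not expose a cache factory.
    print("KV disk cache unavailable for this specialization")
```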
AutoInference¶

Bases: Inference

Optimized-native helper for compatible local model directories.

AutoInference inspects a local model directory, infers the matching
optimized-native family, and then loads the same optimized path that
Inference uses for built-in aliases.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_dir` | `str` | Local model directory to inspect and load. | *required* |
| `adapter_dir` | `str \| None` | Optional LoRA adapter directory. | `None` |
| `device` | `str` | Target device string such as `'cuda:0'`. | `'cuda:0'` |
| `logging` | `bool` | Whether to collect runtime stats. | `True` |
| `multimodality` | `bool` | Whether the runtime should plan for multimodal execution. | `False` |
| `specialization_registry` | `SpecializationRegistry \| None` | Optional specialization registry override. | `None` |
| `resolver` | `ModelResolver \| None` | Optional model resolver override. | `None` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Raised when the local model directory is missing, uses an unsupported architecture, or references an invalid adapter directory. |
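A local-directory sketch; the paths are placeholders and the import path is assumed, not confirmed by this section:

```python
from your_package import AutoInference  # placeholder import path

try:
    engine = AutoInference(
        model_dir="./models/local-model",  # placeholder directory
        adapter_dir=None,                  # or a LoRA adapter directory
        device="cuda:0",
    )
except ValueError as exc:
    # Missing directory, unsupported architecture, or bad adapter directory.
    print(f"AutoInference rejected the directory: {exc}")
```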