Optimized-native Inference API¶
These helpers expose the low-level optimized-native path directly. For new
high-level application code, prefer RuntimeClient.
Inference¶

Direct optimized-native helper for built-in aliases.

This class is the low-level optimized-native entry point. It is best suited
for scripts that need direct model control and intentionally bypass the
higher-level RuntimeClient surface.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_id` | `str` | Built-in optimized alias to load. | *required* |
| `device` | `str` | Target device string such as `'cuda:0'`. | `'cuda:0'` |
| `logging` | `bool` | Whether to collect runtime stats. | `True` |
| `multimodality` | `bool` | Whether the runtime should plan for multimodal execution. | `False` |
| `specialization_registry` | `SpecializationRegistry \| None` | Optional specialization registry override. | `None` |
| `resolver` | `ModelResolver \| None` | Optional model resolver override. | `None` |
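A minimal construction sketch; the import path and alias name below are placeholders, since this section does not name the module or any concrete alias:

```python
# Placeholder import path; substitute your package's actual module.
from your_package import Inference

# Bind a built-in optimized alias to the first CUDA device.
engine = Inference(
    model_id="some-builtin-alias",  # placeholder alias name
    device="cuda:0",
    logging=True,          # collect runtime stats
    multimodality=False,   # text-only execution plan
)
```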
hf_download¶

hf_download(
    model_dir: str, force_download: bool = False
) -> None
Download the built-in optimized alias into a local directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_dir` | `str` | Target local model directory. | *required* |
| `force_download` | `bool` | Whether to force a fresh snapshot download. | `False` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Raised when the current optimized alias is invalid. |
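A download-only sketch, continuing with the `engine` constructed above (the directory path is illustrative):

```python
# Fetch the alias snapshot without loading the runtime;
# force_download=True would refresh an existing local snapshot.
engine.hf_download(model_dir="./models/some-builtin-alias")
```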
ini_model¶

ini_model(
    models_dir: str = "./models/",
    force_download: bool = False,
) -> None
Download if needed and then load the optimized-native runtime.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `models_dir` | `str` | Parent directory that will contain the managed model directory. | `'./models/'` |
| `force_download` | `bool` | Whether to force a fresh snapshot download. | `False` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Raised when the current optimized alias is invalid or the local model directory cannot be prepared. |
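Typical one-call setup, again assuming the `engine` from the construction sketch:

```python
# Downloads into ./models/ only when the snapshot is missing,
# then loads the optimized-native runtime.
engine.ini_model(models_dir="./models/")
```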
load_model¶

load_model(model_dir: str) -> None
Load an optimized-native runtime from a local directory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_dir` | `str` | Local model directory for the optimized alias. | *required* |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Raised when the path does not exist or the current optimized alias is invalid. |
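If the model directory is already on disk, it can be loaded directly; the path below is illustrative:

```python
try:
    engine.load_model("./models/some-builtin-alias")
except ValueError as exc:
    # A missing path and an invalid alias both surface as ValueError.
    print(f"load failed: {exc}")
```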
offload_layers_to_cpu¶

offload_layers_to_cpu(
    layers_num: int, policy: str = "prefix"
) -> None
Apply CPU layer offload through the selected specialization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `layers_num` | `int` | Number of layers to place on CPU. | *required* |
| `policy` | `str` | Placement policy such as `'prefix'`. | `'prefix'` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Raised when the model is not loaded or the selected specialization does not expose CPU offload support. |
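For example, pinning an initial run of layers to CPU after the model is loaded (the layer count is illustrative):

```python
# Place the first 8 layers on CPU using the default 'prefix' policy.
engine.offload_layers_to_cpu(layers_num=8, policy="prefix")
```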
offload_layers_to_gpu_cpu¶

offload_layers_to_gpu_cpu(
    gpu_layers_num: int = 0, cpu_layers_num: int = 0
) -> None
Apply mixed GPU/CPU layer placement when the specialization exposes it.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `gpu_layers_num` | `int` | Number of layers to keep on the accelerator. | `0` |
| `cpu_layers_num` | `int` | Number of layers to move to CPU. | `0` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Raised when the specialization does not expose mixed placement support. |
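A guarded mixed-placement sketch; because not every specialization exposes mixed placement, the `ValueError` path matters (the layer split is illustrative):

```python
try:
    # Keep 24 layers on the accelerator and move 8 to CPU.
    engine.offload_layers_to_gpu_cpu(gpu_layers_num=24, cpu_layers_num=8)
except ValueError:
    # Specialization without mixed placement: fall back to CPU-only offload.
    engine.offload_layers_to_cpu(layers_num=8)
```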
DiskCache¶

DiskCache(
    cache_dir: str = "./kvcache",
    cache_strategy: str | None = None,
    cache_lifecycle: str | None = None,
    cache_window_tokens: int | None = None,
) -> object | None
Create the specialization-backed KV cache when supported.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cache_dir` | `str` | Base cache directory for the cache instance. | `'./kvcache'` |
| `cache_strategy` | `str \| None` | Optional explicit KV strategy override. | `None` |
| `cache_lifecycle` | `str \| None` | Optional cache lifecycle override. | `None` |
| `cache_window_tokens` | `int \| None` | Optional sliding-window token budget override. | `None` |
Returns:

| Type | Description |
|---|---|
| `object \| None` | Specialization-backed cache object, or `None` when the loaded specialization does not expose a cache factory. |
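A cache-creation sketch; the token budget is illustrative, and valid strategy strings are specialization-specific, so the override is left at its default here:

```python
cache = engine.DiskCache(
    cache_dir="./kvcache",
    cache_strategy=None,       # let the specialization pick its default
    cache_window_tokens=4096,  # illustrative sliding-window budget
)
if cache is None:
    # The loaded specialization does not expose a cache factory.
    print("KV disk cache unavailable for this specialization")
```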
AutoInference¶

Bases: Inference

Optimized-native helper for compatible local model directories.

AutoInference inspects a local model directory, infers the matching
optimized-native family, and then loads the same optimized path that
Inference uses for built-in aliases.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_dir` | `str` | Local model directory to inspect and load. | *required* |
| `adapter_dir` | `str \| None` | Optional LoRA adapter directory. | `None` |
| `device` | `str` | Target device string such as `'cuda:0'`. | `'cuda:0'` |
| `logging` | `bool` | Whether to collect runtime stats. | `True` |
| `multimodality` | `bool` | Whether the runtime should plan for multimodal execution. | `False` |
| `specialization_registry` | `SpecializationRegistry \| None` | Optional specialization registry override. | `None` |
| `resolver` | `ModelResolver \| None` | Optional model resolver override. | `None` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Raised when the local model directory is missing, uses an unsupported architecture, or references an invalid adapter directory. |
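A local-directory sketch; the paths are placeholders and the import path is assumed, not confirmed by this section:

```python
from your_package import AutoInference  # placeholder import path

try:
    engine = AutoInference(
        model_dir="./models/local-model",  # placeholder directory
        adapter_dir=None,                  # or a LoRA adapter directory
        device="cuda:0",
    )
except ValueError as exc:
    # Missing directory, unsupported architecture, or bad adapter directory.
    print(f"AutoInference rejected the directory: {exc}")
```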