# Optimized-native Inference API

These helpers expose the low-level optimized-native path directly; they download only the runtime-critical Hugging Face artifacts into a local model directory. For new high-level application code, prefer RuntimeClient.

## Inference

Direct optimized-native helper for built-in aliases.

This class is the low-level optimized-native entry point. It is best suited for scripts that need direct model control and intentionally bypass the higher-level RuntimeClient surface.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_id` | `str` | Built-in optimized alias to load. | *required* |
| `device` | `str` | Target device string such as `"cuda:0"` or `"mps"`. | `'cuda:0'` |
| `logging` | `bool` | Whether to collect runtime stats. | `True` |
| `multimodality` | `bool` | Whether the runtime should plan for multimodal execution. | `False` |
| `specialization_registry` | `SpecializationRegistry \| None` | Optional specialization registry override. | `None` |
| `resolver` | `ModelResolver \| None` | Optional model resolver override. | `None` |
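
A minimal construction sketch. The import path `optimized_runtime` and the alias `"opt-alias"` are placeholders, not names confirmed by this reference:

```python
# Hypothetical import path and alias -- substitute your package's real
# module name and one of its built-in optimized aliases.
from optimized_runtime import Inference

infer = Inference(
    model_id="opt-alias",    # placeholder built-in optimized alias
    device="cuda:0",
    logging=True,            # collect runtime stats
    multimodality=False,
)
```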

### hf_download

```python
hf_download(
    model_dir: str, force_download: bool = False
) -> None
```

Download the built-in optimized alias into a local directory.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_dir` | `str` | Target local model directory. | *required* |
| `force_download` | `bool` | Whether to force a fresh snapshot download. | `False` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | Raised when the current `optimized_model_id` is not a built-in optimized alias. |
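
A hedged usage sketch for the download step, again under placeholder names; `ValueError` is the documented failure mode:

```python
from optimized_runtime import Inference  # hypothetical import path

infer = Inference(model_id="opt-alias")  # placeholder alias
try:
    # Fetch only the runtime-critical artifacts into the target directory.
    infer.hf_download(model_dir="./models/opt-alias", force_download=False)
except ValueError as exc:
    # Raised when optimized_model_id is not a built-in optimized alias.
    print(f"cannot download: {exc}")
```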

### ini_model

```python
ini_model(
    models_dir: str = "./models/",
    force_download: bool = False,
) -> None
```

Download if needed and then load the optimized-native runtime.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `models_dir` | `str` | Parent directory that will contain the managed model directory. | `'./models/'` |
| `force_download` | `bool` | Whether to force a fresh snapshot download. | `False` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | Raised when the current optimized alias is invalid or the local model directory cannot be prepared. |
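
A sketch of the one-call download-and-load flow, under the same placeholder names:

```python
from optimized_runtime import Inference  # hypothetical import path

infer = Inference(model_id="opt-alias", device="cuda:0")  # placeholder alias
# Downloads into the managed directory under ./models/ only when it is
# missing, then loads the optimized-native runtime.
infer.ini_model(models_dir="./models/", force_download=False)
```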

### load_model

```python
load_model(model_dir: str) -> None
```

Load an optimized-native runtime from a local directory.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_dir` | `str` | Local model directory for the optimized alias. | *required* |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | Raised when the path does not exist or the current optimized alias is invalid. |
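
A sketch for loading from an already prepared directory, skipping the download step (placeholder names as above):

```python
from optimized_runtime import Inference  # hypothetical import path

infer = Inference(model_id="opt-alias")  # placeholder alias
# Load directly from a local directory; ValueError is documented for a
# missing path or an invalid optimized alias.
infer.load_model(model_dir="./models/opt-alias")
```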

### offload_layers_to_cpu

```python
offload_layers_to_cpu(
    layers_num: int, policy: str = "prefix"
) -> None
```

Apply CPU layer offload through the selected specialization.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `layers_num` | `int` | Number of layers to place on CPU. | *required* |
| `policy` | `str` | Placement policy such as `"prefix"`, `"suffix"`, or `"middle-band"`. | `'prefix'` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | Raised when the model is not loaded or the selected specialization does not expose CPU offload support. |
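
A sketch of CPU offload. The model is loaded first, since an unloaded model is a documented `ValueError` (placeholder names as above):

```python
from optimized_runtime import Inference  # hypothetical import path

infer = Inference(model_id="opt-alias")  # placeholder alias
infer.ini_model()  # must be loaded before offload is applied

# Place the first 8 layers on CPU; "suffix" and "middle-band" are the
# other documented placement policies.
infer.offload_layers_to_cpu(layers_num=8, policy="prefix")
```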

### offload_layers_to_gpu_cpu

```python
offload_layers_to_gpu_cpu(
    gpu_layers_num: int = 0, cpu_layers_num: int = 0
) -> None
```

Apply mixed GPU/CPU layer placement when the specialization exposes it.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `gpu_layers_num` | `int` | Number of layers to keep on the accelerator. | `0` |
| `cpu_layers_num` | `int` | Number of layers to move to CPU. | `0` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | Raised when the specialization does not expose mixed placement support. |
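
A sketch of mixed placement, guarding the documented `ValueError` for specializations without mixed-placement support (placeholder names as above):

```python
from optimized_runtime import Inference  # hypothetical import path

infer = Inference(model_id="opt-alias")  # placeholder alias
infer.ini_model()

try:
    # Keep 24 layers on the accelerator and move 8 to CPU.
    infer.offload_layers_to_gpu_cpu(gpu_layers_num=24, cpu_layers_num=8)
except ValueError:
    # The selected specialization does not expose mixed placement.
    pass
```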

### DiskCache

```python
DiskCache(
    cache_dir: str = "./kvcache",
    cache_strategy: str | None = None,
    cache_lifecycle: str | None = None,
    cache_window_tokens: int | None = None,
) -> object | None
```

Create the specialization-backed KV cache when supported.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `cache_dir` | `str` | Base cache directory for the cache instance. | `'./kvcache'` |
| `cache_strategy` | `str \| None` | Optional explicit KV strategy override. | `None` |
| `cache_lifecycle` | `str \| None` | Optional cache lifecycle override. | `None` |
| `cache_window_tokens` | `int \| None` | Optional sliding-window token budget override. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `object \| None` | Specialization-backed cache object, or `None` when the loaded specialization does not expose a cache factory. |
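
A sketch of creating the cache and handling the documented `None` return; calling `DiskCache` on the loaded instance is an assumption here, along with the placeholder names:

```python
from optimized_runtime import Inference  # hypothetical import path

infer = Inference(model_id="opt-alias")  # placeholder alias
infer.ini_model()

cache = infer.DiskCache(
    cache_dir="./kvcache",
    cache_window_tokens=4096,  # optional sliding-window budget override
)
if cache is None:
    # The loaded specialization does not expose a cache factory.
    print("KV cache unavailable for this specialization")
```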

## AutoInference

Bases: Inference

Optimized-native helper for compatible local model directories.

AutoInference inspects a local model directory, infers the matching optimized-native family, and then loads it through the same optimized path that Inference uses for built-in aliases.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_dir` | `str` | Local model directory to inspect and load. | *required* |
| `adapter_dir` | `str \| None` | Optional LoRA adapter directory. | `None` |
| `device` | `str` | Target device string such as `"cuda:0"` or `"mps"`. | `'cuda:0'` |
| `logging` | `bool` | Whether to collect runtime stats. | `True` |
| `multimodality` | `bool` | Whether the runtime should plan for multimodal execution. | `False` |
| `specialization_registry` | `SpecializationRegistry \| None` | Optional specialization registry override. | `None` |
| `resolver` | `ModelResolver \| None` | Optional model resolver override. | `None` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | Raised when the local model directory is missing, uses an unsupported architecture, or references an invalid adapter directory. |
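
A minimal sketch; the import path and the local directory are placeholders:

```python
from optimized_runtime import AutoInference  # hypothetical import path

# Inspect a compatible local directory, infer the optimized-native
# family, and load through the same path Inference uses for aliases.
auto = AutoInference(
    model_dir="./models/my-local-model",   # placeholder directory
    adapter_dir=None,                      # optional LoRA adapter dir
    device="cuda:0",
)
```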