Weight Transport ADR¶
This page is contributor-facing architectural guidance.
It is not customer-facing product documentation and not an implementation claim. It records the recommended direction for compressed or FP8-staged weight transport in streamed native runtimes so later implementation work stays grounded in the current oLLM architecture.
Status¶
- audience: implementers and maintainers
- kind: ADR / design decision
- implementation status: not implemented
Scope¶
This ADR defines the recommended direction for reducing the amount of weight data moved during layer-streamed native execution.
It covers:
- layer-streamed weight loads from safetensors and `gds_export`
- optional transport encodings separate from execution dtype
- required safety, materialization, and benchmark changes
It does not propose a general model-format rewrite.
Decision summary¶
oLLM should not start with generic "compressed transport" across all streamed native runtimes.
The recommended direction is:
- reject generic lossless compressed transport as the first implementation slice
- treat transport encoding as separate from both authoritative on-disk model format and final execution dtype
- start with one explicit, manifest-backed transport contract on the custom `gds_export` path
- make the first concrete implementation target CUDA-only FP8-staged transport with explicit upcast before execution
- require transport-specific benchmark and inspection fields before any speed or memory claim
- require separately labeled upstream concept baselines for optimization claims
Why the current baseline is not enough¶
Current streamed native runtimes still move full execution-ready weight bytes for each loaded layer.
That limitation is visible in the current code:
- `src/ollm/llama.py` and `src/ollm/gemma3.py` preload and materialize one layer's safetensors, queue the next dense layer's reads one stage ahead when possible, assign tensors into the live module, then unload them again after the forward pass.
- `src/ollm/gds_loader.py` reads direct byte ranges from safetensor files or raw `gds_export` blobs into CPU or CUDA buffers.
- `src/ollm/gds_async.py` returns execution-ready tensors from pending read helpers; it does not expose a transport-decode stage yet.
The benchmark docs already identify loader-streamed safetensor IO as a real cost on some lanes. That makes weight transport a credible future optimization target, but only if the design stays truthful about where the bytes and latency actually move.
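The streaming pattern described above can be sketched in miniature. This is an illustrative stand-in, not oLLM's actual API: the stub loader, `load_layer`, and `apply_layer` names are assumptions made for the example, and the real runtime queues asynchronous safetensor or `gds_export` byte reads where this sketch resolves immediately.

```python
# Minimal sketch of the layer-streamed loop: materialize one layer,
# prefetch the next one stage ahead, run the forward pass, then drop
# the weights. All names here are illustrative, not oLLM's real API.
class _StubLoader:
    """Pretend loader: 'reading' a layer just returns its weight list."""
    def __init__(self, weights_per_layer):
        self._w = weights_per_layer

    def load_layer(self, i):
        # The real runtime would queue safetensor/gds_export byte reads
        # here; this stub resolves immediately.
        return lambda: self._w[i]

def run_streamed(num_layers, loader, x, apply_layer):
    pending = loader.load_layer(0)              # queue first layer's reads
    for i in range(num_layers):
        weights = pending()                     # materialize this layer
        if i + 1 < num_layers:
            pending = loader.load_layer(i + 1)  # prefetch one stage ahead
        x = apply_layer(weights, x)             # forward pass with live weights
        # weights go out of scope here, mirroring the unload step
    return x

loader = _StubLoader([[2], [3], [5]])
out = run_streamed(3, loader, 1, lambda w, x: x * w[0])
print(out)  # 1 * 2 * 3 * 5 = 30
```

Note that every layer still moves its full execution-ready bytes through `pending()`; that per-layer byte volume is exactly what a transport encoding would shrink.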
Design goals¶
- Reduce moved bytes on loader-streamed paths where IO or transfer is the real bottleneck.
- Keep transport encoding separate from execution dtype and from authoritative at-rest model format.
- Preserve explicit safety validation for all transport artifacts.
- Make benchmark evidence able to distinguish encoded bytes, decode work, and final runtime cost.
- Keep fallback behavior explicit instead of pretending an unsupported transport path succeeded.
Non-goals¶
- Universal support across all native families in one slice.
- Replacing safetensors as the default model format.
- Hand-wavy 100B-plus claims without IO math and proof.
- Treating FP8-staged transport as end-to-end FP8 inference.
- Claiming GPU-side decompression before the repo has an explicit kernel/runtime path for it.
Critical truth constraints¶
Transport encoding is not at-rest format¶
The ADR must keep three concepts separate:
- authoritative on-disk model format
- optional streamed transport encoding
- final runtime execution dtype
If those are blurred together, the work stops being a transport design and quietly becomes a model-format rewrite proposal.
FP8 staging is not already supported¶
Today the readers map storage bytes directly into execution-ready tensors.
There is no current transport contract for:
- encoded dtype
- decode target dtype
- decode location
- scratch-buffer budget
- encoded versus decoded byte counts
So FP8-staged transport is not a loader tweak. It needs a first-class contract.
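As a sketch of what "first-class contract" means here, the missing pieces above could be captured in a per-tensor record. The class, field names, and the FP8 dtype string are assumptions for illustration, not an implemented schema:

```python
# Hedged sketch of a per-tensor transport contract covering the items
# listed above; field names and dtype strings are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class TransportSpec:
    encoded_dtype: str      # e.g. "float8_e4m3fn" (assumed name)
    decode_dtype: str       # execution dtype the decode must produce
    decode_location: str    # e.g. "cuda-upcast"
    scratch_nbytes: int     # scratch-buffer budget for the decode step
    encoded_nbytes: int     # bytes actually moved over the transport
    decoded_nbytes: int     # bytes after upcast to execution dtype

    @property
    def transport_ratio(self) -> float:
        # Encoded-vs-decoded byte ratio; FP8 -> bf16 would be ~0.5.
        return self.encoded_nbytes / self.decoded_nbytes

spec = TransportSpec("float8_e4m3fn", "bfloat16", "cuda-upcast",
                     scratch_nbytes=0, encoded_nbytes=1024,
                     decoded_nbytes=2048)
print(spec.transport_ratio)  # 0.5
```

The point of the record is that none of these values can be inferred from raw storage bytes alone, which is why the current readers cannot grow FP8 staging as a tweak.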
Optimization claims require upstream baselines¶
Fork-internal A/B results are not enough.
Future transport work must compare:
- current fork baseline
- transport-enabled fork result
- separately labeled upstream concept baseline where applicable
The existing benchmark workflow already documents how upstream baselines stay outside this repo tree and keep separate codebase labels.
Credible options¶
| Option | Credibility in current architecture | Why |
|---|---|---|
| Generic seekable lossless compression over safetensors | low | Current safetensor readers depend on raw offset-addressable bytes and would need a new seekable compressed container plus decompression-aware readers |
| FP8-staged transport with upcast before execution | medium | Credible only if modeled as an explicit transport encoding contract, not as a safetensor dtype shortcut |
| Narrow first step on `gds_export` only | high | `gds_export` already uses an explicit manifest, strict safety validation, and a dedicated GDS loader path |
Recommended first direction¶
The recommended first direction is:
- define one explicit transport contract for `gds_export`
- use that contract for CUDA-only FP8-staged transport
- upcast to the current execution dtype before layer execution
Why this is the most credible first step:
- GPT-OSS already has a manifest-backed `gds_export` specialization path
- the path is already safety-gated
- the path already bypasses the generic safetensor-reader assumptions
- benchmark/reporting can extend one explicit path without pretending the whole runtime stack already supports transport encodings
The ADR explicitly rejects generic lossless compression as v1 because the current CPU and GPU safetensor readers assume direct typed byte ranges, not a compressed-block container.
Recommended transport contract¶
The first transport-aware manifest should make these fields explicit per tensor:
- `encoding`
- `encoded_dtype`
- `encoded_nbytes`
- `decoded_dtype`
- `decoded_shape`
- `decoded_nbytes`
- `scale_encoding`
- `scale_path`
- `checksum`
- `decode_location`
Recommended initial decode_location values:
- `cuda-upcast`
- `cpu-decode-then-transfer`

The first implementation recommendation is to support only `cuda-upcast`.
That keeps the runtime claim narrow and avoids pretending CPU or MPS benefits exist before they are proven.
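Concretely, a manifest entry carrying the fields above might look like the following. The JSON layout, field values, and the validation helper are all assumptions made for illustration; only the field names come from this ADR:

```python
# Illustrative transport-manifest entry using the fields listed above.
# The JSON shape and the decode_location allowlist are assumptions,
# not a shipped schema.
import json

entry = json.loads("""
{
  "encoding": "fp8-staged",
  "encoded_dtype": "float8_e4m3fn",
  "encoded_nbytes": 4194304,
  "decoded_dtype": "bfloat16",
  "decoded_shape": [2048, 2048],
  "decoded_nbytes": 8388608,
  "scale_encoding": "per-tensor-fp32",
  "scale_path": "layer0.q_proj.scale.bin",
  "checksum": "sha256:<digest>",
  "decode_location": "cuda-upcast"
}
""")

ALLOWED_DECODE_LOCATIONS = {"cuda-upcast"}  # v1 keeps the claim narrow

def validate_entry(e: dict) -> None:
    required = {"encoding", "encoded_dtype", "encoded_nbytes",
                "decoded_dtype", "decoded_shape", "decoded_nbytes",
                "scale_encoding", "scale_path", "checksum",
                "decode_location"}
    missing = required - e.keys()
    if missing:
        raise ValueError(f"manifest entry missing fields: {sorted(missing)}")
    if e["decode_location"] not in ALLOWED_DECODE_LOCATIONS:
        raise ValueError(f"unsupported decode_location: {e['decode_location']}")

validate_entry(entry)  # passes for the entry above
```

Rejecting any `decode_location` outside the allowlist, rather than downgrading it, is what keeps the runtime claim narrow.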
Safety and materialization implications¶
Current safety rules are intentionally strict:
- unsafe model artifacts are rejected
- unsafe adapter artifacts are rejected
- `gds_export` rejects torch-serialized and pickle-backed files
- `gds_export` also rejects packed `mxfp4` artifacts
Any future transport format must preserve that bar.
Required safety rules for a future transport implementation:
- transport manifests must remain inside the validated export root
- transport blobs must use explicit allowed extensions
- transport manifests must carry stable encoding identifiers and checksums
- unsafe executable, pickle-backed, or torch-serialized transport blobs remain forbidden
- invalid, missing, or stale sidecar transport artifacts must invalidate the transport path instead of being used opportunistically
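A minimal sketch of enforcing the rules above, assuming a hypothetical extension allowlist and helper name (neither is oLLM's actual validation code):

```python
# Hedged sketch of the transport safety rules listed above; the
# extension allowlist and function name are illustrative assumptions.
import hashlib
from pathlib import Path

ALLOWED_TRANSPORT_EXTENSIONS = {".bin", ".json"}  # assumed allowlist

def validate_transport_blob(export_root: Path, blob: Path,
                            expected_sha256: str) -> None:
    resolved = blob.resolve()
    root = export_root.resolve()
    # Rule: transport artifacts must stay inside the validated export root.
    if root not in resolved.parents:
        raise ValueError(f"{blob} escapes the export root")
    # Rule: only explicit, non-executable extensions are allowed.
    if resolved.suffix not in ALLOWED_TRANSPORT_EXTENSIONS:
        raise ValueError(f"forbidden transport extension: {resolved.suffix}")
    # Rule: a checksum mismatch invalidates the transport path outright,
    # rather than letting a stale sidecar be used opportunistically.
    digest = hashlib.sha256(resolved.read_bytes()).hexdigest()
    if digest != expected_sha256:
        raise ValueError("checksum mismatch: transport path invalidated")
```

Each failure raises instead of returning a degraded result, so a caller cannot accidentally treat an invalid sidecar as usable.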
Materialization implications:
- managed downloads must know how to fetch transport manifests and blobs
- completeness checks must validate both base model artifacts and transport sidecars
- the absence of valid transport artifacts must fall back to today's safe safetensor or raw `gds_export` paths
Benchmark and inspection truth¶
Current benchmark surfaces already show native IO events such as:
- `gds_read`
- `safetensor_read`
- `safetensor_pread`
- `offloaded_cpu_to_cuda`
That is not enough for transport claims.
Before any transport optimization claim, the future implementation must expose:
- transport encoding used
- encoded bytes read
- decoded bytes produced
- decode or upcast time
- transport compression ratio
- scratch memory used for decode or upcast
- fallback reason when transport was requested but not used
Without those fields, oLLM would risk claiming faster or smaller transport while hiding where the cost really moved.
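One way to make those fields concrete is a plain per-run report record. The field names mirror the list above, but the record shape is an assumption for illustration, not an implemented reporting schema:

```python
# Sketch of the transport-specific report fields listed above; the
# dataclass layout is an assumption, only the field meanings come
# from the ADR text.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TransportReport:
    encoding: Optional[str]         # transport encoding used; None on fallback
    encoded_bytes_read: int         # encoded bytes read from storage
    decoded_bytes_produced: int     # bytes after decode or upcast
    decode_time_s: float            # decode or upcast time
    scratch_bytes: int              # scratch memory used for decode/upcast
    fallback_reason: Optional[str]  # why transport was requested but unused

    @property
    def compression_ratio(self) -> float:
        # Decoded-vs-encoded ratio; 2.0 means half the bytes moved.
        if self.encoded_bytes_read == 0:
            return 1.0
        return self.decoded_bytes_produced / self.encoded_bytes_read

report = TransportReport("fp8-staged", 1_000, 2_000,
                         decode_time_s=0.004, scratch_bytes=2_000,
                         fallback_reason=None)
print(report.compression_ratio)  # 2.0
```

Keeping `decode_time_s` and `scratch_bytes` next to the byte counts is what lets a benchmark show where the cost moved, not just that bytes shrank.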
Failure and fallback behavior¶
The ADR makes these rules explicit:
- unsupported backend or device combinations must fall back to the current non-transport path
- fallback must be observable in runtime planning or benchmark output
- invalid transport manifests or blobs must reject the transport path, not be silently tolerated
- the runtime must never claim FP8-staged transport if it actually ran on raw safetensor or raw `gds_export` bytes
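The rules above amount to a small planning decision that always returns an observable reason when transport is skipped. The function name and the capability checks here are illustrative placeholders, not the runtime's real probes:

```python
# Hedged sketch of the fallback rules above; device and manifest checks
# are illustrative stand-ins for the runtime's real capability probes.
def plan_weight_transport(device: str, manifest_valid: bool):
    """Return (path_name, fallback_reason); reason is None only when
    the transport path actually runs."""
    if device != "cuda":
        # Unsupported device: fall back and say why.
        return "raw", f"transport unsupported on device {device!r}"
    if not manifest_valid:
        # Invalid manifest: reject the transport path, never tolerate it.
        return "raw", "invalid or missing transport manifest"
    return "fp8-staged", None

path, reason = plan_weight_transport("cpu", manifest_valid=True)
print(path, reason)
```

Because the reason travels with the chosen path, benchmark output can never claim FP8-staged transport for a run that actually used raw bytes.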
Recommended phased implementation order¶
Phase 1¶
Add transport-specific benchmark and inspection contract fields.
Recommended first slice:
- extend native runtime reporting with encoded-bytes and decode-time fields
- make fallback reasons explicit
- keep runtime behavior otherwise unchanged
Phase 2¶
Define one explicit transport manifest and validation contract for gds_export.
Recommended first slice:
- one safe manifest schema
- one allowed transport encoding family
- one materialization completeness rule set
Phase 3¶
Implement CUDA-only FP8-staged transport on GPT-OSS gds_export.
Recommended first slice:
- one model family
- one device class
- explicit upcast before execution
- benchmark proof against current-fork and upstream concept baselines
Phase 4¶
Evaluate expansion to other streamed native families.
Do this only if:
- the new metrics show real end-to-end wins
- the quality or output delta is acceptable
- the transport contract still fits without pretending safetensors natively carry transport metadata
Phase 5¶
Revisit lossless compression only if the repo gains one of:
- a seekable compressed-block reader contract
- an explicit GPU-side decompression path with named runtime support
Until then, generic compressed transport remains a later research option, not the recommended first move.