This document describes the gpt_rs::runtime layer: how checkpoints are loaded into a dynamic model
handle, how runtime functional overrides are applied, and how the CLI calls into models without
hardcoding model kinds.
Source of truth: crates/gpt-rs/src/runtime/mod.rs and crates/gpt-rs-cli/src/main.rs.
The canonical loader is:
gpt_rs::runtime::load_model(backend, checkpoint_path) -> Box<dyn LoadedModel<B>>
The loader:
- Opens a self-describing checkpoint (GPTRSCHK) and reads ModelConfig plus the tensor index.
- Builds a FunctionalRegistry<B> from ModelConfig.runtime.functional_overrides.
- Creates a checkpoint-backed ParamSource<B> for random-access parameter loads.
- Builds a get(name) closure that returns DeviceTensor::lazy_param(...) for each parameter.
- Selects a model factory by ModelConfig.kind and constructs the model using get(name).
- Wraps the model in ModelHandle<B>, which ensures the functional registry is installed for every call.
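For orientation, a minimal call-site sketch. Only load_model, LoadedModel, and ModelInput are named by this doc; the Result return type, the module paths, PortableBackend::new(), ModelInput::tokens(...), and the checkpoint filename are assumptions for illustration:

```rust
use gpt_rs::runtime::{load_model, ModelInput};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical backend constructor; PortableBackend itself is real
    // (see param_resolver below), but this module path is a guess.
    let backend = gpt_rs::backends::PortableBackend::new();

    // All six loader steps above happen here; the caller never names a
    // model kind -- it comes from ModelConfig.kind in the checkpoint.
    let model = load_model(&backend, "checkpoints/tiny.gptrs")?;
    println!("loaded: {}", model.kind());

    // ModelInput::tokens is an assumed helper for building token input.
    let _out = model.forward(ModelInput::tokens(&[1, 2, 3]));
    Ok(())
}
```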
Models are exposed through a small dynamic trait:
- LoadedModel<B>: kind(), forward(ModelInput) -> ModelOutput
- Optional capabilities are exposed via trait methods returning Option<...>: as_causal_lm() -> Option<&dyn CausalLanguageModel<B>>
This is what makes the CLI generic: it asks the model for a capability (e.g. "causal LM") instead of switching on an enum of model kinds.
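A sketch of that trait shape, using the method names from this doc. ModelInput and ModelOutput are placeholders for the real types in runtime/mod.rs, the CausalLanguageModel method shown is purely illustrative, and the actual generics may differ:

```rust
pub struct ModelInput;  // placeholder for the real input type
pub struct ModelOutput; // placeholder for the real output type

pub trait CausalLanguageModel<B> {
    fn next_token_logits(&self, tokens: &[u32]) -> Vec<f32>; // illustrative method
}

pub trait LoadedModel<B> {
    fn kind(&self) -> &str;
    fn forward(&self, input: ModelInput) -> ModelOutput;

    // Capability probe: the default opts out, so only models that
    // really are causal LMs need to override this.
    fn as_causal_lm(&self) -> Option<&dyn CausalLanguageModel<B>> {
        None
    }
}
```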
See:
- crates/gpt-rs/src/runtime/mod.rs (LoadedModel, ModelInput, ModelOutput, ModelHandle)
- crates/gpt-rs/src/inference/mod.rs (CausalLanguageModel)
Most functionals are dispatched through a FunctionalRegistry<B> (portable baseline by default).
runtime::load_model builds a registry using ModelConfig.runtime.functional_overrides and wraps the
inner model with ModelHandle<B>. ModelHandle calls ops::functional::with_registry(...) around:
- LoadedModel::forward
- CausalLanguageModel methods, when the model is used as a causal LM (see the sketch below)
This means:
- model and layer code stays backend-agnostic (it calls portable functionals)
- runtime decides which implementations are active for a given model instance
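A sketch of how ModelHandle<B> might install the registry around each call, reusing the trait shape sketched earlier. The struct fields and the exact signature of with_registry are assumptions; only the ops::functional::with_registry name comes from this doc:

```rust
pub struct ModelHandle<B> {
    inner: Box<dyn LoadedModel<B>>,
    registry: FunctionalRegistry<B>,
}

impl<B> LoadedModel<B> for ModelHandle<B> {
    fn kind(&self) -> &str {
        self.inner.kind()
    }

    fn forward(&self, input: ModelInput) -> ModelOutput {
        // Every call runs with this instance's overrides active; the
        // inner model just calls portable functionals and never sees
        // which implementation was selected.
        ops::functional::with_registry(&self.registry, || self.inner.forward(input))
    }
}
```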
Parameter identity is split into two layers (see crates/gpt-rs/src/params.rs):
- BaseParamId (u128): deterministic hash of the parameter name (stable across runs).
- ModelNamespaceId (u128): runtime-assigned namespace per loaded model instance.
- ParamKey (u128): stable key derived from (namespace, base_id), used for caching/resolvers.
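An illustrative-only derivation of the two derived layers. The real hashing and mixing live in crates/gpt-rs/src/params.rs and are almost certainly different; this sketch (using FNV-1a as a stand-in hash) only demonstrates the stability properties described above:

```rust
fn base_param_id(name: &str) -> u128 {
    // Deterministic hash of the name: same name, same id, across runs.
    name.bytes().fold(0x6c62_272e_07bb_0142_62b8_2175_6295_c58d_u128, |h, b| {
        (h ^ b as u128).wrapping_mul(0x0000_0000_0100_0000_0000_0000_0000_013b)
    })
}

fn param_key(namespace: u128, base_id: u128) -> u128 {
    // Per-instance key: loading the same checkpoint twice yields two
    // namespaces, hence distinct ParamKeys, while base_id stays shared.
    namespace ^ base_id.rotate_left(64)
}
```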
load_model picks a fresh namespace (next_namespace()), then for each parameter name in the checkpoint index:
- computes a ParamKey for the model instance
- creates DeviceTensor::lazy_param(backend, shape, dtype, stable_id=ParamKey, base_id, source, ...)
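A fragment sketching that per-parameter closure. Only lazy_param's argument list comes from this doc; index, source, and the helper constructors (lookup, from_name, ParamKey::new) are assumed names:

```rust
let namespace = next_namespace();
let get = move |name: &str| {
    let entry = index.lookup(name).expect("name present in checkpoint index");
    let base_id = BaseParamId::from_name(name);
    let stable_id = ParamKey::new(namespace, base_id);
    // No bytes are read here: the tensor materializes through the
    // ParamSource only when it is first touched in a forward pass.
    DeviceTensor::lazy_param(backend.clone(), entry.shape, entry.dtype, stable_id, base_id, source.clone())
};
```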
The ParamSource<B> is checkpoint-backed and loads tensors by BaseParamId on demand. This keeps memory
usage proportional to the set of parameters actually touched (important for sparse models like MoE).
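A sketch of the trait shape this implies; the real ParamSource<B> definition (and its tensor/error types, both placeholders here) may differ:

```rust
pub trait ParamSource<B> {
    // Random-access load keyed by the run-stable BaseParamId. Only
    // parameters actually requested are read from the checkpoint, so
    // resident memory tracks the touched set (e.g. active MoE experts).
    fn load(&self, id: BaseParamId) -> Result<HostTensor, ParamLoadError>;
}
```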
Backends may provide a ParamResolver (PortableBackend::param_resolver) so derived parameter formats (packed
weights, layouts) can be memoized by stable id without changing model code.
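A sketch of the memoization a resolver enables. Only PortableBackend::param_resolver is named by this doc; PackedWeights, pack_for_backend, and this resolver shape are illustrative:

```rust
use std::collections::HashMap;

struct PackedResolver {
    cache: HashMap<u128, PackedWeights>, // keyed by the stable ParamKey
}

impl PackedResolver {
    fn resolve(&mut self, stable_id: u128, raw: &HostTensor) -> &PackedWeights {
        // Derived formats (packed weights, backend layouts) are computed
        // once per stable id and reused across forwards, with no change
        // to model code.
        self.cache.entry(stable_id).or_insert_with(|| pack_for_backend(raw))
    }
}
```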
gpt-rs-cli is capability-based:
- generate: requires model.as_causal_lm() and uses CausalLanguageModel (greedy/sampling + optional KV cache).
- forward: calls model.forward(...) for either token inputs or vision inputs.
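A sketch of the capability check behind generate; the error text and the greedy_decode helper are illustrative, not the CLI's actual code:

```rust
fn run_generate<B>(model: &dyn LoadedModel<B>, prompt: &[u32]) -> Result<Vec<u32>, String> {
    let lm = model
        .as_causal_lm()
        .ok_or_else(|| format!("model kind '{}' is not a causal LM", model.kind()))?;
    // From here the CLI talks only to CausalLanguageModel (greedy or
    // sampled decoding, optionally with a KV cache); there is no match
    // over model kinds anywhere.
    Ok(greedy_decode(lm, prompt))
}
```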
See: crates/gpt-rs-cli/src/main.rs (generate / forward subcommands).
After implementing your model and LoadedModel<B> impl, register it with the runtime factory list:
crates/gpt-rs/src/runtime/mod.rs: model_factories() / model_factory(kind)
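A sketch of what adding an entry might look like, assuming model_factories() returns (kind, constructor) pairs; the ModelFactory alias and MyModel are placeholders to check against runtime/mod.rs:

```rust
type ModelFactory<B> =
    fn(&ModelConfig, &dyn Fn(&str) -> DeviceTensor<B>) -> Box<dyn LoadedModel<B>>;

fn model_factories<B>() -> Vec<(&'static str, ModelFactory<B>)> {
    vec![
        // ...existing kinds...
        // Register your kind so ModelConfig.kind can select it at load time.
        ("my-model", |cfg, get| Box::new(MyModel::build(cfg, get))),
    ]
}
```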
For a full checklist (model + layer + functional + backend), see docs/howto.md.