Context
We're integrating the RKLLM runtime into distributed inference frameworks (specifically exo) to enable RK3588 devices as nodes in heterogeneous AI clusters alongside GPUs and Apple Silicon.
Problem
The current .rkllm model format is compiled as a monolith — the entire model must be loaded on a single device. This prevents pipeline-parallel distributed inference where different layers of a model run on different devices.
GPU-based inference engines (MLX, tinygrad, vLLM) can load arbitrary layer ranges from a model, which is what enables splitting a 70B model across multiple consumer devices. The RKLLM runtime cannot do this today.
Request
Add support for partial layer loading in the RKLLM runtime, so that a compiled model can be loaded with a specific layer range (e.g. layers 0-15 of a 32-layer model). This would enable:
- Pipeline parallelism across multiple RK3588 boards — split a model too large for one board's 16GB RAM across 2-3 boards
- Heterogeneous clusters — an RK3588 handles the first N layers while a GPU handles the rest
- Per-core layer assignment — the 3 NPU cores on RK3588 could each handle different layer ranges of the same model
Current workaround
We use the RK3588 for task-parallel workloads (embeddings, reranking, small chat models) via our own scheduler, and reserve pipeline-parallel distributed inference for GPU/Apple Silicon nodes via exo. This works but means RK3588 can't participate in large-model inference at all.
Proposed API
Something like:

```c
// At model init time, specify which layers this instance handles
rkllm_param.start_layer  = 0;
rkllm_param.end_layer    = 15;
rkllm_param.total_layers = 32;
rkllm_init(&ctx, &rkllm_param);
```

Or, at compile time, produce a layer-range-aware .rkllm file:

```shell
rknn_convert --model qwen2.5-3b --layers 0-15 --output qwen2.5-3b-layers-0-15.rkllm
```
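Under the compile-time variant, producing a full pipeline's worth of shards would be one loop per layer range. A sketch of that workflow; the `--layers` flag is the feature being requested here and does not exist in `rknn_convert` today, so the commands are echoed rather than executed:

```shell
# Proposed workflow only: --layers is hypothetical, so we echo the
# commands instead of running them. One shard per pipeline stage.
for range in 0-10 11-21 22-31; do
  echo "rknn_convert --model qwen2.5-3b --layers $range --output qwen2.5-3b-layers-$range.rkllm"
done
```

Each board in the cluster would then load only its own shard file, keeping per-board memory within the 16GB budget.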
Hardware
Tested on Orange Pi 5 Plus (RK3588, 16GB RAM, librknnrt 2.3.0).
Related