Feature request: partial layer loading for distributed inference on RK3588 #489

@jaylfc

Context

We're integrating the RKLLM runtime into distributed inference frameworks (specifically exo) to enable RK3588 devices as nodes in heterogeneous AI clusters alongside GPUs and Apple Silicon.

Problem

The current .rkllm model format is compiled as a monolith: the entire model must be loaded on a single device. This prevents pipeline-parallel distributed inference, where different layers of a model run on different devices.

Inference backends for GPUs and Apple Silicon (MLX, tinygrad, vLLM) can load arbitrary layer ranges from a model, which is what makes it possible to split a 70B model across multiple consumer devices. The RKLLM runtime cannot do this today.

Request

Add support for partial layer loading in the RKLLM runtime, so that a compiled model can be loaded with a specific layer range (e.g. layers 0-15 of a 32-layer model). This would enable:

  1. Pipeline parallelism across multiple RK3588 boards: split a model too large for one board's 16 GB of RAM across 2-3 boards
  2. Heterogeneous clusters: an RK3588 handles the first N layers while a GPU handles the rest
  3. Per-core layer assignment: each of the 3 NPU cores on the RK3588 could handle a different layer range of the same model
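
To make the splitting concrete, here is a minimal sketch of how a scheduler might divide a model's layers into contiguous ranges, one per node, sized in proportion to each node's memory. Everything in it is an assumption for illustration: `LayerRange` and `partition_layers` are hypothetical names, not part of RKLLM or exo, and real placement would also need to account for KV cache and activation memory.

```c
#include <stddef.h>

/* Inclusive layer range assigned to one node (hypothetical type). */
typedef struct {
    int start_layer;
    int end_layer;
} LayerRange;

/*
 * Split total_layers into contiguous inclusive ranges, one per node,
 * sized roughly in proportion to each node's memory weight (e.g. GB of RAM).
 * Every node receives at least one layer.
 * Returns 0 on success, -1 on invalid input.
 */
int partition_layers(int total_layers, const unsigned *mem_weights,
                     size_t n_nodes, LayerRange *out)
{
    if (total_layers <= 0 || n_nodes == 0 || mem_weights == NULL ||
        out == NULL || (size_t)total_layers < n_nodes) {
        return -1;
    }

    unsigned long total_w = 0;
    for (size_t i = 0; i < n_nodes; i++) {
        if (mem_weights[i] == 0) {
            return -1;
        }
        total_w += mem_weights[i];
    }

    int next = 0;            /* first layer not yet assigned */
    unsigned long cum_w = 0; /* cumulative weight up to and including node i */
    for (size_t i = 0; i < n_nodes; i++) {
        cum_w += mem_weights[i];

        /* Exclusive end boundary for node i, rounded to the nearest layer. */
        int end = (int)((2UL * (unsigned long)total_layers * cum_w + total_w)
                        / (2UL * total_w));

        /* Clamp so this node gets >= 1 layer and later nodes can too. */
        int min_end = next + 1;
        int max_end = total_layers - (int)(n_nodes - 1 - i);
        if (end < min_end) end = min_end;
        if (end > max_end) end = max_end;

        out[i].start_layer = next;
        out[i].end_layer = end - 1;
        next = end;
    }
    return 0;
}
```

With two equally sized boards and a 32-layer model, this yields the 0-15 / 16-31 split used as the example in this request. Unequal weights (say, a 16 GB board next to a 24 GB GPU) shift the boundary toward the larger node.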

Current workaround

We use the RK3588 for task-parallel workloads (embeddings, reranking, small chat models) via our own scheduler, and reserve pipeline-parallel distributed inference for GPU/Apple Silicon nodes via exo. This works, but it means the RK3588 can't participate in large-model inference at all.

Proposed API

Something like:

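// At model init time, specify which layers this instance handles.
// start_layer, end_layer, and total_layers are proposed fields; they do not exist in RKLLMParam today.
rkllm_param.start_layer = 0;    // first layer held by this instance (inclusive)
rkllm_param.end_layer = 15;     // last layer held by this instance (inclusive)
rkllm_param.total_layers = 32;  // layer count of the full model
rkllm_init(&ctx, &rkllm_param, callback);  // callback: the result callback that rkllm_init already takes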

// Or at compile time, produce a layer-range-aware .rkllm file
rknn_convert --model qwen2.5-3b --layers 0-15 --output qwen2.5-3b-layers-0-15.rkllm
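
As a rough illustration of the semantics we have in mind for the three proposed fields, the sketch below validates an inclusive layer range and derives which pipeline stage an instance would be. `LayerRangeParam` and the helper functions are hypothetical and exist only for this example. It also assumes the usual pipeline-parallel convention for decoder-only transformers: the first stage owns the token embedding, the last stage owns the final norm and LM head, and intermediate stages exchange hidden states.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the three proposed fields; not part of RKLLMParam today. */
typedef struct {
    int start_layer;   /* first layer held by this instance (inclusive) */
    int end_layer;     /* last layer held by this instance (inclusive)  */
    int total_layers;  /* layer count of the full model                 */
} LayerRangeParam;

/* A range is valid when 0 <= start_layer <= end_layer < total_layers. */
bool layer_range_is_valid(const LayerRangeParam *p)
{
    return p != NULL &&
           p->total_layers > 0 &&
           p->start_layer >= 0 &&
           p->start_layer <= p->end_layer &&
           p->end_layer < p->total_layers;
}

/* Number of layers this instance would load. */
int layer_range_count(const LayerRangeParam *p)
{
    return p->end_layer - p->start_layer + 1;
}

/* The first stage takes token IDs as input, so it needs the embedding table. */
bool stage_needs_embedding(const LayerRangeParam *p)
{
    return p->start_layer == 0;
}

/* The last stage produces logits, so it needs the final norm and LM head. */
bool stage_needs_lm_head(const LayerRangeParam *p)
{
    return p->end_layer == p->total_layers - 1;
}
```

Under these assumptions, an instance configured with layers 0-15 of 32 would take token IDs as input and emit hidden states, while the instance holding layers 16-31 would take hidden states and emit logits. Whether the runtime exposes hidden-state input and output in this way is an open design question for the maintainers.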

Hardware

Tested on Orange Pi 5 Plus (RK3588, 16GB RAM, librknnrt 2.3.0).
