Context
We're integrating the RKLLM runtime into distributed inference frameworks (specifically exo) to enable RK3588 devices as nodes in heterogeneous AI clusters alongside GPUs and Apple Silicon.
Problem
The current .rkllm model format is compiled as a monolith — the entire model must be loaded on a single device. This prevents pipeline-parallel distributed inference where different layers of a model run on different devices.
GPU-based inference engines (MLX, tinygrad, vLLM) can load arbitrary layer ranges from a model, which is what enables splitting a 70B model across multiple consumer devices. The RKLLM runtime cannot do this today.
Request
Add support for partial layer loading in the RKLLM runtime, so that a compiled model can be loaded with a specific layer range (e.g. layers 0-15 of a 32-layer model). This would enable:
- Pipeline parallelism across multiple RK3588 boards — split a model too large for one board's 16GB RAM across 2-3 boards
- Heterogeneous clusters — an RK3588 handles the first N layers while a GPU handles the rest
- Per-core layer assignment — the 3 NPU cores on RK3588 could each handle different layer ranges of the same model
Current workaround
We use the RK3588 for task-parallel workloads (embeddings, reranking, small chat models) via our own scheduler, and reserve pipeline-parallel distributed inference for GPU/Apple Silicon nodes via exo. This works but means RK3588 can't participate in large-model inference at all.
Proposed API
Something like:

```c
// At model init time, specify which layers this instance handles
rkllm_param.start_layer  = 0;
rkllm_param.end_layer    = 15;
rkllm_param.total_layers = 32;
rkllm_init(&ctx, &rkllm_param);
```

Or, at compile time, produce a layer-range-aware .rkllm file:

```shell
rknn_convert --model qwen2.5-3b --layers 0-15 --output qwen2.5-3b-layers-0-15.rkllm
```
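Under the compile-time variant, producing a full pipeline's worth of shards would be one loop per layer range. A sketch of that workflow; the `--layers` flag is the feature being requested here and does not exist in `rknn_convert` today, so the commands are echoed rather than executed:

```shell
# Proposed workflow only: --layers is hypothetical, so we echo the
# commands instead of running them. One shard per pipeline stage.
for range in 0-10 11-21 22-31; do
  echo "rknn_convert --model qwen2.5-3b --layers $range --output qwen2.5-3b-layers-$range.rkllm"
done
```

Each board in the cluster would then load only its own shard file, keeping per-board memory within the 16GB budget.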
Hardware
Tested on Orange Pi 5 Plus (RK3588, 16GB RAM, librknnrt 2.3.0).
Related