Commit cb5b341: release v1.2.0

Author: will.yang
Parent: 29a9fb9


44 files changed: +1249, −368 lines

CHANGELOG.md

Lines changed: 16 additions & 0 deletions

A new v1.2.0 section was added under the # CHANGELOG heading:

## v1.2.0

- Supports custom model conversion.
- Supports chat_template configuration.
- Enables multi-turn dialogue interactions.
- Implements automatic prompt cache reuse for improved inference efficiency.
- Expands maximum context length to 16K.
- Supports embedding flash storage to reduce memory usage.
- Introduces the GRQ Int4 quantization algorithm.
- Supports GPTQ-Int8 model conversion.
- Compatible with the RK3562 platform.
- Added support for visual multimodal models such as InternVL2, Janus, and Qwen2.5-VL.
- Supports CPU core configuration.
- Added support for Gemma3.
- Added support for Python 3.9/3.11/3.12.

## v1.1.0

- Support group-wise quantization (w4a16 group sizes of 32/64/128, w8a8 group sizes of 128/256/512).
- Support joint inference with LoRA model loading.
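The group-wise quantization entry above (w4a16 with group sizes of 32/64/128) can be illustrated with a small self-contained Python sketch. This shows only the generic per-group scaling idea, not RKLLM's actual implementation:

```python
# Illustrative sketch of group-wise 4-bit weight quantization (one scale per
# group of 32/64/128 weights). Generic example only, not the RKLLM code.

def quantize_w4_groupwise(weights, group_size=32):
    """Quantize a flat list of float weights to int4 ([-8, 7]) per group."""
    assert len(weights) % group_size == 0, "pad weights to a multiple of group_size"
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / 7.0 or 1.0  # avoid a zero scale
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize(q, scales, group_size=32):
    """Reconstruct approximate floats: each int4 value times its group's scale."""
    return [q[i] * scales[i // group_size] for i in range(len(q))]

weights = [0.05 * ((i * 7) % 11 - 5) for i in range(64)]
q, scales = quantize_w4_groupwise(weights, group_size=32)
recon = dequantize(q, scales, group_size=32)
print("groups:", len(scales), "max error:", max(abs(a - b) for a, b in zip(weights, recon)))
```

Smaller groups track local weight ranges more closely at the cost of storing more scales, which is the trade-off behind the different `_g32`/`_g128`/`_g512` dtype variants.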

README.md

Lines changed: 84 additions & 56 deletions

Supported platforms (RK3562 Series added):

- RK3588 Series
- RK3576 Series
- RK3562 Series

# Support Models

The model list now splits the generic "Gemma models" entry into Gemma2 and Gemma3, names the exact Qwen2-VL and MiniCPM-V checkpoints, and adds three visual multimodal models:

- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b/tree/103caa40027ebfd8450289ca2f278eac4ff26405)
- [x] [Gemma2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)
- [x] [Gemma3](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d)
- [x] [InternLM2 models](https://huggingface.co/collections/internlm/internlm2-65b0ce04970888799707893c)
- [x] [MiniCPM models](https://huggingface.co/collections/openbmb/minicpm-65d48bf958302b9fd25b698f)
- [x] [TeleChat models](https://huggingface.co/Tele-AI)
- [x] [Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)
- [x] [MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6)
- [x] [DeepSeek-R1-Distill](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d)
- [x] [Janus-Pro-1B](https://huggingface.co/deepseek-ai/Janus-Pro-1B)
- [x] [InternVL2-1B](https://huggingface.co/OpenGVLab/InternVL2-1B)
- [x] [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)

# Model Performance Benchmark

The LLM benchmark table was rebuilt with the platform column moved next to the model name, and rows for Qwen2-0.5B (including RK3562 results) were added:

| llm model      | platform | dtype      | seqlen | max_context | new_tokens | TTFT(ms) | Tokens/s | memory(G) |
| :------------- | :------: | :--------- | :----: | :---------: | :--------: | :------: | :------: | :-------: |
| Qwen2-0.5B     | RK3562   | w4a16_g128 | 64     | 320         | 256        | 524      | 5.67     | 0.39      |
|                | RK3562   | w4a8_g32   | 64     | 320         | 256        | 873      | 12.00    | 0.48      |
|                | RK3562   | w8a8       | 64     | 320         | 256        | 477      | 11.50    | 0.61      |
|                | RK3576   | w4a16      | 64     | 320         | 256        | 204      | 34.50    | 0.40      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 212      | 32.40    | 0.40      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 79       | 41.50    | 0.62      |
|                | RK3588   | w8a8_g128  | 64     | 320         | 256        | 183      | 25.07    | 0.75      |
| TinyLLAMA-1.1B | RK3576   | w4a16      | 64     | 320         | 256        | 345      | 21.10    | 0.77      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 410      | 18.50    | 0.80      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 140      | 24.21    | 1.25      |
|                | RK3588   | w8a8_g512  | 64     | 320         | 256        | 195      | 20.08    | 1.29      |
| Qwen2-1.5B     | RK3576   | w4a16      | 64     | 320         | 256        | 512      | 14.40    | 1.75      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 550      | 12.75    | 1.76      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 206      | 16.46    | 2.47      |
|                | RK3588   | w8a8_g128  | 64     | 320         | 256        | 725      | 7.00     | 2.65      |
| Phi-3-3.8B     | RK3576   | w4a16      | 64     | 320         | 256        | 975      | 6.60     | 2.16      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 1180     | 5.85     | 2.23      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 516      | 7.44     | 3.88      |
|                | RK3588   | w8a8_g512  | 64     | 320         | 256        | 610      | 6.13     | 3.95      |
| ChatGLM3-6B    | RK3576   | w4a16      | 64     | 320         | 256        | 1168     | 4.62     | 3.86      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 1583     | 3.82     | 3.96      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 800      | 4.95     | 6.69      |
|                | RK3588   | w8a8_g128  | 64     | 320         | 256        | 2190     | 2.70     | 7.18      |
| Gemma2-2B      | RK3576   | w4a16      | 64     | 320         | 256        | 628      | 8.00     | 3.63      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 776      | 7.40     | 3.63      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 342      | 9.67     | 4.84      |
|                | RK3588   | w8a8_g128  | 64     | 320         | 256        | 1055     | 5.49     | 5.14      |
| InternLM2-1.8B | RK3576   | w4a16      | 64     | 320         | 256        | 475      | 13.30    | 1.59      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 572      | 11.95    | 1.62      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 206      | 15.66    | 2.38      |
|                | RK3588   | w8a8_g512  | 64     | 320         | 256        | 298      | 12.66    | 2.45      |
| MiniCPM3-4B    | RK3576   | w4a16      | 64     | 320         | 256        | 1397     | 4.80     | 2.70      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 1645     | 4.39     | 2.80      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 702      | 6.15     | 4.65      |
|                | RK3588   | w8a8_g128  | 64     | 320         | 256        | 1691     | 3.42     | 5.06      |
| llama3-8B      | RK3576   | w4a16      | 64     | 320         | 256        | 1608     | 3.60     | 5.63      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 2010     | 3.00     | 5.76      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 1128     | 3.79     | 9.21      |
|                | RK3588   | w8a8_g512  | 64     | 320         | 256        | 1281     | 3.05     | 9.45      |

| multimodal model | image input size | vision model dtype | vision infer time(s) | vision memory(MB) | llm model dtype | seqlen | max_context | new_tokens | TTFT(ms) | Tokens/s | llm memory(G) | platform |
| :--------------- | :--------------- | :----------------: | :------------------: | :---------------: | :-------------: | :----: | :---------: | :--------: | :------: | :------: | :-----------: | :------: |
| MiniCPM-V-2_6    | (1, 3, 448, 448) | fp16               | 2.40                 | 1031.30           | w4a16           | 128    | 256         | 128        | 2997.70  | 3.84     | 5.50          | RK3576   |
|                  |                  | fp16               | 3.27                 | 976.98            | w8a8            | 128    | 256         | 128        | 1720.60  | 4.13     | 8.88          | RK3588   |

- This performance data was collected with the CPU and NPU of each platform running at their maximum frequencies.
- The script for setting the frequencies is located in the scripts directory.
- The vision models were tested on all NPU cores with rknn-toolkit2 version 2.2.0.
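As a reading aid for the benchmark tables above, total generation latency can be roughly estimated from TTFT and the decode rate. A hedged sketch (it assumes the reported Tokens/s is a steady decode rate for every token after the first, which is only an approximation):

```python
# Rough end-to-end latency estimate from the benchmark columns
# (assumption: Tokens/s is the steady decode rate after the first token).

def estimated_total_ms(ttft_ms, tokens_per_s, new_tokens):
    """TTFT covers prefill plus the first token; the rest stream at the decode rate."""
    decode_ms = (new_tokens - 1) / tokens_per_s * 1000.0
    return ttft_ms + decode_ms

# Example using the Qwen2-1.5B / RK3576 / w4a16 row: TTFT 512 ms, 14.40 tokens/s, 256 tokens
total = estimated_total_ms(512, 14.40, 256)
print(f"estimated total: {total / 1000.0:.1f} s")
```

For that row the estimate comes out to roughly 18 seconds, dominated by decode time rather than TTFT.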

# Performance Testing Methods

1. Run the frequency-setting script from the `scripts` directory on the target platform.
2. Execute `export RKLLM_LOG_LEVEL=1` on the device to log model inference performance and memory usage.
3. Use the `eval_perf_watch_cpu.sh` script to measure CPU utilization.
4. Use the `eval_perf_watch_npu.sh` script to measure NPU utilization.
# Download

1. You can download the **latest package** from [RKLLM_SDK](https://console.zbox.filez.com/l/RJJDmB), fetch code: rkllm
Demo examples (a fourth demo was added):

1. Multimodal deployment demo: [Qwen2-VL-2B_Demo](https://github.com/airockchip/rknn-llm/tree/main/examples/Qwen2-VL-2B_Demo)
2. API usage demo: [DeepSeek-R1-Distill-Qwen-1.5B_Demo](https://github.com/airockchip/rknn-llm/tree/main/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo)
3. API server demo: [rkllm_server_demo](https://github.com/airockchip/rknn-llm/tree/main/examples/rkllm_server_demo)
4. Multimodal interactive dialogue demo: [Multimodal_Interactive_Dialogue_Demo](https://github.com/airockchip/rknn-llm/tree/main/examples/Multimodal_Interactive_Dialogue_Demo)

# Note

- The supported Python versions are:
  - Python 3.8
  - Python 3.9
  - Python 3.10
  - Python 3.11
  - Python 3.12

**Note: Before installing the package in a Python 3.12 environment, please run:**

```
export BUILD_CUDA_EXT=0
```

- On some platforms, you may encounter an error indicating that **libomp.so** cannot be found. To resolve this, locate the library in the corresponding cross-compilation toolchain and place it in the board's lib directory, at the same level as librkllmrt.so.
- Latest version: [v1.2.0](https://github.com/airockchip/rknn-llm/releases/tag/release-v1.2.0)

(The earlier note that v1.1 models are incompatible with older versions was removed in this commit.)

# RKNN Toolkit2

https://github.com/airockchip/rknn-toolkit2

# CHANGELOG

The v1.1.0 list in the README was replaced with the v1.2.0 list:

## v1.2.0

- Supports custom model conversion.
- Supports chat_template configuration.
- Enables multi-turn dialogue interactions.
- Implements automatic prompt cache reuse for improved inference efficiency.
- Expands maximum context length to 16K.
- Supports embedding flash storage to reduce memory usage.
- Introduces the GRQ Int4 quantization algorithm.
- Supports GPTQ-Int8 model conversion.
- Compatible with the RK3562 platform.
- Added support for visual multimodal models such as InternVL2, Janus, and Qwen2.5-VL.
- Supports CPU core configuration.
- Added support for Gemma3.
- Added support for Python 3.9/3.11/3.12.

For older versions, please refer to [CHANGELOG](CHANGELOG.md).
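The automatic prompt cache reuse item in the changelog above can be pictured with a short Python sketch. This illustrates only the general prefix-matching idea, not RKLLM's internals:

```python
# Illustrative sketch of automatic prompt-cache reuse (not the RKLLM internals):
# when a new prompt shares a token prefix with a cached one, only the suffix
# has to be prefilled, which is where the efficiency gain comes from.

def reusable_prefix_len(cached_tokens, new_tokens):
    """Length of the longest shared leading run of token ids."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [101, 7592, 2088, 102]       # hypothetical token ids from a prior turn
prompt = [101, 7592, 2088, 999, 555]  # new request sharing a three-token prefix
reuse = reusable_prefix_len(cached, prompt)
print(f"reuse {reuse} cached positions, prefill {len(prompt) - reuse} new tokens")
# prints: reuse 3 cached positions, prefill 2 new tokens
```

In multi-turn chat the shared prefix (system prompt plus earlier turns) is typically long, so skipping its prefill directly reduces TTFT.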
Binary files changed (−1.46 MB, 3.24 MB, −1.28 MB, 3.89 MB); contents not shown.

examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/Readme.md

Lines changed: 10 additions & 4 deletions

## 1. Requirements

The requirements were bumped for the 1.2.0 release:

```
rkllm-toolkit==1.2.0
rkllm-runtime==1.2.0
python >= 3.8
```

In the deploy steps, a fixed-frequency script is now pushed to the device:

```
cd deploy
adb push install/demo_Linux_aarch64 /data
# push model file to device
adb push DeepSeek-R1-Distill-Qwen-1.5B.rkllm /data/demo_Linux_aarch64
# push the appropriate fixed-frequency script to the device
adb push ../../../scripts/fix_freq_rk3588.sh /data/demo_Linux_aarch64
```

### 2. Run Demo

The demo is now run after fixing frequencies and enabling performance logging (the `taskset f0` prefix on the invocation was removed):

```
adb shell
cd /data/demo_Linux_aarch64
# export lib path
export LD_LIBRARY_PATH=./lib
# Execute the fixed-frequency script
sh fix_freq_rk3588.sh
# Set the logging level for performance analysis
export RKLLM_LOG_LEVEL=1
./llm_demo /path/to/your/rkllm/model 2048 4096

# Running result
rkllm init start
```

examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo/deploy/src/llm_demo.cpp

Lines changed: 21 additions & 9 deletions

The hard-coded DeepSeek prompt markers were removed in favor of the runtime chat template, flash-stored embeddings and single-turn mode were enabled, and a `clear` command was added to reset the KV cache (Chinese comments translated; indentation reconstructed from context):

```diff
@@ -21,8 +21,6 @@
 #include <csignal>
 #include <vector>
 
-#define PROMPT_TEXT_PREFIX "<|begin▁of▁sentence|><|User|>"
-#define PROMPT_TEXT_POSTFIX "<|Assistant|>"
 
 using namespace std;
 LLMHandle llmHandle = nullptr;
@@ -48,7 +46,7 @@ void callback(RKLLMResult *result, void *userdata, LLMCallState state)
         printf("\n");
     } else if (state == RKLLM_RUN_ERROR) {
         printf("\\run error\n");
-    } else if (state == RKLLM_RUN_GET_LAST_HIDDEN_LAYER) {
+    } else if (state == RKLLM_RUN_NORMAL) {
         /* When the GET_LAST_HIDDEN_LAYER feature is used, the callback returns the memory
            pointer last_hidden_layer, the token count num_tokens, and the hidden-layer size
            embd_size; the data in last_hidden_layer can be read through these three parameters.
@@ -66,7 +64,6 @@ void callback(RKLLMResult *result, void *userdata, LLMCallState state)
             std::cerr << "Failed to open the file for writing!" << std::endl;
         }
     }
-    } else if (state == RKLLM_RUN_NORMAL) {
         printf("%s", result->text);
     }
 }
@@ -97,6 +94,7 @@ int main(int argc, char **argv)
     param.max_context_len = std::atoi(argv[3]);
     param.skip_special_token = true;
     param.extend_param.base_domain_id = 0;
+    param.extend_param.embed_flash = 1;
 
     int ret = rkllm_init(&llmHandle, &param, callback);
     if (ret == 0){
@@ -118,7 +116,6 @@ int main(int argc, char **argv)
     cout << "\n*************************************************************************\n"
          << endl;
 
-    string text;
     RKLLMInput rkllm_input;
 
     // Initialize the infer parameter struct
@@ -158,7 +155,15 @@ int main(int argc, char **argv)
     // rkllm_load_prompt_cache(llmHandle, "./prompt_cache.bin"); // load the saved prompt cache
 
     rkllm_infer_params.mode = RKLLM_INFER_GENERATE;
-
+    // By default, the chat operates in single-turn mode (no context retention);
+    // 0 means no history is retained, each query is independent.
+    rkllm_infer_params.keep_history = 0;
+
+    // The model has a built-in chat template by default, which defines how prompts are
+    // formatted for conversation. Users can modify this template using this function to
+    // customize the system prompt, prefix, and postfix according to their needs.
+    rkllm_set_chat_template(llmHandle, "", "<|User|>", "<|Assistant|>");
+
     while (true)
     {
         std::string input_str;
@@ -169,6 +174,15 @@ int main(int argc, char **argv)
         {
             break;
         }
+        if (input_str == "clear")
+        {
+            ret = rkllm_clear_kv_cache(llmHandle, 1);
+            if (ret != 0)
+            {
+                printf("clear kv cache failed!\n");
+            }
+            continue;
+        }
         for (int i = 0; i < (int)pre_input.size(); i++)
         {
             if (input_str == to_string(i))
@@ -177,10 +191,8 @@ int main(int argc, char **argv)
             cout << input_str << endl;
         }
-        text = PROMPT_TEXT_PREFIX + input_str + PROMPT_TEXT_POSTFIX;
-        // text = input_str;
         rkllm_input.input_type = RKLLM_INPUT_PROMPT;
-        rkllm_input.prompt_input = (char *)text.c_str();
+        rkllm_input.prompt_input = (char *)input_str.c_str();
         printf("robot: ");
 
         // To use standard inference, set rkllm_infer_mode to RKLLM_INFER_GENERATE or leave it unset
```
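The `rkllm_set_chat_template(llmHandle, "", "<|User|>", "<|Assistant|>")` call in the demo above replaces the removed hard-coded prompt macros. A minimal Python sketch of what a (system, prefix, postfix) template does to user input; the exact concatenation inside the runtime is an assumption here:

```python
# Sketch of what a (system, prefix, postfix) chat template does to user input,
# mirroring the rkllm_set_chat_template(handle, "", "<|User|>", "<|Assistant|>")
# call in the demo. The runtime's exact concatenation is an assumption.

def apply_chat_template(user_text, system="", prefix="<|User|>", postfix="<|Assistant|>"):
    """Wrap raw user text with the configured system prompt, prefix, and postfix."""
    return f"{system}{prefix}{user_text}{postfix}"

print(apply_chat_template("Hello"))  # prints: <|User|>Hello<|Assistant|>
```

Because the template is applied inside the runtime, the demo can pass `input_str` directly as the prompt instead of wrapping it with `PROMPT_TEXT_PREFIX`/`PROMPT_TEXT_POSTFIX` in C++.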
