- RK3588 Series
- RK3576 Series
- RK3562 Series

# Support Models

- [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
- [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
- [x] [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b/tree/103caa40027ebfd8450289ca2f278eac4ff26405)
- [x] [Gemma2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)
- [x] [Gemma3](https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d)
- [x] [InternLM2 models](https://huggingface.co/collections/internlm/internlm2-65b0ce04970888799707893c)
- [x] [MiniCPM models](https://huggingface.co/collections/openbmb/minicpm-65d48bf958302b9fd25b698f)
- [x] [TeleChat models](https://huggingface.co/Tele-AI)
- [x] [Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)
- [x] [MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6)
- [x] [DeepSeek-R1-Distill](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d)
- [x] [Janus-Pro-1B](https://huggingface.co/deepseek-ai/Janus-Pro-1B)
- [x] [InternVL2-1B](https://huggingface.co/OpenGVLab/InternVL2-1B)
- [x] [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)

# Model Performance Benchmark
| llm model      | platform | dtype      | seqlen | max_context | new_tokens | TTFT(ms) | Tokens/s | memory(G) |
| :------------- | :------: | :--------- | :----: | :---------: | :--------: | :------: | :------: | :-------: |
| Qwen2-0.5B     | RK3562   | w4a16_g128 | 64     | 320         | 256        | 524      | 5.67     | 0.39      |
|                | RK3562   | w4a8_g32   | 64     | 320         | 256        | 873      | 12.00    | 0.48      |
|                | RK3562   | w8a8       | 64     | 320         | 256        | 477      | 11.50    | 0.61      |
|                | RK3576   | w4a16      | 64     | 320         | 256        | 204      | 34.50    | 0.40      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 212      | 32.40    | 0.40      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 79       | 41.50    | 0.62      |
|                | RK3588   | w8a8_g128  | 64     | 320         | 256        | 183      | 25.07    | 0.75      |
| TinyLLAMA-1.1B | RK3576   | w4a16      | 64     | 320         | 256        | 345      | 21.10    | 0.77      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 410      | 18.50    | 0.80      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 140      | 24.21    | 1.25      |
|                | RK3588   | w8a8_g512  | 64     | 320         | 256        | 195      | 20.08    | 1.29      |
| Qwen2-1.5B     | RK3576   | w4a16      | 64     | 320         | 256        | 512      | 14.40    | 1.75      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 550      | 12.75    | 1.76      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 206      | 16.46    | 2.47      |
|                | RK3588   | w8a8_g128  | 64     | 320         | 256        | 725      | 7.00     | 2.65      |
| Phi-3-3.8B     | RK3576   | w4a16      | 64     | 320         | 256        | 975      | 6.60     | 2.16      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 1180     | 5.85     | 2.23      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 516      | 7.44     | 3.88      |
|                | RK3588   | w8a8_g512  | 64     | 320         | 256        | 610      | 6.13     | 3.95      |
| ChatGLM3-6B    | RK3576   | w4a16      | 64     | 320         | 256        | 1168     | 4.62     | 3.86      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 1583     | 3.82     | 3.96      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 800      | 4.95     | 6.69      |
|                | RK3588   | w8a8_g128  | 64     | 320         | 256        | 2190     | 2.70     | 7.18      |
| Gemma2-2B      | RK3576   | w4a16      | 64     | 320         | 256        | 628      | 8.00     | 3.63      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 776      | 7.40     | 3.63      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 342      | 9.67     | 4.84      |
|                | RK3588   | w8a8_g128  | 64     | 320         | 256        | 1055     | 5.49     | 5.14      |
| InternLM2-1.8B | RK3576   | w4a16      | 64     | 320         | 256        | 475      | 13.30    | 1.59      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 572      | 11.95    | 1.62      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 206      | 15.66    | 2.38      |
|                | RK3588   | w8a8_g512  | 64     | 320         | 256        | 298      | 12.66    | 2.45      |
| MiniCPM3-4B    | RK3576   | w4a16      | 64     | 320         | 256        | 1397     | 4.80     | 2.70      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 1645     | 4.39     | 2.80      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 702      | 6.15     | 4.65      |
|                | RK3588   | w8a8_g128  | 64     | 320         | 256        | 1691     | 3.42     | 5.06      |
| llama3-8B      | RK3576   | w4a16      | 64     | 320         | 256        | 1608     | 3.60     | 5.63      |
|                | RK3576   | w4a16_g128 | 64     | 320         | 256        | 2010     | 3.00     | 5.76      |
|                | RK3588   | w8a8       | 64     | 320         | 256        | 1128     | 3.79     | 9.21      |
|                | RK3588   | w8a8_g512  | 64     | 320         | 256        | 1281     | 3.05     | 9.45      |

| multimodal model | image input size | vision model dtype | vision infer time(s) | vision memory(MB) | llm model dtype | seqlen | max_context | new_tokens | TTFT(ms) | Tokens/s | llm memory(G) | platform |
| :--------------- | :--------------- | :----------------: | :------------------: | :---------------: | :-------------: | :----: | :---------: | :--------: | :------: | :------: | :-----------: | :------: |
| MiniCPM-V-2_6    | (1, 3, 448, 448) | fp16               | 2.40                 | 1031.30           | w4a16           | 128    | 256         | 128        | 2997.70  | 3.84     | 5.50          | RK3576   |
|                  |                  | fp16               | 3.27                 | 976.98            | w8a8            | 128    | 256         | 128        | 1720.60  | 4.13     | 8.88          | RK3588   |
- This performance data was collected with the CPU and NPU of each platform running at their maximum frequencies.
- The script for setting the frequencies is located in the `scripts` directory.
- The vision models were tested using all NPU cores with rknn-toolkit2 version 2.2.0.
# Performance Testing Methods

1. Run the frequency-setting script from the `scripts` directory on the target platform.
2. Execute `export RKLLM_LOG_LEVEL=1` on the device to log model inference performance and memory usage.
3. Use the `eval_perf_watch_cpu.sh` script to measure CPU utilization.
4. Use the `eval_perf_watch_npu.sh` script to measure NPU utilization. A combined run is sketched below.

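For reference, the steps above can be chained in a single session on the board. This is only a sketch: `fix_freq_rk3588.sh`, the `llm_demo` binary, the model file name, and its arguments are placeholders for the actual frequency script, example binary, and converted model you are using.

```
# 1. Lock the CPU and NPU at their maximum frequencies (script name is a placeholder;
#    use the one matching your platform from the scripts directory).
sh ./scripts/fix_freq_rk3588.sh

# 2. Enable RKLLM performance and memory logging for this shell session.
export RKLLM_LOG_LEVEL=1

# 3. Run the example; with the log level set, TTFT, tokens/s, and memory usage are printed.
#    (binary name, model file, and arguments are placeholders)
./llm_demo ./qwen2_1.5b_w4a16.rkllm 320 256

# 4. In separate shells, sample CPU and NPU utilization while the model is running.
sh ./eval_perf_watch_cpu.sh
sh ./eval_perf_watch_npu.sh
```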
# Download

1. You can download the **latest package** from [RKLLM_SDK](https://console.zbox.filez.com/l/RJJDmB), fetch code: rkllm

1. Multimodal deployment demo: [Qwen2-VL-2B_Demo](https://github.com/airockchip/rknn-llm/tree/main/examples/Qwen2-VL-2B_Demo)
2. API usage demo: [DeepSeek-R1-Distill-Qwen-1.5B_Demo](https://github.com/airockchip/rknn-llm/tree/main/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo)
3. API server demo: [rkllm_server_demo](https://github.com/airockchip/rknn-llm/tree/main/examples/rkllm_server_demo)
4. Multimodal interactive dialogue demo: [Multimodal_Interactive_Dialogue_Demo](https://github.com/airockchip/rknn-llm/tree/main/examples/Multimodal_Interactive_Dialogue_Demo)
# Note

- The supported Python versions are:

  - Python 3.8
  - Python 3.9
  - Python 3.10
  - Python 3.11
  - Python 3.12

**Note: Before installing the package in a Python 3.12 environment, please run the following command first:**

```
export BUILD_CUDA_EXT=0
```
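For a Python 3.12 setup, the full sequence might look like the sketch below. The wheel filename is illustrative; use the rkllm-toolkit wheel shipped in the downloaded SDK package that matches your Python version, and conda works equally well in place of venv.

```
# Create and activate a Python 3.12 environment (venv shown here; conda also works).
python3.12 -m venv rkllm-py312
source rkllm-py312/bin/activate

# Required before installing the toolkit package under Python 3.12.
export BUILD_CUDA_EXT=0

# Install the conversion toolkit wheel from the downloaded SDK package
# (placeholder filename; pick the wheel matching your Python version).
pip install ./rkllm_toolkit-*.whl
```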
- On some platforms, you may encounter an error indicating that **libomp.so** cannot be found. To resolve this, locate the library in the corresponding cross-compilation toolchain and place it in the board's lib directory, at the same level as librkllmrt.so (see the sketch after this list).
- Latest version: <u>[v1.2.0](https://github.com/airockchip/rknn-llm/releases/tag/release-v1.2.0)</u>

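A minimal sketch of that libomp.so workaround; the toolchain path and the destination directory are placeholders and should be replaced with your actual cross-compilation toolchain and the directory where librkllmrt.so was deployed on the board.

```
# Find libomp.so inside the cross-compilation toolchain (path is a placeholder).
LIBOMP=$(find /path/to/your/toolchain -name libomp.so | head -n 1)

# Push it to the board, into the same directory as librkllmrt.so
# (the destination below is only an example).
adb push "$LIBOMP" /userdata/llm/lib/
```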
# RKNN Toolkit2

https://github.com/airockchip/rknn-toolkit2

# CHANGELOG
## v1.2.0

- Supports custom model conversion.
- Supports chat_template configuration.
- Enables multi-turn dialogue interactions.
- Implements automatic prompt cache reuse for improved inference efficiency.
- Expands the maximum context length to 16K.
- Supports embedding flash storage to reduce memory usage.
- Introduces the GRQ Int4 quantization algorithm.
- Supports GPTQ-Int8 model conversion.
- Compatible with the RK3562 platform.
- Adds support for visual multimodal models such as InternVL2, Janus, and Qwen2.5-VL.
- Supports CPU core configuration.
- Adds support for Gemma3.
- Adds support for Python 3.9/3.11/3.12.
For older versions, please refer to the [CHANGELOG](CHANGELOG.md).