- [x] [TeleChat models](https://huggingface.co/Tele-AI)
- [x] [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)
- [x] [MiniCPM-V](https://huggingface.co/openbmb/MiniCPM-V-2_6)
- [x] [DeepSeek-R1-Distill](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d)

# Model Performance Benchmark

| LLM model      | dtype      | seqlen | max_context | new_tokens | TTFT (ms) | Tokens/s | memory (GB) | platform |
| :------------- | :--------- | :----: | :---------: | :--------: | :-------: | :------: | :---------: | :------: |
| TinyLLAMA-1.1B | w4a16      |   64   |     320     |    256     |  345.00   |  21.10   |    0.77     |  RK3576  |
|                | w4a16_g128 |   64   |     320     |    256     |  410.00   |  18.50   |    0.80     |  RK3576  |
|                | w8a8       |   64   |     320     |    256     |  140.46   |  24.21   |    1.25     |  RK3588  |
|                | w8a8_g512  |   64   |     320     |    256     |  195.00   |  20.08   |    1.29     |  RK3588  |
| Qwen2-1.5B     | w4a16      |   64   |     320     |    256     |  512.00   |  14.40   |    1.75     |  RK3576  |
|                | w4a16_g128 |   64   |     320     |    256     |  550.00   |  12.75   |    1.76     |  RK3576  |
|                | w8a8       |   64   |     320     |    256     |  206.00   |  16.46   |    2.47     |  RK3588  |
|                | w8a8_g128  |   64   |     320     |    256     |  725.00   |   7.00   |    2.65     |  RK3588  |
| Phi-3-3.8B     | w4a16      |   64   |     320     |    256     |  975.00   |   6.60   |    2.16     |  RK3576  |
|                | w4a16_g128 |   64   |     320     |    256     |  1180.00  |   5.85   |    2.23     |  RK3576  |
|                | w8a8       |   64   |     320     |    256     |  516.00   |   7.44   |    3.88     |  RK3588  |
|                | w8a8_g512  |   64   |     320     |    256     |  610.00   |   6.13   |    3.95     |  RK3588  |
| ChatGLM3-6B    | w4a16      |   64   |     320     |    256     |  1168.00  |   4.62   |    3.86     |  RK3576  |
|                | w4a16_g128 |   64   |     320     |    256     |  1582.56  |   3.82   |    3.96     |  RK3576  |
|                | w8a8       |   64   |     320     |    256     |  800.00   |   4.95   |    6.69     |  RK3588  |
|                | w8a8_g128  |   64   |     320     |    256     |  2190.00  |   2.70   |    7.18     |  RK3588  |
| Gemma2-2B      | w4a16      |   64   |     320     |    256     |  628.00   |   8.00   |    3.63     |  RK3576  |
|                | w4a16_g128 |   64   |     320     |    256     |  776.20   |   7.40   |    3.63     |  RK3576  |
|                | w8a8       |   64   |     320     |    256     |  342.29   |   9.67   |    4.84     |  RK3588  |
|                | w8a8_g128  |   64   |     320     |    256     |  1055.00  |   5.49   |    5.14     |  RK3588  |
| InternLM2-1.8B | w4a16      |   64   |     320     |    256     |  475.00   |  13.30   |    1.59     |  RK3576  |
|                | w4a16_g128 |   64   |     320     |    256     |  572.00   |  11.95   |    1.62     |  RK3576  |
|                | w8a8       |   64   |     320     |    256     |  205.97   |  15.66   |    2.38     |  RK3588  |
|                | w8a8_g512  |   64   |     320     |    256     |  298.00   |  12.66   |    2.45     |  RK3588  |
| MiniCPM3-4B    | w4a16      |   64   |     320     |    256     |  1397.00  |   4.80   |    2.70     |  RK3576  |
|                | w4a16_g128 |   64   |     320     |    256     |  1645.00  |   4.39   |    2.80     |  RK3576  |
|                | w8a8       |   64   |     320     |    256     |  702.18   |   6.15   |    4.65     |  RK3588  |
|                | w8a8_g128  |   64   |     320     |    256     |  1691.00  |   3.42   |    5.06     |  RK3588  |
| llama3-8B      | w4a16      |   64   |     320     |    256     |  1607.98  |   3.60   |    5.63     |  RK3576  |
|                | w4a16_g128 |   64   |     320     |    256     |  2010.00  |   3.00   |    5.76     |  RK3576  |
|                | w8a8       |   64   |     320     |    256     |  1128.00  |   3.79   |    9.21     |  RK3588  |
|                | w8a8_g512  |   64   |     320     |    256     |  1281.35  |   3.05   |    9.45     |  RK3588  |
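
In the dtype column, a tag such as `w4a16_g128` reads as 4-bit weights (`w4`) with 16-bit activations (`a16`), and the optional `_g128` suffix denotes group-wise quantization with a group size of 128 (no suffix means the scheme is not group-quantized). The helper below is our own illustration of this naming convention, not part of the RKLLM SDK:

```python
import re

def parse_dtype(tag: str) -> dict:
    """Decode a quantization tag like 'w4a16_g128' into its parts.

    Reading used in the table above: wN = N-bit weights, aM = M-bit
    activations, optional _gK = group-wise quantization with group size K.
    """
    m = re.fullmatch(r"w(\d+)a(\d+)(?:_g(\d+))?", tag)
    if m is None:
        raise ValueError(f"unrecognized dtype tag: {tag!r}")
    wbits, abits, group = m.groups()
    return {
        "weight_bits": int(wbits),
        "activation_bits": int(abits),
        "group_size": int(group) if group else None,
    }
```

For example, `parse_dtype("w8a8_g512")` yields 8-bit weights, 8-bit activations, and a group size of 512.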
| multimodal model | image input size | vision model dtype | vision infer time (s) | vision memory (MB) | llm model dtype | seqlen | max_context | new_tokens | TTFT (ms) | Tokens/s | llm memory (GB) | platform |
| :--------------- | :--------------- | :----------------: | :-------------------: | :----------------: | :-------------: | :----: | :---------: | :--------: | :-------: | :------: | :-------------: | :------: |
| Qwen2-VL-2B      | (1, 3, 392, 392) |        fp16        |         3.55          |      1436.52       |      w4a16      |  256   |     384     |    128     |  2094.17  |  13.23   |      1.75       |  RK3576  |
|                  |                  |        fp16        |         3.28          |      1436.52       |      w8a8       |  256   |     384     |    128     |  856.86   |  16.19   |      2.47       |  RK3588  |
| MiniCPM-V-2_6    | (1, 3, 448, 448) |        fp16        |         2.40          |      1031.30       |      w4a16      |  128   |     256     |    128     |  2997.70  |   3.84   |      5.50       |  RK3576  |
|                  |                  |        fp16        |         3.27          |       976.98       |      w8a8       |  128   |     256     |    128     |  1720.60  |   4.13   |      8.88       |  RK3588  |
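
Read together, these columns give a rough end-to-end latency: the vision encoder time (for multimodal models), plus TTFT for the prefill, plus new_tokens divided by the decode rate. The sketch below is our own back-of-the-envelope helper; it ignores tokenization cost and any pipelining between stages:

```python
def estimated_latency_s(new_tokens: int, ttft_ms: float, tokens_per_s: float,
                        vision_infer_s: float = 0.0) -> float:
    """Rough end-to-end latency: optional vision encode, then prefill
    (reported as TTFT), then autoregressive decode at the reported rate."""
    return vision_infer_s + ttft_ms / 1000.0 + new_tokens / tokens_per_s
```

For the Qwen2-VL-2B w8a8 row on RK3588, `estimated_latency_s(128, 856.86, 16.19, 3.28)` comes to roughly 12 seconds.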
- These performance data were collected with version 1.1.0, with each platform's CPU and NPU running at their maximum frequencies.
- The script for setting the frequencies is located in the scripts directory.
- The vision models were tested using all NPU cores with rknn-toolkit2 version 2.2.0.
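
The official frequency-setting script ships in the scripts directory; the sketch below only illustrates the general idea (pinning the cpufreq governor to `performance` via sysfs). The sysfs paths are typical Linux locations and may differ per board and kernel, so treat them as assumptions:

```shell
# Illustrative only -- on real hardware, use the script in this repo's
# scripts directory. SYSFS_ROOT is overridable so the sketch can be
# exercised against a scratch directory instead of the live /sys tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

fix_cpu_governor() {
    # Pin every cpufreq policy to the "performance" governor so benchmark
    # numbers are not skewed by dynamic frequency scaling.
    for gov in "$SYSFS_ROOT"/devices/system/cpu/cpufreq/policy*/scaling_governor; do
        [ -w "$gov" ] && echo performance > "$gov"
    done
}
```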
# Download

1. You can download the **latest package** from [RKLLM_SDK](https://console.zbox.filez.com/l/RJJDmB) (fetch code: rkllm).
2. You can download the **converted rkllm models** from [rkllm_model_zoo](https://console.box.lenovo.com/l/l0tXb8) (fetch code: rkllm).
# Examples

1. Multimodal deployment demo: [Qwen2-VL-2B_Demo](https://github.com/airockchip/rknn-llm/tree/main/examples/Qwen2-VL-2B_Demo)
2. API usage demo: [DeepSeek-R1-Distill-Qwen-1.5B_Demo](https://github.com/airockchip/rknn-llm/tree/main/examples/DeepSeek-R1-Distill-Qwen-1.5B_Demo)
3. API server demo: [rkllm_server_demo](https://github.com/airockchip/rknn-llm/tree/main/examples/rkllm_server_demo)
# Note
- Add support for models such as Llama3, Gemma2, and MiniCPM3.
- Resolve catastrophic forgetting issue when the number of tokens exceeds max_context.

For older versions, please refer to the [CHANGELOG](CHANGELOG.md).