
Commit f04500f

Update inference.md with more context on roofline model (#3991)
1 parent ee3d62a commit f04500f

1 file changed

Lines changed: 22 additions & 4 deletions


docs/source/workflows/inference.md

@@ -96,12 +96,30 @@ vllm bench throughput --num_prompts 32 --input_len 4096 --output_len 32 --max_mo
 vllm bench throughput --num_prompts 128 --input_len 32 --output_len 2048 --max_model_len 2080
 ```
 
-### Microbenchmarks
+### Microbenchmarks and roofline model
 
-The following set of microbenchmarks measures the roofline peak and observed execution time of
-a `ReLU -> Linear` toy model swept across various (M, K, N) shapes, with the activation
+The following set of microbenchmarks shows the roofline-expected and observed execution times of
+a `ReLU -> Linear` toy model across a sweep of (M, K, N) shapes, with the activation
 shaped (M, K) and the weight shaped (K, N). This can be used to estimate the expected speedup
-of quantizing `torch.nn.Linear` layers with various recipes based on the activation and weight shapes.
+of quantizing `torch.nn.Linear` layers with various recipes, based on the shapes in your model
+during inference.
+
+Explanation: to see a speedup from quantizing `activation -> gemm` during inference, we want
+
+```
+(bf16_activation_time + bf16_gemm_time) > (bf16_activation_and_quantize_tensor_time + fp8_gemm_time)
+```
+
+In a perfect world (and in our roofline model):
+1. `bf16_activation_time > bf16_activation_and_quantize_tensor_time` is always true,
+because `bf16_activation` reads+writes `M*K*2` bytes and `bf16_activation_and_quantize_tensor` is a single
+fused kernel that reads+writes `M*K*1.5` bytes.
+2. `bf16_gemm_time > fp8_gemm_time` is always true, as fp8 gemm has ~2x the peak throughput of bf16 gemm.
+
+In the real world, neither (1) nor (2) always holds, due to kernel launch overhead, kernel efficiency,
+lack of fusion for some recipes, etc. Therefore, the observed speedups are often significantly
+below the roofline peak. In general, you should expect the observed speedup from inference quantization
+to increase as M, K, and N increase.
 
 #### NVIDIA B200
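
To make the toy-model setup in the diff above concrete, here is a minimal sketch of a `ReLU -> Linear` model and a bf16 timing sweep. The class name, shape list, and timing helper are illustrative, not the benchmark's actual code:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """ReLU -> Linear: activation shaped (M, K), weight shaped (K, N) in the
    math sense (nn.Linear stores the weight transposed, as (N, K))."""
    def __init__(self, K: int, N: int):
        super().__init__()
        self.linear = nn.Linear(K, N, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(torch.relu(x))

def time_ms(model: nn.Module, x: torch.Tensor, iters: int = 10) -> float:
    for _ in range(5):  # warmup
        model(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Sweep a few (M, K, N) shapes; the real benchmarks sweep many more.
for M, K, N in [(1024, 1024, 1024), (4096, 4096, 4096), (16384, 8192, 8192)]:
    model = ToyModel(K, N).cuda().to(torch.bfloat16)
    x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
    print(f"M={M} K={K} N={N}: {time_ms(model, x):.3f} ms (bf16)")
```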
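
The roofline side of the comparison is just byte and FLOP counting divided by hardware peaks. A sketch of that arithmetic follows; the bandwidth and throughput constants are placeholders, not any particular GPU's specs:

```python
# Roofline estimate for ReLU -> Linear with activation (M, K) and weight (K, N).
# The peak numbers below are illustrative placeholders; substitute your GPU's
# datasheet values.
PEAK_MEM_BW = 8.0e12        # bytes/sec
PEAK_BF16_FLOPS = 2.0e15    # FLOP/sec
PEAK_FP8_FLOPS = 4.0e15     # FLOP/sec, ~2x the bf16 peak

def roofline_speedup(M: int, K: int, N: int) -> float:
    gemm_flops = 2 * M * K * N
    # bf16 path: ReLU reads M*K bf16 elements (2 bytes) and writes M*K bf16.
    bf16_act_s = (M * K * 2 + M * K * 2) / PEAK_MEM_BW
    bf16_gemm_s = gemm_flops / PEAK_BF16_FLOPS
    # fp8 path: one fused kernel reads M*K bf16 (2 bytes) and writes M*K fp8
    # (1 byte) -- matching the `M*K*1.5` bytes figure above (vs `M*K*2` for bf16).
    fused_act_quant_s = (M * K * 2 + M * K * 1) / PEAK_MEM_BW
    fp8_gemm_s = gemm_flops / PEAK_FP8_FLOPS
    return (bf16_act_s + bf16_gemm_s) / (fused_act_quant_s + fp8_gemm_s)

# The predicted speedup grows with M, K, N and approaches the 2x fp8 gemm peak.
for shape in [(256, 1024, 1024), (4096, 4096, 4096), (16384, 8192, 8192)]:
    print(shape, f"{roofline_speedup(*shape):.2f}x")
```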
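
For the observed side, one way to measure real speedup is to quantize the same toy model and re-time it. A sketch assuming a recent torchao, where `Float8DynamicActivationFloat8WeightConfig` is one recipe among several:

```python
import copy

import torch
import torch.nn as nn
# Assumes a recent torchao; older releases expose this recipe under a
# different name (e.g. float8_dynamic_activation_float8_weight).
from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig

M, K, N = 16384, 8192, 8192
model = nn.Sequential(nn.ReLU(), nn.Linear(K, N, bias=False)).cuda().to(torch.bfloat16)
x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)

quantized = copy.deepcopy(model)
quantize_(quantized, Float8DynamicActivationFloat8WeightConfig())  # fp8 dynamic act + weight

def time_ms(m: nn.Module, iters: int = 10) -> float:
    for _ in range(5):  # warmup
        m(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        m(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Expect this to land below the roofline estimate, and to improve as M, K, N grow.
print(f"observed speedup: {time_ms(model) / time_ms(quantized):.2f}x")
```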