vllm bench throughput --num_prompts 32 --input_len 4096 --output_len 32 --max_model_len 4128
vllm bench throughput --num_prompts 128 --input_len 32 --output_len 2048 --max_model_len 2080
```

### Microbenchmarks and roofline model

The following set of microbenchmarks shows the roofline-expected and observed execution times of
a `ReLU -> Linear` toy model across a sweep of (M, K, N) shapes, with the activation
shaped (M, K) and the weight shaped (K, N). This can be used to estimate the expected speedup
of quantizing `torch.nn.Linear` layers with various recipes, based on the shapes in your model
during inference.

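For concreteness, here is a minimal sketch of what one (M, K, N) point of such a sweep could look like. The model definition follows the description above, while the timing harness and shapes are illustrative, not the benchmark's actual code:

```python
import torch
from torch.utils.benchmark import Timer

class ToyModel(torch.nn.Module):
    """The ReLU -> Linear toy model described above."""

    def __init__(self, K: int, N: int):
        super().__init__()
        self.linear = torch.nn.Linear(K, N, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(torch.nn.functional.relu(x))

# One illustrative (M, K, N) point of the sweep: activation (M, K), weight (K, N).
M, K, N = 1024, 4096, 4096
model = ToyModel(K, N).to(device="cuda", dtype=torch.bfloat16)
x = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)

# torch.utils.benchmark.Timer handles CUDA synchronization and warmup.
timer = Timer(stmt="model(x)", globals={"model": model, "x": x})
print(f"observed bf16 time: {timer.timeit(100).median * 1e6:.1f} us")
```
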
Explanation: to see a speedup from quantizing `activation -> gemm` during inference, we want

```
(bf16_activation_time + bf16_gemm_time) > (bf16_activation_and_quantize_tensor_time + fp8_gemm_time)
```

In a perfect world (and in our roofline model), the following always hold (see the sketch after this list):
1. `bf16_activation_time > bf16_activation_and_quantize_tensor_time`, because the bf16
activation kernel reads and writes `M*K` bf16 elements (`M*K*4` bytes of memory traffic),
while `bf16_activation_and_quantize_tensor` is a single fused kernel that reads `M*K` bf16
elements and writes `M*K` fp8 elements (`M*K*3` bytes of memory traffic).
2. `bf16_gemm_time > fp8_gemm_time`, because fp8 gemm has ~2x the peak throughput of bf16 gemm.
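
Under these assumptions, both conditions reduce to arithmetic over bytes moved and flops executed. A minimal sketch of that roofline arithmetic; the peak bandwidth and throughput defaults below are placeholders, not measured specs, so substitute your accelerator's actual values:

```python
def roofline_speedup(M: int, K: int, N: int,
                     mem_bw: float = 8.0e12,       # bytes/s, placeholder peak HBM bandwidth
                     bf16_flops: float = 2.25e15,  # flop/s, placeholder bf16 peak
                     fp8_flops: float = 4.5e15):   # flop/s, placeholder fp8 peak (~2x bf16)
    """Idealized speedup of (fused relu+quantize, fp8 gemm) over (relu, bf16 gemm)."""
    # (1) pointwise ops are memory bound: bf16 relu moves M*K*(2+2) bytes,
    # while the fused relu+quantize kernel moves M*K*(2+1) bytes (bf16 in, fp8 out).
    bf16_act_time = M * K * 4 / mem_bw
    fused_act_quant_time = M * K * 3 / mem_bw
    # (2) gemms are compute bound: 2*M*K*N flops each, fp8 at ~2x peak throughput.
    bf16_gemm_time = 2 * M * K * N / bf16_flops
    fp8_gemm_time = 2 * M * K * N / fp8_flops
    return (bf16_act_time + bf16_gemm_time) / (fused_act_quant_time + fp8_gemm_time)

print(f"roofline speedup: {roofline_speedup(1024, 4096, 4096):.2f}x")
```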

In the real world, neither (1) nor (2) always holds, due to kernel launch overhead, kernel
efficiency, lack of fusion for some recipes, etc. Therefore, the observed speedups are often
significantly below the roofline peak. In general, you should expect the observed speedup from
inference quantization to increase as M, K, and N increase.
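
To see where reality diverges from the roofline, one can time the two gemms directly. Below is a sketch using `torch._scaled_mm` for the fp8 matmul; note this is a private op whose signature has changed across PyTorch releases, and tensorwise (scalar) scales plus a column-major second operand are assumed here:

```python
import torch
from torch.utils.benchmark import Timer

M, K, N = 1024, 4096, 4096
a_bf16 = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
b_bf16 = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)

# torch._scaled_mm requires a column-major second operand; b_fp8.t() is a
# (K, N) column-major view because b_fp8 is (N, K) row-major.
a_fp8 = a_bf16.to(torch.float8_e4m3fn)
b_fp8 = b_bf16.to(torch.float8_e4m3fn)
one = torch.tensor(1.0, device="cuda")  # dummy tensorwise scales

bf16_t = Timer("a @ b.t()", globals={"a": a_bf16, "b": b_bf16}).timeit(100).median
fp8_t = Timer(
    "torch._scaled_mm(a, b.t(), scale_a=s, scale_b=s, out_dtype=torch.bfloat16)",
    globals={"torch": torch, "a": a_fp8, "b": b_fp8, "s": one},
).timeit(100).median
print(f"observed gemm-only speedup: {bf16_t / fp8_t:.2f}x")
```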

#### NVIDIA B200
