The kernel-path bandwidth utility is:

```bash
python -m benchmarks.prototype.blockwise_fp8_training.benchmark_quant_kernel_bandwidth
```

To additionally validate Triton outputs against the Torch reference
implementations:

```bash
python -m benchmarks.prototype.blockwise_fp8_training.benchmark_quant_kernel_bandwidth --check-correctness
```

What it reports:

- `kernel_us`: measured runtime of the public quantization wrapper call
- `effective_logical_io_gbps`: logical tensor IO bytes divided by measured time
- `logical_io_vs_achievable_%`: `effective_logical_io_gbps / achievable_bandwidth_gbps`

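The two derived metrics above follow directly from the reported quantities. A minimal sketch of the arithmetic (the helper names are illustrative, not the benchmark's actual API):

```python
def effective_logical_io_gbps(logical_io_bytes: int, kernel_us: float) -> float:
    # Logical tensor IO bytes divided by the measured wrapper runtime
    # (microseconds), expressed in GB/s.
    return logical_io_bytes / (kernel_us * 1e-6) / 1e9

def logical_io_vs_achievable_pct(effective_gbps: float, achievable_gbps: float) -> float:
    # Achieved logical-IO bandwidth as a percentage of the achievable reference.
    return effective_gbps / achievable_gbps * 100.0

# Hypothetical example: ~400 MB of logical IO moved in 150 us.
bw = effective_logical_io_gbps(400_000_000, 150.0)   # ~2666.7 GB/s
pct = logical_io_vs_achievable_pct(bw, 3084.1)       # ~86.5
```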
Notes:

- The benchmark times the public wrapper functions in
  `torchao.prototype.blockwise_fp8_training.kernels`.
- `--check-correctness` runs the matching Torch reference path once per valid
  kernel and shape before reporting results. This adds overhead and is intended
  for validation, not headline timing runs.
- The bandwidth number uses the expected tensor IO footprint, not hardware DRAM
  counters.
- Peak bandwidth defaults to CUDA device properties. `--use-roofline-utils`
Environment:
- Peak bandwidth reference: `3352.3 GB/s`
- Peak bandwidth source: `cuda_device_properties`
- Achievable bandwidth reference: `3084.1 GB/s`
- Achievable bandwidth uses `92.0%` of peak bandwidth
- Achievable bandwidth source: `roofline_utils_pct_achievable_mem_bw`

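The achievable reference is consistent with the stated fraction of peak: `3352.3 GB/s x 0.92` rounds to `3084.1 GB/s`. A one-line check:

```python
peak_gbps = 3352.3       # peak bandwidth reference (cuda_device_properties)
pct_achievable = 0.92    # fraction of peak from the roofline utils
achievable_gbps = round(peak_gbps * pct_achievable, 1)
print(achievable_gbps)   # 3084.1
```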
### Per-shape Results
Tested with shapes `32768x4096` and `131072x4096` to reflect real-world training workloads:

| kernel | shape | kernel_us | effective_logical_io_gbps | logical_io_vs_achievable_% |
|---|---|---:|---:|---:|
| act_quant_transposed_lhs | 32768x4096 | 154.46 | 2633.9 | 85.4 |
| weight_quant_transposed_rhs | 32768x4096 | 150.53 | 2675.2 | 86.7 |
| act_quant_lhs | 32768x4096 | 150.86 | 2696.8 | 87.4 |
| act_quant_rhs | 32768x4096 | 148.70 | 2736.0 | 88.7 |
| weight_quant_rhs | 32768x4096 | 144.99 | 2777.3 | 90.1 |
| weight_quant_transposed_rhs | 131072x4096 | 581.89 | 2768.1 | 89.8 |
| act_quant_lhs | 131072x4096 | 586.98 | 2772.5 | 89.9 |
| act_quant_transposed_lhs | 131072x4096 | 581.47 | 2798.7 | 90.7 |
| act_quant_rhs | 131072x4096 | 562.56 | 2892.8 | 93.8 |
| weight_quant_rhs | 131072x4096 | 555.30 | 2900.7 | 94.1 |
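As a sanity check, the `logical_io_vs_achievable_%` column can be reproduced from `effective_logical_io_gbps` and the `3084.1 GB/s` achievable reference reported above (spot-checking a few rows):

```python
achievable_gbps = 3084.1  # achievable bandwidth reference from this run

# (effective_logical_io_gbps, reported logical_io_vs_achievable_%) pairs
rows = [
    (2633.9, 85.4),
    (2777.3, 90.1),
    (2900.7, 94.1),
]
for gbps, reported in rows:
    # Recompute the percentage and compare with the table value.
    assert round(gbps / achievable_gbps * 100, 1) == reported
```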