Commit 2c4ecb0

remove per kernel fp8 triton quantized benchmarks (#4166)
1 parent 4611835 commit 2c4ecb0

7 files changed

Lines changed: 169 additions & 1430 deletions

benchmarks/prototype/blockwise_fp8_training/README.md

Lines changed: 23 additions & 13 deletions
````diff
@@ -11,17 +11,26 @@ The kernel-path bandwidth utility is:
 python -m benchmarks.prototype.blockwise_fp8_training.benchmark_quant_kernel_bandwidth
 ```
 
+To additionally validate Triton outputs against the Torch reference
+implementations:
+
+```bash
+python -m benchmarks.prototype.blockwise_fp8_training.benchmark_quant_kernel_bandwidth --check-correctness
+```
+
 What it reports:
 
 - `kernel_us`: measured runtime of the public quantization wrapper call
 - `effective_logical_io_gbps`: logical tensor IO bytes divided by measured time
-- `logical_io_vs_peak_%`: `effective_logical_io_gbps / peak_bandwidth_gbps`
 - `logical_io_vs_achievable_%`: `effective_logical_io_gbps / achievable_bandwidth_gbps`
 
 Notes:
 
 - The benchmark times the public wrapper functions in
   `torchao.prototype.blockwise_fp8_training.kernels`.
+- `--check-correctness` runs the matching Torch reference path once per valid
+  kernel and shape before reporting results. This adds overhead and is intended
+  for validation, not headline timing runs.
 - The bandwidth number uses the expected tensor IO footprint, not hardware DRAM
   counters.
 - Peak bandwidth defaults to CUDA device properties. `--use-roofline-utils`
@@ -51,20 +60,21 @@ Environment:
 - Peak bandwidth reference: `3352.3 GB/s`
 - Peak bandwidth source: `cuda_device_properties`
 - Achievable bandwidth reference: `3084.1 GB/s`
+- Achievable bandwidth uses `92.0%` of peak bandwidth
 - Achievable bandwidth source: `roofline_utils_pct_achievable_mem_bw`
 
 ### Per-shape Results
 Tested with shapes 32768 and 131072 to reflect real world training:
 
-| kernel | shape | kernel_us | effective_logical_io_gbps | logical_io_vs_peak_% | logical_io_vs_achievable_% |
-|---|---|---:|---:|---:|---:|
-| act_quant_transposed_lhs | 32768x4096 | 154.46 | 2633.9 | 78.6 | 85.4 |
-| weight_quant_transposed_rhs | 32768x4096 | 150.53 | 2675.2 | 79.8 | 86.7 |
-| act_quant_lhs | 32768x4096 | 150.86 | 2696.8 | 80.4 | 87.4 |
-| act_quant_rhs | 32768x4096 | 148.70 | 2736.0 | 81.6 | 88.7 |
-| weight_quant_rhs | 32768x4096 | 144.99 | 2777.3 | 82.8 | 90.1 |
-| weight_quant_transposed_rhs | 131072x4096 | 581.89 | 2768.1 | 82.6 | 89.8 |
-| act_quant_lhs | 131072x4096 | 586.98 | 2772.5 | 82.7 | 89.9 |
-| act_quant_transposed_lhs | 131072x4096 | 581.47 | 2798.7 | 83.5 | 90.7 |
-| act_quant_rhs | 131072x4096 | 562.56 | 2892.8 | 86.3 | 93.8 |
-| weight_quant_rhs | 131072x4096 | 555.30 | 2900.7 | 86.5 | 94.1 |
+| kernel | shape | kernel_us | effective_logical_io_gbps | logical_io_vs_achievable_% |
+|---|---|---:|---:|---:|
+| act_quant_transposed_lhs | 32768x4096 | 154.46 | 2633.9 | 85.4 |
+| weight_quant_transposed_rhs | 32768x4096 | 150.53 | 2675.2 | 86.7 |
+| act_quant_lhs | 32768x4096 | 150.86 | 2696.8 | 87.4 |
+| act_quant_rhs | 32768x4096 | 148.70 | 2736.0 | 88.7 |
+| weight_quant_rhs | 32768x4096 | 144.99 | 2777.3 | 90.1 |
+| weight_quant_transposed_rhs | 131072x4096 | 581.89 | 2768.1 | 89.8 |
+| act_quant_lhs | 131072x4096 | 586.98 | 2772.5 | 89.9 |
+| act_quant_transposed_lhs | 131072x4096 | 581.47 | 2798.7 | 90.7 |
+| act_quant_rhs | 131072x4096 | 562.56 | 2892.8 | 93.8 |
+| weight_quant_rhs | 131072x4096 | 555.30 | 2900.7 | 94.1 |
````
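The metrics defined in the README diff above follow from two simple formulas. A minimal sketch of that arithmetic, using the `92.0%` achievable fraction the diff adds (the helper names here are illustrative, not torchao's actual implementation):

```python
# Illustrative sketch of the README's reported metrics.
# These helper names are assumptions, not torchao's actual code.

def effective_logical_io_gbps(logical_io_bytes: int, kernel_us: float) -> float:
    """Logical tensor IO bytes divided by measured kernel time, in GB/s."""
    seconds = kernel_us * 1e-6
    return logical_io_bytes / seconds / 1e9

def logical_io_vs_achievable_pct(
    effective_gbps: float,
    peak_gbps: float,
    pct_achievable: float = 0.92,  # the 92.0%-of-peak fraction from the README
) -> float:
    """Percent of achievable bandwidth, where achievable = pct_achievable * peak."""
    return 100.0 * effective_gbps / (peak_gbps * pct_achievable)
```

As a sanity check against the table: `3352.3 GB/s` peak at 92% gives the `3084.1 GB/s` achievable reference, and the `weight_quant_rhs` row at `2900.7 GB/s` lands at roughly `94.1%` of achievable, matching the last row.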

benchmarks/prototype/blockwise_fp8_training/bench_triton_fp8_blockwise_act_quant_lhs.py

Lines changed: 0 additions & 224 deletions
This file was deleted.
