Commit 7bb7f06
Optimize FP8 colwise scales kernel for AMD GPUs in MoE backward pass (#3972)
Two interdependent changes that together yield a ~4.3-4.7x end-to-end training
throughput improvement on MI300X for DeepSeek-MoE-16B:
1. Remove the redundant .t().contiguous().t() memory copies made before calling
triton_fp8_per_group_colwise_scales in the backward pass. The kernel already
handles arbitrary strides via its stride parameters, so these full-tensor
copies are unnecessary (see the first sketch after this list).
2. Use larger Triton autotune configs (BLOCK_SIZE=128/256, BLOCK_SIZE_ITER=128/256)
for the colwise scales kernel on AMD GPUs. With row-major input (from change 1),
larger block sizes enable contiguous column access patterns and cut the grid
block count by 4-8x (see the second sketch after this list).
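
A minimal sketch of change 1, assuming a typical call-site shape. The helper
colwise_amax_stub is hypothetical, standing in for
triton_fp8_per_group_colwise_scales; like the real kernel, it reads its input
through the tensor's strides, so no layout-normalizing copy is needed:

```python
import torch

def colwise_amax_stub(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for triton_fp8_per_group_colwise_scales:
    # a per-column abs-max reduction. Like the real kernel, it reads x
    # through its strides, so row-major input needs no contiguous copy.
    return x.abs().amax(dim=0)

x = torch.randn(4096, 2048)

# Before: .t().contiguous() materializes a full transposed copy, and the
# trailing .t() views it back: identical values, column-major memory,
# and two extra full-tensor passes on every backward call.
scales_old = colwise_amax_stub(x.t().contiguous().t())

# After: pass the row-major tensor directly; the copies were redundant.
scales_new = colwise_amax_stub(x)

assert torch.equal(scales_old, scales_new)
```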
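And a sketch of change 2. Only the parameter names BLOCK_SIZE/BLOCK_SIZE_ITER
and the AMD sizes 128/256 come from this commit; the kernel body, the non-AMD
baseline sizes (32/64), and the HIP check are illustrative assumptions:

```python
import torch
import triton
import triton.language as tl

IS_HIP = torch.version.hip is not None  # assumed AMD/ROCm detection

# Assumed baseline sizes off AMD; the commit only specifies the AMD ones.
_SIZES = (128, 256) if IS_HIP else (32, 64)
_configs = [
    triton.Config({"BLOCK_SIZE": bs, "BLOCK_SIZE_ITER": bsi})
    for bs in _SIZES
    for bsi in _SIZES
]

@triton.autotune(configs=_configs, key=["M", "N"])
@triton.jit
def _colwise_amax_kernel(
    x_ptr, out_ptr,
    M, N,
    stride_m, stride_n,  # explicit strides: any input layout works
    BLOCK_SIZE: tl.constexpr,
    BLOCK_SIZE_ITER: tl.constexpr,
):
    # Each program reduces a BLOCK_SIZE-wide slab of columns, sweeping the
    # rows in BLOCK_SIZE_ITER chunks. With row-major input, every tile row
    # is a contiguous run of BLOCK_SIZE elements, and the grid holds
    # cdiv(N, BLOCK_SIZE) programs, so 4-8x larger blocks mean 4-8x fewer
    # grid blocks.
    pid = tl.program_id(0)
    cols = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    col_mask = cols < N
    acc = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)
    for start in range(0, M, BLOCK_SIZE_ITER):
        rows = start + tl.arange(0, BLOCK_SIZE_ITER)
        mask = (rows[:, None] < M) & col_mask[None, :]
        tile = tl.load(
            x_ptr + rows[:, None] * stride_m + cols[None, :] * stride_n,
            mask=mask, other=0.0,
        ).to(tl.float32)
        acc = tl.maximum(acc, tl.max(tl.abs(tile), axis=0))
    tl.store(out_ptr + cols, acc, mask=col_mask)

def colwise_amax(x: torch.Tensor) -> torch.Tensor:
    # Launcher: block sizes are picked by the autotuner per (M, N).
    M, N = x.shape
    out = torch.empty(N, device=x.device, dtype=torch.float32)
    grid = lambda meta: (triton.cdiv(N, meta["BLOCK_SIZE"]),)
    _colwise_amax_kernel[grid](x, out, M, N, x.stride(0), x.stride(1))
    return out
```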
Benchmarked on 8x MI300X with DeepSeek-MoE-16B (EP=8, seq_len=4096); TPS is
tokens per second:
- Batch size 1: 136 TPS -> 642 TPS (4.7x)
- Batch size 4: 500 TPS -> 2153 TPS (4.3x)
Co-authored-by: Li <lizli102@ctr2-alola-login-01.amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 4ae435e
2 files changed
Lines changed: 16 additions & 14 deletions in total
[Diff bodies not preserved in extraction. File 1: two hunks around original
lines 214-233, each replacing three deleted lines with a single added line;
these are likely the two backward-pass call sites where the .t().contiguous().t()
copies were removed (change 1).]
Lines changed: 14 additions & 8 deletions
[Diff body not preserved in extraction. File 2: one hunk around original lines
41-60, replacing eight deleted lines (44-51) with fourteen added lines (new
lines 44-57); consistent with the expanded autotune config list from change 2.]