
Commit 7bb7f06

lizamd authored
Optimize FP8 colwise scales kernel for AMD GPUs in MoE backward pass (#3972)
Two interdependent changes that together yield a ~4.3x end-to-end training throughput improvement on MI300X for DeepSeek-MoE-16B:

1. Remove redundant .t().contiguous().t() memory copies before calling triton_fp8_per_group_colwise_scales in the backward pass. The kernel already handles arbitrary strides via its stride parameters, so these full-tensor copies are unnecessary.
2. Use larger Triton autotune configs (BLOCK_SIZE=128/256, BLOCK_SIZE_ITER=128/256) for the colwise scales kernel on AMD GPUs. With row-major input (from change 1), larger block sizes enable contiguous column access patterns, reducing the grid block count by 4-8x.

Benchmarked on 8x MI300X with DeepSeek-MoE-16B (EP=8, seq_len=4096):

- Batch size 1: 136 TPS -> 642 TPS (4.7x)
- Batch size 4: 500 TPS -> 2153 TPS (4.3x)

Co-authored-by: Li <lizli102@ctr2-alola-login-01.amd.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
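The first change leans on the fact that a stride-parameterized kernel addresses elements as base + row * stride_row + col * stride_col, so it can read a row-major or column-major tensor equally well. A minimal pure-Python sketch of that addressing scheme (names here are illustrative, not from the kernel):

```python
def load(buf, r, c, stride_r, stride_c):
    # Stride-aware addressing, the way Triton pointer arithmetic works:
    # flat offset = r * stride_r + c * stride_c.
    return buf[r * stride_r + c * stride_c]

rows, cols = 3, 4
vals = list(range(rows * cols))
row_major = vals  # element (r, c) lives at r * cols + c; strides (cols, 1)
col_major = [vals[r * cols + c] for c in range(cols) for r in range(rows)]  # strides (1, rows)

# The same (r, c) element is reachable in either layout just by passing the
# right strides, so no .t().contiguous().t() round-trip copy is needed.
for r in range(rows):
    for c in range(cols):
        assert load(row_major, r, c, cols, 1) == load(col_major, r, c, 1, rows)
```

This is exactly why the copies below could be dropped: the layout transformation changed only the strides, never the logical values.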
1 parent 4ae435e commit 7bb7f06

2 files changed: 16 additions & 14 deletions

torchao/prototype/moe_training/fp8_grouped_mm.py

Lines changed: 2 additions & 6 deletions

@@ -214,9 +214,7 @@ def backward(ctx, grad_output: torch.Tensor):
         # needed for grad_B: grad_output_t @ A
         # Use transpose method to avoid uncoalesced memory accesses.
         grad_out_data_colwise, grad_out_scales = triton_fp8_per_group_colwise_scales(
-            grad_output.t()
-            .contiguous()
-            .t(),  # Quantization is over 2x faster when input is col major, even with this transformation
+            grad_output,
             offs,
             float8_dtype,
             round_scales_to_power_of_2=True,
@@ -225,9 +223,7 @@ def backward(ctx, grad_output: torch.Tensor):
         grad_output_t_scales = grad_out_scales.t()

         A_data_col_major, A_scales = triton_fp8_per_group_colwise_scales(
-            A.t()
-            .contiguous()
-            .t(),  # Quantization is over 2x faster when input is col major, even with this transformation
+            A,
             offs,
             float8_dtype,
             round_scales_to_power_of_2=True,
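Each removed .t().contiguous() materialized a full extra copy of the tensor before the kernel ran: one complete read plus one complete write of its data. A back-of-the-envelope sketch of that traffic, using an assumed shape and bf16 dtype (neither is stated in the commit), for one of the two call sites:

```python
# Hypothetical shape and dtype for illustration only (not from the commit).
M, N = 4096, 2048      # assumed grad_output shape
bytes_per_el = 2       # bf16

# t().contiguous() reads every element once and writes every element once.
copy_traffic = 2 * M * N * bytes_per_el
print(copy_traffic // 2**20, "MiB of extra memory traffic per call")  # prints 32
```

The backward pass did this twice per step (once for grad_output, once for A), so the saving compounds across calls in addition to the kernel-side speedup from the new autotune configs.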

torchao/prototype/moe_training/kernels/jagged_float8_scales.py

Lines changed: 14 additions & 8 deletions
@@ -41,14 +41,20 @@
 if torch.version.hip is not None:
     kernel_configs_2D = [
         triton.Config(
-            {"BLOCK_SIZE": block_size, "BLOCK_SIZE_ITER": block_size_iter},
-            num_warps=warps,
-            num_stages=stages,
-        )
-        for block_size in [32, 64]
-        for block_size_iter in [64, 128]
-        for warps in [4, 8]
-        for stages in [2, 3]
+            {"BLOCK_SIZE": 128, "BLOCK_SIZE_ITER": 128},
+            num_warps=8,
+            num_stages=2,
+        ),
+        triton.Config(
+            {"BLOCK_SIZE": 128, "BLOCK_SIZE_ITER": 256},
+            num_warps=8,
+            num_stages=2,
+        ),
+        triton.Config(
+            {"BLOCK_SIZE": 256, "BLOCK_SIZE_ITER": 128},
+            num_warps=8,
+            num_stages=2,
+        ),
     ]
 else:
     kernel_configs_2D = [
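The claimed 4-8x grid reduction falls out of ceiling-division grid sizing. As a sketch, assume a launch dimension tiled into ceil(n / BLOCK_SIZE) programs (the kernel's actual grid function is not shown in this diff); moving from the old minimum BLOCK_SIZE of 32 to the new 128/256 then gives:

```python
def cdiv(a, b):
    # Ceiling division, equivalent to triton.cdiv, used for grid sizing.
    return -(-a // b)

n = 4096  # hypothetical launch dimension, not taken from the commit
assert cdiv(n, 32) // cdiv(n, 128) == 4   # 128 programs shrink to 32
assert cdiv(n, 32) // cdiv(n, 256) == 8   # 128 programs shrink to 16
```

Fewer, larger programs mean each one covers a contiguous run of columns in the row-major input, which is what makes the larger blocks pay off only after the copies in change 1 are removed.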
