Skip to content

Commit b669575

Browse files
committed
[mxfp8 moe training] add fused dim0+dim1 mxfp8 quantize kernel for backward pass
Adds ``triton_mxfp8_quantize_dim0_dim1`` (new file ``kernels/mxfp8/triton_grad_quantize.py``): a single-pass Triton kernel that reads the bf16 ``grad_out`` tile once and emits four outputs in one launch: * ``qdata_dim0`` - ``(M, N)`` e4m3 row-major, rowwise scales along N * ``qdata_dim1_t`` - ``(N, M)`` e4m3 row-major (logical transpose of the colwise-quantized tile), colwise scales along M * both e8m0 scale tensors already in the tcgen05 blocked layout (``triton_mx_block_rearrange`` output, no separate rearrange pass needed) Replaces the current backward-pass sequence of ``triton_to_mxfp8_dim0 + triton_to_mxfp8_dim1 + 2x triton_mx_block_rearrange_2d_M_groups`` (four Triton launches, bf16 read twice, scale tensor written twice). Key implementation points: * Single bf16 tile load; both rowwise and colwise scales are computed via pure 3D reshapes of the SAME tile (no bf16 transpose). * No ``tl.trans`` on fp8 either: the dim1 output is stored with ``offset = n * M + m`` directly on the (BLOCK_M, BLOCK_N) tile, matching how the existing ``to_mxfp8_dim1_kernel`` handles its col-major write. * Scales are emitted directly in tcgen05 blocked layout by computing the ``(r % 32) * 16 + (r // 32) * 4 + c`` super-tile offset for every scale element and issuing one ``tl.store`` per scale tensor. * Autotuned over BLOCK_N in {128, 256, 512} x num_warps in {4, 8, 16} x num_stages in {2, 3, 4} with an ``early_config_prune`` that drops configs where BLOCK_N > N. Numerics: new ``test_triton_mxfp8_quantize_dim0_dim1_numerics`` parametrized over 7 shapes x {rceil, floor} asserts bit-exact parity of both fp8 data tensors and both blocked scales against the 4-kernel reference pipeline. 14/14 pass on B200 (SM 10.0). 
Benchmark (``bench_triton_grad_quantize.py``; do_bench median, B200, rceil): M N 4k_us fused_us 4k_GB/s fused_GB/s speedup 16384 2048 340.0 120.9 401 1128 2.81x 4096 2048 271.8 39.0 125 873 6.96x 8192 2048 291.4 65.6 234 1038 4.44x 32768 2048 451.5 233.5 604 1167 1.93x 16384 5120 459.6 292.8 742 1164 1.57x 16384 7168 547.8 405.5 871 1177 1.35x 8192 5120 358.4 151.6 475 1124 2.36x 8192 7168 394.2 209.9 605 1136 1.88x 32768 5120 761.7 574.6 895 1186 1.33x v1 lands ~1.1-1.2 TB/s (~20-32% of B200 bf16 memcpy, which measures ~5 TB/s on this rig). That is a consistent 1.33-6.96x over the 4-kernel baseline but still short of the "90% memcpy BW" bar -- the kernel is correct and drop-in-compatible, and the remaining headroom is in TMA / warp-specialized stores which are worth their own follow-up diff. Made-with: Cursor
1 parent 77da665 commit b669575

4 files changed

Lines changed: 671 additions & 0 deletions

File tree

Lines changed: 222 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,222 @@
1+
# Copyright (c) Meta Platforms, Inc. and affiliates.
2+
# All rights reserved.
3+
#
4+
# This source code is licensed under the BSD 3-Clause license found in the
5+
# LICENSE file in the root directory of this source tree.
6+
7+
"""Benchmark: fused ``triton_mxfp8_quantize_dim0_dim1`` vs. today's 4-kernel
8+
backward-pass sequence (dim0 quantize + dim0 rearrange + dim1 quantize + dim1
9+
rearrange).
10+
11+
Target (review feedback on ao#4293): close the gap to the B200 bf16->fp8
12+
memcpy ceiling at the DeepSeek-V3-like backward-pass shape
13+
``(num_groups=4, M_per_group=4096, N=2048)`` -> ``(total_M=16384, N=2048)``
14+
and the adjacent sweep. Landing bar is >= 5 TB/s on the realistic shapes.
15+
Measured B200 bf16 memcpy is ~5 TB/s on this rig (the ``_memcpy_bf16_bw_gbps``
16+
helper below measures it live), so ``% memcpy BW`` is the relevant metric.
17+
"""
18+
19+
import argparse
20+
from dataclasses import dataclass
21+
from typing import List
22+
23+
import torch
24+
from tabulate import tabulate
25+
from tqdm import tqdm
26+
27+
from benchmarks.utils import benchmark_cuda_function_in_microseconds
28+
from torchao.prototype.moe_training.kernels.mxfp8 import (
29+
triton_mx_block_rearrange_2d_M_groups,
30+
triton_mxfp8_quantize_dim0_dim1,
31+
)
32+
from torchao.prototype.mx_formats.kernels import (
33+
triton_to_mxfp8_dim0,
34+
triton_to_mxfp8_dim1,
35+
)
36+
from torchao.utils import ceil_div
37+
38+
# All benchmarks in this file run on the current CUDA device.
device = torch.device("cuda")

# NOTE(review): nothing in this file calls torch.compile, so the raised
# dynamo cache limit looks like parity with sibling benchmark scripts —
# confirm it is still needed here.
torch._dynamo.config.cache_size_limit = 1000
41+
42+
43+
@dataclass(frozen=True)
class ExperimentConfig:
    """Problem shape for one benchmark case: the tensor being quantized is
    (M, N) bf16, where M is the flattened total_M across expert groups."""

    # Number of rows (flattened total_M across groups).
    M: int
    # Number of columns (hidden dim).
    N: int
47+
48+
49+
@dataclass(frozen=True)
class ExperimentResult:
    """Timings and derived bandwidth metrics for one (M, N) case."""

    # Latency of the 4-kernel reference pipeline, microseconds.
    four_kernel_us: float
    # Latency of the fused dim0+dim1 kernel, microseconds.
    fused_us: float
    # Effective bandwidth of the reference pipeline, GB/s.
    four_kernel_gbps: float
    # Effective bandwidth of the fused kernel, GB/s.
    fused_gbps: float
    # four_kernel_us / fused_us.
    speedup: float
    # Fused bandwidth as a percentage of the live-measured bf16 memcpy BW.
    memcpy_bw_pct: float
57+
58+
59+
@dataclass(frozen=True)
class Experiment:
    """Pairs one benchmark configuration with its measured result."""

    # The (M, N) shape that was benchmarked.
    config: ExperimentConfig
    # Timings and bandwidths measured for that shape.
    result: ExperimentResult
63+
64+
65+
def get_configs() -> List[ExperimentConfig]:
    """Return the benchmark shape sweep.

    The primary landing target (total_M=16384, N=2048) comes first, then a
    sweep of realistic backward-pass sizes: (num_groups, M_per_group, N)
    combinations such as (4, 4096, 2048) and (8, 4096, 2048), flattened to
    (total_M, N).
    """
    shape_sweep = (
        (16384, 2048),  # primary landing target
        # DeepSeek-V3-like sweeps
        (4096, 2048),
        (8192, 2048),
        (32768, 2048),
        (16384, 5120),
        (16384, 7168),
        (8192, 5120),
        (8192, 7168),
        (32768, 5120),
    )
    return [ExperimentConfig(M=total_m, N=n) for total_m, n in shape_sweep]
83+
84+
85+
def _four_kernel_reference(x: torch.Tensor):
    """Reference backward-pass path: dim0 quantize, dim1 quantize, then one
    blocked-scale rearrange per scale tensor — the same four Triton kernels
    ``_MXFP8GroupedMM.backward`` launches today (minus cross-group
    bookkeeping, which is identical constant overhead on both sides)."""
    num_rows, num_cols = x.shape
    data_dim0, rowwise_scales = triton_to_mxfp8_dim0(
        x, inner_block_size=32, scaling_mode="rceil"
    )
    data_dim1_t, colwise_scales = triton_to_mxfp8_dim1(
        x, inner_block_size=32, scaling_mode="rceil"
    )
    # Single-group offsets keep the rearrange kernel on its standard launch
    # path, matching the production pad-and-blocked-scale flow.
    offsets_m = torch.tensor([num_rows], dtype=torch.int32, device=x.device)
    offsets_n = torch.tensor([num_cols], dtype=torch.int32, device=x.device)
    blocked_scales_dim0 = triton_mx_block_rearrange_2d_M_groups(
        rowwise_scales, offsets_m
    )
    blocked_scales_dim1 = triton_mx_block_rearrange_2d_M_groups(
        colwise_scales, offsets_n
    )
    return data_dim0, data_dim1_t, blocked_scales_dim0, blocked_scales_dim1
104+
105+
106+
def _bytes_touched(M: int, N: int) -> int:
107+
"""Bytes of HBM that MUST flow for this problem: one bf16 read of the
108+
input, two fp8 writes (row-major + transposed), and two e8m0 scale
109+
writes in blocked layout. This is the memcpy lower bound."""
110+
scale_cols_n = ceil_div(N // 32, 4) * 4
111+
scale_cols_m = ceil_div(M // 32, 4) * 4
112+
return (
113+
M * N * 2 # bf16 read
114+
+ M * N * 1 # fp8 dim0 write
115+
+ N * M * 1 # fp8 dim1_t write
116+
+ M * scale_cols_n * 1 # e8m0 dim0 scales (blocked)
117+
+ N * scale_cols_m * 1 # e8m0 dim1 scales (blocked)
118+
)
119+
120+
121+
def _memcpy_bf16_bw_gbps(M: int, N: int) -> float:
    """Measure sustained bf16 memcpy bandwidth (GB/s) for an (M, N) tensor:
    one full read plus one full write via ``Tensor.clone``. Measured
    in-process so the ceiling reflects the current machine rather than a
    datasheet constant."""
    src = torch.randn(M, N, dtype=torch.bfloat16, device=device)

    def do_copy():
        return src.clone()

    # Warm up before timing.
    for _ in range(5):
        do_copy()

    elapsed_us = benchmark_cuda_function_in_microseconds(do_copy)
    # clone reads and writes the full tensor -> 2 * (M * N * 2) bytes.
    total_bytes = 2 * (M * N * 2)
    return (total_bytes / 1e9) / (elapsed_us / 1e6)
136+
137+
138+
def run_experiment(
    config: ExperimentConfig, args: argparse.Namespace
) -> ExperimentResult:
    """Time the 4-kernel reference and the fused kernel on one (M, N) shape
    and derive bandwidth / speedup metrics.

    ``args`` is accepted for interface symmetry with the CLI entry point
    and is currently unused in the body.
    """
    M, N = config.M, config.N

    torch.manual_seed(42)
    grad = torch.randn(M, N, dtype=torch.bfloat16, device=device)

    def reference_path():
        return _four_kernel_reference(grad)

    def fused_path():
        return triton_mxfp8_quantize_dim0_dim1(grad, scaling_mode="rceil")

    # Warm up both paths (autotuning, allocator) before timing.
    for _ in range(5):
        reference_path()
        fused_path()

    reference_us = benchmark_cuda_function_in_microseconds(reference_path)
    fused_time_us = benchmark_cuda_function_in_microseconds(fused_path)

    problem_bytes = _bytes_touched(M, N)
    reference_gbps = (problem_bytes / 1e9) / (reference_us / 1e6)
    fused_bw_gbps = (problem_bytes / 1e9) / (fused_time_us / 1e6)

    # Express fused BW as a % of the bf16 memcpy ceiling at the same input
    # size — the "90%+ memcpy BW" landing metric.
    memcpy_gbps = _memcpy_bf16_bw_gbps(M, N)
    pct_of_memcpy = (
        100.0 * fused_bw_gbps / memcpy_gbps if memcpy_gbps > 0 else 0.0
    )

    return ExperimentResult(
        four_kernel_us=reference_us,
        fused_us=fused_time_us,
        four_kernel_gbps=reference_gbps,
        fused_gbps=fused_bw_gbps,
        speedup=reference_us / fused_time_us,
        memcpy_bw_pct=pct_of_memcpy,
    )
176+
177+
178+
def print_results(experiments: List[Experiment]):
    """Render the benchmark sweep as a table on stdout."""
    headers = [
        "M",
        "N",
        "4k_us",
        "fused_us",
        "4k_GB/s",
        "fused_GB/s",
        "speedup",
        "% memcpy BW",
    ]
    # One formatted row per experiment, in sweep order.
    rows = [
        [
            exp.config.M,
            exp.config.N,
            f"{exp.result.four_kernel_us:.1f}",
            f"{exp.result.fused_us:.1f}",
            f"{exp.result.four_kernel_gbps:.0f}",
            f"{exp.result.fused_gbps:.0f}",
            f"{exp.result.speedup:.2f}x",
            f"{exp.result.memcpy_bw_pct:.1f}%",
        ]
        for exp in experiments
    ]
    print(tabulate(rows, headers=headers))
204+
205+
206+
def main(args: argparse.Namespace):
    """Run every configured shape and print the results table."""
    torch.random.manual_seed(123)
    experiments = []
    for cfg in tqdm(get_configs()):
        measured = run_experiment(cfg, args)
        experiments.append(Experiment(config=cfg, result=measured))
    print_results(experiments)
214+
215+
216+
if __name__ == "__main__":
217+
parser = argparse.ArgumentParser()
218+
parser.add_argument(
219+
"--profile", action="store_true", help="Enable profiling with PyTorch profiler"
220+
)
221+
args = parser.parse_args()
222+
main(args)

test/prototype/moe_training/test_kernels.py

Lines changed: 90 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -51,6 +51,7 @@ def _is_sm_10x() -> bool:
5151
triton_mx_block_rearrange_per_group_3d,
5252
triton_mxfp8_dispatch_and_quantize,
5353
triton_mxfp8_pad_and_quantize,
54+
triton_mxfp8_quantize_dim0_dim1,
5455
)
5556
from torchao.prototype.moe_training.kernels.mxfp8.quant import (
5657
_mxfp8_cuda_kernels_available,
@@ -885,6 +886,95 @@ def test_triton_mxfp8_dispatch_and_quantize_numerics(
885886
)
886887

887888

889+
@pytest.mark.skipif(
    not _is_sm_10x(),
    reason="requires CUDA SM 10.x (blocked scale GEMM hw)",
)
@skip_if_rocm("ROCm enablement in progress")
@pytest.mark.parametrize(
    "M,N",
    [
        (128, 128),
        (256, 512),
        (1024, 2048),
        (4096, 2048),
        (8192, 2048),
        (16384, 2048),
        (2048, 5120),
    ],
)
@pytest.mark.parametrize("scaling_mode_str", ["rceil", "floor"])
def test_triton_mxfp8_quantize_dim0_dim1_numerics(
    M: int, N: int, scaling_mode_str: str
):
    """Fused dim0+dim1 MXFP8 quantization with blocked scales should be
    bit-exactly equivalent to the decoupled 4-kernel reference pipeline:

        qdata0, scales0_rm = triton_to_mxfp8_dim0(x, 32, mode)
        qdata1_t, scales1_rm = triton_to_mxfp8_dim1(x, 32, mode)
        scales0_blocked = triton_mx_block_rearrange_2d_M_groups(scales0_rm, [M])
        scales1_blocked = triton_mx_block_rearrange_2d_M_groups(scales1_rm, [N])
    """
    # Imported locally so collection does not require the mx_formats kernels.
    from torchao.prototype.mx_formats.kernels import (
        triton_to_mxfp8_dim0,
        triton_to_mxfp8_dim1,
    )

    device = "cuda"
    # Fixed seed so both the fused path and the reference see identical data.
    torch.manual_seed(2024)
    x = torch.randn(M, N, dtype=torch.bfloat16, device=device)

    # Fused kernel under test.
    qdata0_fused, qdata1_t_fused, scales0_fused, scales1_fused = (
        triton_mxfp8_quantize_dim0_dim1(x, scaling_mode=scaling_mode_str)
    )

    # Reference dim0 pipeline: quantize along N, then blocked-rearrange.
    qdata0_ref, scales0_ref_rm = triton_to_mxfp8_dim0(
        x, inner_block_size=32, scaling_mode=scaling_mode_str
    )
    one_group_offsets_m = torch.tensor([M], dtype=torch.int32, device=device)
    scales0_blocked_ref = triton_mx_block_rearrange_2d_M_groups(
        scales0_ref_rm, one_group_offsets_m
    )

    # Reference dim1 pipeline: quantize along M (returns transposed data),
    # then blocked-rearrange.
    qdata1_t_ref, scales1_ref_rm = triton_to_mxfp8_dim1(
        x, inner_block_size=32, scaling_mode=scaling_mode_str
    )
    one_group_offsets_n = torch.tensor([N], dtype=torch.int32, device=device)
    scales1_blocked_ref = triton_mx_block_rearrange_2d_M_groups(
        scales1_ref_rm, one_group_offsets_n
    )

    # qdata_dim0: (M, N) row-major e4m3 - bit-exact parity.
    # (uint8 view compares raw bits; fp8 == would treat NaN encodings as unequal)
    assert qdata0_fused.shape == (M, N)
    assert qdata0_fused.dtype == torch.float8_e4m3fn
    assert torch.equal(
        qdata0_fused.view(torch.uint8), qdata0_ref.view(torch.uint8)
    ), "fused dim0 fp8 data differs from triton_to_mxfp8_dim0 reference"

    # qdata_dim1_t: (N, M) row-major e4m3 - bit-exact parity vs.
    # transposed dim1 reference. ``triton_to_mxfp8_dim1`` returns its data
    # as a ``.t()`` view of an (N, M) row-major column-major-of-x tensor,
    # so calling ``.t()`` on the reference peels the view back to the raw
    # (N, M) row-major storage we produce.
    assert qdata1_t_fused.shape == (N, M)
    assert qdata1_t_fused.dtype == torch.float8_e4m3fn
    qdata1_t_ref_rowmajor = qdata1_t_ref.t().contiguous()
    assert torch.equal(
        qdata1_t_fused.view(torch.uint8),
        qdata1_t_ref_rowmajor.view(torch.uint8),
    ), "fused dim1-transpose fp8 data differs from triton_to_mxfp8_dim1 reference"

    # Blocked scales for dim0: compare via from_blocked canonical view
    # (128x4 blocks may have gap cells that are legitimately uninitialized).
    _assert_blocked_scales_equal(scales0_fused, scales0_blocked_ref, M, N // 32)
    # Blocked scales for dim1 live in an (N, M/32) logical tensor.
    _assert_blocked_scales_equal(scales1_fused, scales1_blocked_ref, N, M // 32)
976+
977+
888978
@pytest.mark.parametrize("round_scales_to_power_of_2", [True, False])
889979
@pytest.mark.parametrize(
890980
"m,k",

torchao/prototype/moe_training/kernels/mxfp8/__init__.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,3 +19,6 @@
1919
triton_mxfp8_dispatch_and_quantize, # noqa: F401
2020
triton_mxfp8_pad_and_quantize, # noqa: F401
2121
)
22+
from torchao.prototype.moe_training.kernels.mxfp8.triton_grad_quantize import (
23+
triton_mxfp8_quantize_dim0_dim1, # noqa: F401
24+
)

0 commit comments

Comments
 (0)