
Add blockwise FP8 roofline benchmark #4344

Open
iamzainhuda wants to merge 6 commits into main from fp8-blockwise-roofline-benchmark

Conversation

@iamzainhuda
Contributor

@iamzainhuda iamzainhuda commented Apr 27, 2026

Summary

Added an FP8 blockwise linear option to the roofline script. This lets us compare the performance of our FP8 blockwise linear layer against a BF16 linear layer across the DeepSeek V3 shapes.

Added:

  • A blockwise_fp8_training roofline mode selected via --mx_recipe_name=blockwise_fp8_training.
  • A DSV3 16B/671B FFN shape generator selected via --shape_gen_name=dsv3-16b-671b.
  • Default DSV3 training-shape parameters: M=seq_len=4096, dim=7168, inter_dim=18432.
  • Blockwise FP8 roofline estimates using roofline_utils hardware specs, including FP8 GEMM time and modeled quantization overhead.
  • Flags to select the blockwise GEMM backend and optional compilation:
    • --blockwise_use_triton=False uses the scaled_mm backend.
    • --blockwise_use_triton=True uses the custom Triton GEMM backend.
    • --blockwise_compile_benchmarks=True compiles the blockwise benchmark path.

Roofline comparison column added as:

  • b_fp8_e2e_spdp_pct_of_r: measured speedup / roofline speedup * 100.
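For clarity, the new column is just the ratio of achieved to predicted speedup. A minimal sketch of the computation (the helper name here is illustrative, not the actual code in float8_roofline.py):

```python
def pct_of_roofline(measured_speedup: float, roofline_speedup: float) -> float:
    """How much of the roofline-predicted speedup the measured run achieved."""
    return measured_speedup / roofline_speedup * 100.0

# Example using the dsv3.ffn.w1 non-compiled row below:
# measured 1.275x vs. roofline estimate 1.530x
print(f"{pct_of_roofline(1.275, 1.530):.2f}%")  # ~83.33%
```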

Results

  • GPU: NVIDIA H100 80GB HBM3
  • PyTorch: 2.13.0a0+gitd129991
  • torchao: 0.17.0+gitcf0b50ae1
  • Shape set: dsv3-16b-671b
  • Backend: blockwise_scaled_mm
  • Roofline BF16 time: 4.209 ms
  • Roofline FP8 GEMM time: 2.103 ms
  • Roofline FP8 quantization overhead: 0.647 ms
  • Roofline FP8 total time: 2.750 ms
  • Roofline speedup estimate: 1.530x
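As a sanity check, the FP8 total time and speedup estimate follow directly from the two component times above (a worked recomputation of the listed numbers, not the script's code):

```python
bf16_ms = 4.209       # roofline BF16 time
fp8_gemm_ms = 2.103   # roofline FP8 GEMM time
fp8_quant_ms = 0.647  # roofline FP8 quantization overhead

# FP8 total = GEMM time + modeled quantization overhead
fp8_total_ms = fp8_gemm_ms + fp8_quant_ms  # 2.750 ms
speedup = bf16_ms / fp8_total_ms           # ~1.53x

print(f"FP8 total: {fp8_total_ms:.3f} ms, speedup: {speedup:.2f}x")
```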

Non-compiled command:

python benchmarks/float8/float8_roofline.py \
  --outfile=/tmp/blockwise_dsv3_roofline_final_noncompiled.csv \
  --shape_gen_name=dsv3-16b-671b \
  --mx_recipe_name=blockwise_fp8_training \
  --do_benchmarks=True
| layer | shape (M,K,N) | BF16 ms | FP8 ms | speedup | speedup % of roofline |
|---|---|---|---|---|---|
| dsv3.ffn.w1 | (4096,7168,18432) | 3.943 | 3.091 | 1.275x | 83.34% |
| dsv3.ffn.w2 | (4096,18432,7168) | 3.955 | 3.091 | 1.280x | 83.61% |
| dsv3.ffn.w3 | (4096,7168,18432) | 3.976 | 3.096 | 1.284x | 83.93% |
| average | | 3.958 | 3.093 | 1.280x | 83.63% |

Compiled command:

python benchmarks/float8/float8_roofline.py \
  --outfile=/tmp/blockwise_dsv3_roofline_final_compiled.csv \
  --shape_gen_name=dsv3-16b-671b \
  --mx_recipe_name=blockwise_fp8_training \
  --blockwise_compile_benchmarks=True \
  --do_benchmarks=True
| layer | shape (M,K,N) | BF16 ms | FP8 ms | speedup | speedup % of roofline |
|---|---|---|---|---|---|
| dsv3.ffn.w1 | (4096,7168,18432) | 3.959 | 3.282 | 1.206x | 78.82% |
| dsv3.ffn.w2 | (4096,18432,7168) | 3.961 | 3.283 | 1.206x | 78.84% |
| dsv3.ffn.w3 | (4096,7168,18432) | 3.974 | 3.280 | 1.211x | 79.16% |
| average | | 3.964 | 3.282 | 1.208x | 78.94% |

Note: the compiled run emits a Dynamo warning for _scaled_mm_v2 on this script-only branch. The custom-op routing that removes that warning is part of an upcoming PR (will link when ready).

@pytorch-bot

pytorch-bot Bot commented Apr 27, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4344

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit d11f401 with merge base 9058b58:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 27, 2026
@danielvegamyhre danielvegamyhre self-requested a review April 27, 2026 19:29
@danielvegamyhre
Contributor

@iamzainhuda nice progress, i haven't looked at the code yet (lmk when it's ready for review) but please be sure to include the benchmark data for DSV3 671b shapes in the PR description when done

@iamzainhuda iamzainhuda marked this pull request as ready for review April 29, 2026 19:28
@iamzainhuda iamzainhuda added this to the FP8 Blockwise Training milestone Apr 29, 2026
@iamzainhuda
Contributor Author

> @iamzainhuda nice progress, i haven't looked at the code yet (lmk when it's ready for review) but please be sure to include the benchmark data for DSV3 671b shapes in the PR description when done

added the benchmark results for both compiled and non-compiled runs with the scaled_mm backend. i have another PR with a kernel optimization that should improve the speedup a bit

@danielvegamyhre
Contributor

@iamzainhuda please fix the ruff/linter issue

@iamzainhuda iamzainhuda added the module: training quantize_ api training flow label May 1, 2026
@iamzainhuda
Contributor Author

> @iamzainhuda please fix the ruff/linter issue

yup fixed
