
Add blockwise FP8 roofline benchmark #4344

Open
iamzainhuda wants to merge 6 commits into main from fp8-blockwise-roofline-benchmark

Conversation

@iamzainhuda
Contributor

@iamzainhuda iamzainhuda commented Apr 27, 2026

Summary

Added an FP8 blockwise linear option to the roofline script. This lets us compare the performance of our FP8 blockwise linear layer against a BF16 linear layer across the DeepSeek V3 shapes.

Added:

  • A blockwise_fp8_training roofline mode selected via --mx_recipe_name=blockwise_fp8_training.
  • A DSV3 16B/671B FFN shape generator selected via --shape_gen_name=dsv3-16b-671b.
  • Default DSV3 training-shape parameters: M=seq_len=4096, dim=7168, inter_dim=18432.
  • Blockwise FP8 roofline estimates using roofline_utils hardware specs, including FP8 GEMM time and modeled quantization overhead.
  • Flags to select the blockwise GEMM backend and optional compilation:
    • --blockwise_use_triton=False uses the scaled_mm backend.
    • --blockwise_use_triton=True uses the custom Triton GEMM backend.
    • --blockwise_compile_benchmarks=True compiles the blockwise benchmark path.

Roofline comparison column added as:

  • b_fp8_e2e_spdp_pct_of_r: measured speedup / roofline speedup * 100.
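For clarity, the new column is just the ratio of achieved to predicted speedup. A minimal sketch of the computation (the helper name here is illustrative, not the actual code in float8_roofline.py):

```python
def pct_of_roofline(measured_speedup: float, roofline_speedup: float) -> float:
    """How much of the roofline-predicted speedup the measured run achieved."""
    return measured_speedup / roofline_speedup * 100.0

# Example using the dsv3.ffn.w1 non-compiled row below:
# measured 1.275x vs. roofline estimate 1.530x
print(f"{pct_of_roofline(1.275, 1.530):.2f}%")  # ~83.33%
```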

Results

  • GPU: NVIDIA H100 80GB HBM3
  • PyTorch: 2.13.0a0+gitd129991
  • torchao: 0.17.0+gitcf0b50ae1
  • Shape set: dsv3-16b-671b
  • Backend: blockwise_scaled_mm
  • Roofline BF16 time: 4.209 ms
  • Roofline FP8 GEMM time: 2.103 ms
  • Roofline FP8 quantization overhead: 0.647 ms
  • Roofline FP8 total time: 2.750 ms
  • Roofline speedup estimate: 1.530x
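As a sanity check, the FP8 total time and speedup estimate follow directly from the two component times above (a worked recomputation of the listed numbers, not the script's code):

```python
bf16_ms = 4.209       # roofline BF16 time
fp8_gemm_ms = 2.103   # roofline FP8 GEMM time
fp8_quant_ms = 0.647  # roofline FP8 quantization overhead

# FP8 total = GEMM time + modeled quantization overhead
fp8_total_ms = fp8_gemm_ms + fp8_quant_ms  # 2.750 ms
speedup = bf16_ms / fp8_total_ms           # ~1.53x

print(f"FP8 total: {fp8_total_ms:.3f} ms, speedup: {speedup:.2f}x")
```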

Non-compiled command:

python benchmarks/float8/float8_roofline.py \
  --outfile=/tmp/blockwise_dsv3_roofline_final_noncompiled.csv \
  --shape_gen_name=dsv3-16b-671b \
  --mx_recipe_name=blockwise_fp8_training \
  --do_benchmarks=True
| layer | shape (M,K,N) | BF16 ms | FP8 ms | speedup | speedup % of roofline |
|---|---|---|---|---|---|
| dsv3.ffn.w1 | (4096,7168,18432) | 3.943 | 3.091 | 1.275x | 83.34% |
| dsv3.ffn.w2 | (4096,18432,7168) | 3.955 | 3.091 | 1.280x | 83.61% |
| dsv3.ffn.w3 | (4096,7168,18432) | 3.976 | 3.096 | 1.284x | 83.93% |
| average | | 3.958 | 3.093 | 1.280x | 83.63% |

Compiled command:

python benchmarks/float8/float8_roofline.py \
  --outfile=/tmp/blockwise_dsv3_roofline_final_compiled.csv \
  --shape_gen_name=dsv3-16b-671b \
  --mx_recipe_name=blockwise_fp8_training \
  --blockwise_compile_benchmarks=True \
  --do_benchmarks=True
| layer | shape (M,K,N) | BF16 ms | FP8 ms | speedup | speedup % of roofline |
|---|---|---|---|---|---|
| dsv3.ffn.w1 | (4096,7168,18432) | 3.959 | 3.282 | 1.206x | 78.82% |
| dsv3.ffn.w2 | (4096,18432,7168) | 3.961 | 3.283 | 1.206x | 78.84% |
| dsv3.ffn.w3 | (4096,7168,18432) | 3.974 | 3.280 | 1.211x | 79.16% |
| average | | 3.964 | 3.282 | 1.208x | 78.94% |

Note: the compiled run emits a Dynamo warning for _scaled_mm_v2 on this script-only branch. The custom-op routing that removes that warning is part of an upcoming PR (will link when ready).

@pytorch-bot

pytorch-bot Bot commented Apr 27, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4344

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit d11f401 with merge base 9058b58:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 27, 2026
@danielvegamyhre danielvegamyhre self-requested a review April 27, 2026 19:29
@danielvegamyhre
Contributor

@iamzainhuda nice progress, i haven't looked at the code yet (lmk when it's ready for review) but please be sure to include the benchmark data for DSV3 671b shapes in the PR description when done

@iamzainhuda iamzainhuda marked this pull request as ready for review April 29, 2026 19:28
@iamzainhuda iamzainhuda added this to the FP8 Blockwise Training milestone Apr 29, 2026
@iamzainhuda
Contributor Author

> @iamzainhuda nice progress, i haven't looked at the code yet (lmk when it's ready for review) but please be sure to include the benchmark data for DSV3 671b shapes in the PR description when done

added the benchmark results for both compiled and non-compiled runs with the scaled_mm backend. i have another PR with a kernel optimization that should improve the speedup a bit

@danielvegamyhre
Contributor

@iamzainhuda please fix the ruff/linter issue

@iamzainhuda iamzainhuda added the module: training quantize_ api training flow label May 1, 2026
@iamzainhuda
Contributor Author

> @iamzainhuda please fix the ruff/linter issue

yup fixed
