
Implements rounding mode for NVFP4 tensor#3384

Open
syed-ahmed wants to merge 6 commits into pytorch:main from syed-ahmed:rounding-mode

Conversation

@syed-ahmed
Contributor

@syed-ahmed syed-ahmed commented Nov 24, 2025

Closes #3264.
Code Assistant Used: Claude Code Opus 4.6

Summary

Implements RS (stochastic rounding) support for NVFP4 as described in the RFC. RN remains the default.

Key implementation details

Triton kernel path: Uses hardware cvt.rs.satfinite.e2m1x4.f32 PTX inline asm. The RS instruction converts 4 floats and 1 random uint32 into 2 packed FP4 bytes, while the RN instruction converts 2 floats into 1 byte. Since the RN path uses pack=4 (8 floats per invocation), the RS path issues two cvt.rs calls to match. With pack=4, 4 rbits values are loaded but only 2 are consumed per invocation — the other 2 are wasted to keep the output layout identical to RN.
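The rounding semantics of the RS instruction can be illustrated with a hedged pure-Python sketch of stochastic rounding onto the E2M1 value grid (the function name is made up for illustration; this models the math only, not the PTX packing or the kernel's RNG):

```python
import bisect

# Positive values representable in FP4 E2M1 (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def stochastic_round_fp4(x: float, rand_u32: int) -> float:
    """Round x onto the E2M1 grid, rounding up with probability equal to
    the fractional distance past the lower grid point."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 6.0)  # satfinite: clamp to the largest finite value
    hi_idx = bisect.bisect_left(E2M1_GRID, mag)
    if E2M1_GRID[hi_idx] == mag:
        return sign * mag  # exactly representable, no rounding needed
    lo, hi = E2M1_GRID[hi_idx - 1], E2M1_GRID[hi_idx]
    frac = (mag - lo) / (hi - lo)
    # one uniform 32-bit random word decides the rounding direction
    round_up = (rand_u32 / 2**32) < frac
    return sign * (hi if round_up else lo)
```

In the hardware path, one such random word is consumed per cvt.rs group of 4 floats, which is why pack=4 loads 4 rbits values but only consumes 2.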

Seed determinism: Triton path takes an explicit seed parameter for tl.randint; PyTorch path uses torch.manual_seed. Same seed produces bitwise-identical results within each path. The two paths use different RNGs so they diverge for RS even with the same seed.

Validation: Both paths validate rounding_mode using not in RoundingMode, raising ValueError for invalid values.

RoundingMode enum uses int values (RN=0, RS=1) rather than string values from the RFC, since the Triton kernel needs an integer tl.constexpr.
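As a sketch, the integer-valued enum and the membership-style validation described above might look like this (values taken from this summary; they are illustrative, and a later commit in this thread switches to enum.auto()):

```python
from enum import IntEnum

class RoundingMode(IntEnum):
    RN = 0  # round-to-nearest-even (default)
    RS = 1  # stochastic rounding

def validate(rounding_mode: RoundingMode) -> None:
    # members can be checked against the enum directly
    if rounding_mode not in RoundingMode:
        raise ValueError(f"invalid rounding_mode: {rounding_mode!r}")

# an int value can be passed straight to a Triton tl.constexpr,
# which a string-valued enum could not
validate(RoundingMode.RS)
```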

Tests

Single parametrized test_f4_rounding in test_kernels.py covers rounding mode (RN/RS/invalid), kernel (PyTorch/Triton), seed determinism (same/different seeds), and value axes. Verifies RN is biased to nearest, RS is unbiased in expectation, same seed is deterministic, different seeds diverge, and invalid mode raises.
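The unbiasedness property being checked can be sketched with stdlib Python standing in for the kernel RNG (helper names are illustrative, not the test's actual code):

```python
import random

def stochastic_round(x, lo, hi, rng):
    """Round x in [lo, hi] to one endpoint, up with probability (x-lo)/(hi-lo)."""
    return hi if rng.random() < (x - lo) / (hi - lo) else lo

rng = random.Random(0)  # fixed seed -> deterministic run
x, lo, hi = 1.1, 1.0, 1.5
samples = [stochastic_round(x, lo, hi, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
# RS is unbiased in expectation: the sample mean converges to x,
# whereas round-to-nearest would always return 1.0 here (biased)
assert abs(mean - x) < 0.01
```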

Existing test_nvfp4_tensor.py tests are parametrized over rounding_mode and _triton_kernel_params. Triton-vs-PyTorch equivalence uses SQNR threshold of 40 for RN and 8 for RS (different RNGs). End-to-end matmul uses SQNR threshold of 16 for RN and 6 for RS.
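For reference, SQNR here is the signal-to-quantization-noise ratio in dB; a minimal stdlib sketch of the metric behind these thresholds (not the torchao helper itself):

```python
import math

def sqnr_db(reference, quantized):
    """10*log10(signal power / quantization-noise power), in dB."""
    signal = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    return 10 * math.log10(signal / noise)

ref = [1.0, 2.0, 3.0, 4.0]
quant = [1.0, 2.1, 2.9, 4.0]  # small perturbation -> finite SQNR (~31.8 dB)
assert sqnr_db(ref, quant) > 20
```

The lower RS thresholds reflect that the two paths use different RNGs, so their outputs differ sample-by-sample even when both are correct.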

Test Environment

Collecting environment information...                                                                                     
PyTorch version: 2.12.0a0+gitbabda95                                                                                      
Is debug build: False                                                                                                     
CUDA used to build PyTorch: 13.2                                                                                          
ROCM used to build PyTorch: N/A                                                                                           
                                                                                                                          
OS: Ubuntu 24.04.4 LTS (aarch64)                                                                                          
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0                                                                      
Clang version: Could not collect                                                                                          
CMake version: version 3.31.6                                                                                             
Libc version: glibc-2.39                                                                                                  
                                                                                                                          
Python version: 3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0] (64-bit runtime)                                        
Python platform: Linux-6.14.0-1008-nvidia-64k-aarch64-with-glibc2.39                                                      
Is CUDA available: True                                                                                                   
CUDA runtime version: 13.2.50                                                                                             
CUDA_MODULE_LOADING set to: LAZY                                                                                          
GPU models and configuration:                                                                                             
GPU 0: NVIDIA GB200                                                                                                       
GPU 1: NVIDIA GB200                                                                                                       
GPU 2: NVIDIA GB200                                                                                                       
GPU 3: NVIDIA GB200

Nvidia driver version: 580.105.08                                                                                                                                                                                                                                                          

Versions of relevant libraries:                                                                                     
[pip3] mypy==1.16.0                                       
[pip3] mypy_extensions==1.1.0                             
[pip3] numpy==1.26.4                                      
[pip3] nvidia-cudnn-frontend==1.18.0                                                                                
[pip3] onnx==1.20.0                                       
[pip3] onnx-ir==0.1.16                                    
[pip3] onnxscript==0.6.2                                  
[pip3] optree==0.13.0                                     
[pip3] pytorch-lightning==2.6.1                                                                                     
[pip3] torch==2.12.0a0+gitbabda95                                                                                   
[pip3] torchmetrics==1.8.2                                
[pip3] triton==3.6.0+git9844da95                                                                                    
[conda] Could not collect

@pytorch-bot

pytorch-bot bot commented Nov 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3384

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@syed-ahmed syed-ahmed marked this pull request as draft November 24, 2025 18:03
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 24, 2025
@syed-ahmed syed-ahmed mentioned this pull request Feb 11, 2026
59 tasks
@syed-ahmed syed-ahmed marked this pull request as ready for review March 4, 2026 03:39
@syed-ahmed syed-ahmed changed the title [WIP] Implements rounding mode for NVFP4 tensor Implements rounding mode for NVFP4 tensor Mar 4, 2026
@syed-ahmed syed-ahmed moved this to In Progress in PyTorch + CUDA Mar 4, 2026
Contributor

@drisspg drisspg left a comment


We should have the lower-level ops all be functional, e.g. take in a seed analogous to what is done for the triton kernel, but the same for _f32_to_floatx_unpacked

And then for how users use this e.g.: _seed = torch.randint(2**31, (1,)).item()

in the nvfp4 tensors I am not entirely sure I like this..

@syed-ahmed
Contributor Author

syed-ahmed commented Mar 4, 2026

I think that should be fine, and I can make the seed handling more functional. As a side note, we may need to review TE's rng_state mechanism a bit and make sure we have full control over the determinism knobs.

@drisspg
Contributor

drisspg commented Mar 5, 2026

TBH this PR has sparked some convo on how we expect users to handle RNG for RS mode.

I am not sure I personally have an informed enough opinion on the difficulties and nuance here.

maybe @vkuzo has some better ideas here

My initial thought is the lower level is functional. And then the user sets a seed (philox state); ensure that we properly increment the philox offset before every invocation of RS, which will pull from there to generate the source of randomness. Also ensure that CUDA graphs work here. I bet inductor will not behave well here, though.
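One way to read the "caller owns the philox state" idea is the following hypothetical sketch (class and method names are made up; this is not a PyTorch API): the caller holds (seed, offset) and bumps the offset before every RS invocation so random streams never overlap.

```python
class PhiloxState:
    """Hypothetical caller-owned counter-based RNG state (seed + offset)."""

    def __init__(self, seed: int):
        self.seed = seed
        self.offset = 0

    def next_offset(self, n_random_words: int) -> int:
        """Reserve n_random_words of the counter stream; return the base
        offset the kernel should start drawing from."""
        base = self.offset
        self.offset += n_random_words
        return base

state = PhiloxState(seed=42)
first = state.next_offset(1024)   # one quantize call needing 1024 words
second = state.next_offset(1024)  # the next call gets a disjoint range
assert (first, second) == (0, 1024)
```

Because the state is explicit data rather than hidden global state, replaying the same (seed, offset) sequence reproduces results, which is also what CUDA graph capture needs.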

@jbschlosser just to throw some more ideas into the ring

@jbschlosser

I'm currently looking into the API design for a new set of "stateless RNG" APIs (i.e. pseudorandom number generation without PyTorch maintaining global PRNG state). I want to make sure this composes well with RS mode as implemented in this PR and, more generally, with any implementation of stochastic rounding throughout torch / torchao.

My initial thought is the lower level is functional. And then the user sets a seed (philox state); ensure that we properly increment the philox offset before every invocation of RS, which will pull from there to generate the source of randomness.

so I generally agree with Driss's statement here. I think I'd want to see the lower level ops accept a philox seed / offset and that should satisfy composability with a new stateless RNG API.

@syed-ahmed
Contributor Author

Thanks @drisspg and @jbschlosser for the input! I'll refactor the ops to be functional and only take a seed, and remove the .item(). For the eager path, should _f32_to_floatx_unpacked take a seed (and so consume the seed in a torch.Generator, with that Generator passed to rand_ints), or should we just have a rand_int tensor as input?

@syed-ahmed
Contributor Author

@pytorchbot label module: training

@pytorch-bot

pytorch-bot bot commented Mar 11, 2026

Didn't find following labels among repository labels: module:,training

@syed-ahmed
Contributor Author

@pytorchbot label training

@pytorch-bot

pytorch-bot bot commented Mar 11, 2026

Didn't find following labels among repository labels: training

@danielvegamyhre danielvegamyhre added the module: training quantize_ api training flow label Mar 11, 2026
@syed-ahmed syed-ahmed requested a review from drisspg March 11, 2026 04:14
@syed-ahmed
Contributor Author

syed-ahmed commented Mar 11, 2026

Ok, addressed the comments.

  • The .item() call is removed. The Triton kernel now loads a seed.
  • The eager path now expects a rand_bits tensor, keeping generator state outside.
  • to_nvfp4 expects the caller to pass rand_bits.
  • Added a CUDA graph composability test for both eager and torch.compile.
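The "generator state stays outside" eager pattern can be sketched like this (function and variable names are illustrative, not the torchao API): the caller draws rand_bits from its own generator and passes them in, so the quantize function itself is purely functional.

```python
import random

def quantize_rs(values, rand_bits):
    """Toy stand-in for a quantize op: round each value in [0, 1) to 0 or 1
    stochastically, using caller-supplied random 32-bit words (one per element)."""
    assert len(rand_bits) == len(values)
    return [1.0 if r / 2**32 < v else 0.0 for v, r in zip(values, rand_bits)]

gen = random.Random(1234)                       # caller-owned generator
bits = [gen.getrandbits(32) for _ in range(4)]  # drawn outside the op
out1 = quantize_rs([0.1, 0.5, 0.9, 0.3], bits)
out2 = quantize_rs([0.1, 0.5, 0.9, 0.3], bits)  # same bits -> identical output
assert out1 == out2
```

Since the op consumes no hidden RNG state, replaying it with the same inputs is bitwise reproducible, which is what makes CUDA graph capture straightforward.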

Comment thread torchao/prototype/custom_fp_utils.py Outdated
Contributor


purely cosmetic, but auto() can be used here, though it starts at 1, not 0 -- https://docs.python.org/3/library/enum.html#enum.auto
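For illustration, auto() does indeed start numbering at 1, which is exactly the pitfall a later commit in this thread fixes (a `ROUNDING_MODE == 0` check that never matched):

```python
from enum import IntEnum, auto

class Mode(IntEnum):
    RN = auto()  # == 1, not 0
    RS = auto()  # == 2

assert (Mode.RN.value, Mode.RS.value) == (1, 2)
```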



def f32_to_f4_unpacked(x):
def f32_to_f4_unpacked(x, rounding_mode=RoundingMode.RN, rand_bits=None):
Contributor


i feel a type annotation is good to have for this method

def f32_to_f4_unpacked(x: torch.Tensor, rounding_mode: RoundingMode = RoundingMode.RN, rand_bits: torch.Tensor | None = None):

Comment thread torchao/prototype/mx_formats/kernels.py Outdated
assert data_hp.is_contiguous(), "Only support contiguous data for now"
assert block_size == 16, "NVFP4 requires block_size=16"
if rounding_mode == RoundingMode.RS:
assert rand_bits is not None and rand_bits.numel() > 1, (
Contributor


this numel check is weird, it should just do a direct shape compare, right?

Contributor


hmm, I guess it needs to have two different forms: one for triton and one for the fallback?

Contributor Author


Yes. For triton, I only need a fresh seed, which tells the philox in triton where to start from, and the offsets ensure getting the correct random numbers. For the fallback, we generate all the randoms outside the quantize.

Comment on lines +606 to +608
offs_m = pid_m * 128 + tl.arange(0, 128)[:, None]
offs_n = pid_n * 32 + tl.arange(0, 32)[None, :]
out_offs = offs_m * (N // 2) + offs_n

@jbschlosser jbschlosser Mar 26, 2026


I'm worried about how these offsets are calculated wrt randomness below during stochastic rounding application. Passing a seed only to this kernel does not seem like enough to avoid undesirable RNG reuse. I think at the very least, we'd want to accept a base offset to start from. And even more ideally from a control perspective, we'd allow for a full set of (seed, offset) pairs controlling every single application of rounding (but I realize there will likely be perf issues with this).

Contributor Author


Thanks @jbschlosser for the comment.

Would it be better to write a fused kernel for now, similar to MSLK: https://github.com/meta-pytorch/MSLK/blob/main/mslk/quantize/triton/fp4_quantize.py#L723 when composing with random hadamard transform (#4040)?

It seems like there is some discussion on how to support stochastic rounding better here: pytorch/pytorch#175409, so I'm not sure anymore whether the approach in this PR makes sense for UX and performance.

drisspg and others added 3 commits April 17, 2026 14:25
Co-authored-by: Masaki <mkozuki@nvidia.com>
Co-authored-by: Masaki <mkozuki@nvidia.com>
- Remove incorrect assert per_tensor_scale is not None: both mslk_quantize_nvfp4
  and triton_quantize_nvfp4 accept None; the assert also blocked dynamo fullgraph=True
- Gate MSLK path on rounding_mode == RoundingMode.RN: MSLK ignores rand_bits/rounding_mode,
  so RS requests must route to triton_quantize_nvfp4 instead
- Use torch.ops.ao.triton_quantize_nvfp4 instead of bare name: dynamo cannot resolve
  names defined inside conditional scopes; the op registry handle works correctly
- Fix ROUNDING_MODE == 0 -> == 1 in kernels: RoundingMode.RN.value == 1 (auto() starts
  at 1), so == 0 always routed to the stochastic branch; fixed in both production kernel
  and test wrapper

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. module: training quantize_ api training flow

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[RFC] NVFP4 Rounding Modes

6 participants