Refactor use_triton_kernel to use nvfp4_quantize_kernel_choice #3911

jerryzh168 wants to merge 16 commits into gh/jerryzh168/42/base

Conversation
Summary: This is to prepare for the addition of the flashinfer quantize kernel path in the next PR. Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3911
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 9d61f60 with merge base 15df843.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…rence`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
…rence`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
…rence`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
…rence`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
…rence`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
…rence`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
Title changed from "use_triton_kernel to use quantize_kernel_preference" to "use_triton_kernel to use nvfp4_quantize_kernel_choice"
…_choice`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
    leading_dims, M, K = data_hp.shape[:-2], data_hp.shape[-2], data_hp.shape[-1]
    - if use_triton_kernel:
    + if nvfp4_quantize_kernel_choice == NVFP4QuantizeKernelChoice.TRITON:
this logic is choosing a single kernel, can we just have one if statement instead of adding an intermediate variable kernel_choice?
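A minimal sketch of the single if/elif shape this comment is asking for. The members TRITON and FLASHINFER and the function flashinfer_nvfp4_quantize come from this PR's diff; TORCH, the string values, and the triton_quantize_nvfp4 / torch_quantize_nvfp4 names are placeholders, not the PR's actual helpers:

```python
from enum import Enum

class NVFP4QuantizeKernelChoice(str, Enum):
    # TRITON and FLASHINFER appear in the diff; TORCH and the string
    # values are assumptions made for this sketch.
    TORCH = "torch"
    TRITON = "triton"
    FLASHINFER = "flashinfer"

def nvfp4_quantize(data_hp, per_tensor_scale, kernel_choice):
    # Dispatch on the enum in one if/elif chain, with no intermediate
    # string variable in between.
    if kernel_choice == NVFP4QuantizeKernelChoice.TRITON:
        return triton_quantize_nvfp4(data_hp, per_tensor_scale)  # placeholder name
    elif kernel_choice == NVFP4QuantizeKernelChoice.FLASHINFER:
        # flashinfer's global scale is the reciprocal of per_tensor_scale
        # (see the scale discussion below); its argument list is assumed here.
        return flashinfer_nvfp4_quantize(data_hp, 1.0 / per_tensor_scale)
    else:
        return torch_quantize_nvfp4(data_hp, per_tensor_scale)  # placeholder name
```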
    - if use_triton_kernel:
    + if nvfp4_quantize_kernel_choice == NVFP4QuantizeKernelChoice.TRITON:
    +     kernel_choice = "triton"
    + elif nvfp4_quantize_kernel_choice == NVFP4QuantizeKernelChoice.FLASHINFER:
is this PR adding flashinfer or is that the next PR?
ah sorry, this should be in the next PR
    # flashinfer uses global_sf = (F8E4M3_MAX * F4_E2M1_MAX) / amax
    # which is 1 / per_tensor_scale
    global_sf = 1.0 / per_tensor_scale
    data_lp, blockwise_scales = flashinfer_nvfp4_quantize(
are data_lp and blockwise_scales bitwise equivalent to the torch and triton paths?
these are not bitwise equivalent I think, tested in next PR
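For context on the reciprocal relationship this thread discusses, a small worked sketch; the constants follow from the fp8 e4m3 and fp4 e2m1 formats, and the per_tensor_scale convention is the one implied by the diff comment above:

```python
import math
import torch

F8E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0
F4_E2M1_MAX = 6.0  # largest magnitude representable in fp4 e2m1

amax = 12.0  # example: absolute max of the tensor being quantized

# torchao convention implied by the diff comment:
per_tensor_scale = amax / (F8E4M3_MAX * F4_E2M1_MAX)
# flashinfer convention, per the diff comment:
global_sf = (F8E4M3_MAX * F4_E2M1_MAX) / amax

# the two conventions are reciprocals, hence `global_sf = 1.0 / per_tensor_scale`
assert math.isclose(global_sf, 1.0 / per_tensor_scale)
```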
    class NVFP4QuantizeKernelChoice(str, Enum):
        """Enum for specifying the kernel used for NVFP4 quantization."""
nit: make this more specific to explain what exactly this kernel is doing, "nvfp4 quantization" is correct but ambiguous
sg. btw, I saw "block_size: Block size for quantization (must be 16)", is this true? why do we make this an argument if it has to be fixed?
It's just the specification: https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference
yeah confirmed with Vasiliy that we can remove this arg in the future
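One way the docstring could be made more specific, as the nit asks; the wording here is a suggestion based on the NVFP4 spec linked in the reply, not text from the PR, and the member values are assumptions:

```python
from enum import Enum

class NVFP4QuantizeKernelChoice(str, Enum):
    """Which kernel implementation computes the packed fp4 (e2m1) data and
    the blockwise fp8 (e4m3) scales when quantizing a high precision tensor
    to NVFP4.

    Per the NVFP4 spec, the scaling block size is fixed at 16, so this enum
    only selects an implementation; it does not change the output format.
    """

    TRITON = "triton"
    FLASHINFER = "flashinfer"
```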
…_choice`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
    aten = torch.ops.aten

    class NVFP4QuantizeKernelChoice(str, Enum):
put in torchao/prototype/mx_formats/constants.py (or a similar file if already exists)
    orig_dtype (torch.dtype): Original tensor dtype before quantization
    is_swizzled_scales (bool): Whether scales are stored in swizzled (blocked) format
    - use_triton_kernel (bool): Whether to use triton kernels
    + nvfp4_quantize_kernel_choice (NVFP4QuantizeKernelChoice): Kernel preference for quantization
quantize_kernel_choice? can be consistent for similar functionality in other workflows
or, quantize_to_nvfp4_kernel_choice? clearer name
OK will change to quantize_to_nvfp4_kernel_choice for now
can rename to quantize_kernel_choice later when there are similar cases I think
    use_triton_kernel: Optional[bool] = None

    def __post_init__(self):
        self.nvfp4_quantize_kernel_choice = _handle_use_triton_kernel(
I think this should throw an exception if the user specified use_triton_kernel=True and anything other than kernel preference triton, and do nothing else. Setting self.use_triton_kernel to None is confusing here, let's just enforce everything is consistent and keep it.
this would mean BC-breaking for current callsites; I was planning to not break BC in this PR, then refactor all OSS and internal callsites, and then break BC, does that sound OK?
the BC should not break as long as use_triton_kernel and nvfp4_quantize_kernel_choice both default to using the triton kernel
> I think this should throw an exception if the user specified use_triton_kernel=True and anything other than kernel preference triton, and do nothing else.

should this ignore the case of use_triton_kernel=False and kernel_choice triton?

> the BC should not break as long as use_triton_kernel and nvfp4_quantize_kernel_choice both default to using the triton kernel

what about user setting use_triton_kernel=False?
you can throw an exception if the two vars do not match
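A sketch of the consistency check agreed on in this thread. _handle_use_triton_kernel is named in the diff above, but its real signature is not shown on this page, so the one below is an assumption; the enum is repeated so the snippet is self-contained:

```python
from enum import Enum
from typing import Optional

class NVFP4QuantizeKernelChoice(str, Enum):  # as in the PR's diff
    TRITON = "triton"
    FLASHINFER = "flashinfer"

def _handle_use_triton_kernel(
    use_triton_kernel: Optional[bool],
    kernel_choice: NVFP4QuantizeKernelChoice,
) -> NVFP4QuantizeKernelChoice:
    # Deprecated flag not passed: the new enum is the single source of truth.
    if use_triton_kernel is None:
        return kernel_choice
    # Flag passed: throw if the two vars do not match, per the suggestion above.
    wants_triton = kernel_choice == NVFP4QuantizeKernelChoice.TRITON
    if use_triton_kernel != wants_triton:
        raise ValueError(
            f"use_triton_kernel={use_triton_kernel} conflicts with "
            f"nvfp4_quantize_kernel_choice={kernel_choice!r}"
        )
    return kernel_choice
```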
…_choice`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
…_choice`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
…_choice`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
…_choice`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
…_choice`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
…_choice`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
…_choice`" Summary: This is to prefer the addition of flashinfer quantize kernel path in next PR Test Plan: python test/prototype/mx_formats/test_inference_workflow.py Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
Seems MSLK kernels can give us similar performance as the flashinfer kernel, so this is no longer needed.
Stack from ghstack (oldest at bottom):
use_triton_kernel to use nvfp4_quantize_kernel_choice #3911

Summary:
This is to prepare for the addition of the flashinfer quantize kernel path in the next PR.
- use_triton_kernel==True --> QuantizeToNVFP4KernelChoice.MSLK
- use_triton_kernel==False --> QuantizeToNVFP4KernelChoice.TRITON

Note: this breaks BC for the users of the prototype API:
- for configs whose default is use_triton_kernel = True (e.g. NVFP4DynamicActivationNVFP4WeightConfig), an error will be thrown when the flag is set to False
- for configs whose default is use_triton_kernel = False (e.g. NVFP4FakeQuantizeConfig), an error will be thrown when the flag is set to True

we'll make these changes internally later
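To make the BC note above concrete, a hypothetical before/after; the config name is from this description, but the import path and the exception type are assumptions:

```python
# import path assumed for this sketch
from torchao.prototype.mx_formats import NVFP4DynamicActivationNVFP4WeightConfig

# default is use_triton_kernel=True for this config, so this keeps working:
config = NVFP4DynamicActivationNVFP4WeightConfig()

# explicitly setting the deprecated flag to a conflicting value now throws
# (exception type assumed):
try:
    config = NVFP4DynamicActivationNVFP4WeightConfig(use_triton_kernel=False)
except ValueError:
    print("conflicting use_triton_kernel flag rejected")
```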
Test Plan:
python test/prototype/mx_formats/test_inference_workflow.py
Reviewers:
Subscribers:
Tasks:
Tags: