
Refactor use_triton_kernel to use nvfp4_quantize_kernel_choice #3911

Closed

jerryzh168 wants to merge 16 commits into gh/jerryzh168/42/base from gh/jerryzh168/42/head

Conversation

@jerryzh168 (Contributor) commented Feb 17, 2026

Stack from ghstack (oldest at bottom):

Summary:
This is to prepare for the addition of the flashinfer quantize kernel path in the next PR.

use_triton_kernel == True --> NVFP4QuantizeKernelChoice.TRITON
use_triton_kernel == False --> NVFP4QuantizeKernelChoice.MSLK

Note: this breaks BC for users of the prototype API:

- for configs whose default is use_triton_kernel = True (e.g. NVFP4DynamicActivationNVFP4WeightConfig), an error will be thrown when the flag is set to False
- for configs whose default is use_triton_kernel = False (e.g. NVFP4FakeQuantizeConfig), an error will be thrown when the flag is set to True

We'll make these changes internally later.
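
For illustration, a hypothetical before/after call. Both names appear in this PR, but the constructor signature and import path below are assumptions, not the actual API:

```python
# Sketch only: the import path and constructor signature are assumptions;
# NVFP4DynamicActivationNVFP4WeightConfig and NVFP4QuantizeKernelChoice are
# names taken from this PR.
from torchao.prototype.mx_formats.inference_workflow import (
    NVFP4DynamicActivationNVFP4WeightConfig,
    NVFP4QuantizeKernelChoice,
)

# before this PR:
config = NVFP4DynamicActivationNVFP4WeightConfig(use_triton_kernel=True)

# after this PR:
config = NVFP4DynamicActivationNVFP4WeightConfig(
    nvfp4_quantize_kernel_choice=NVFP4QuantizeKernelChoice.TRITON,
)
```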

Test Plan:
python test/prototype/mx_formats/test_inference_workflow.py


@pytorch-bot Bot commented Feb 17, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3911

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9d61f60 with merge base 15df843:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the "CLA Signed" label Feb 17, 2026
@jerryzh168 added the "module: not user facing" label Feb 17, 2026
@jerryzh168 changed the title from "Refactor use_triton_kernel to use quantize_kernel_preference" to "Refactor use_triton_kernel to use nvfp4_quantize_kernel_choice" Feb 18, 2026
```diff
 leading_dims, M, K = data_hp.shape[:-2], data_hp.shape[-2], data_hp.shape[-1]

-if use_triton_kernel:
+if nvfp4_quantize_kernel_choice == NVFP4QuantizeKernelChoice.TRITON:
```
Contributor commented:

this logic is choosing a single kernel, can we just have one if statement instead of adding an intermediate variable kernel_choice?
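
A minimal sketch of the shape this suggestion implies, dispatching directly on the enum; the quantize helper names here are placeholders, not functions from this PR:

```python
# Sketch of the reviewer's suggestion: branch once on the enum instead of
# lowering it to an intermediate kernel_choice string first. The helpers
# _triton_quantize / _flashinfer_quantize / _torch_quantize are placeholders.
if nvfp4_quantize_kernel_choice == NVFP4QuantizeKernelChoice.TRITON:
    data_lp, blockwise_scales = _triton_quantize(data_hp, per_tensor_scale)
elif nvfp4_quantize_kernel_choice == NVFP4QuantizeKernelChoice.FLASHINFER:
    data_lp, blockwise_scales = _flashinfer_quantize(data_hp, per_tensor_scale)
else:
    data_lp, blockwise_scales = _torch_quantize(data_hp, per_tensor_scale)
```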

```diff
-if use_triton_kernel:
+if nvfp4_quantize_kernel_choice == NVFP4QuantizeKernelChoice.TRITON:
     kernel_choice = "triton"
+elif nvfp4_quantize_kernel_choice == NVFP4QuantizeKernelChoice.FLASHINFER:
```
Contributor commented:

is this PR adding flashinfer or is that the next PR?

Contributor Author commented:

ah sorry, this should be in the next PR

```python
# flashinfer uses global_sf = (F8E4M3_MAX * F4_E2M1_MAX) / amax
# which is 1 / per_tensor_scale
global_sf = 1.0 / per_tensor_scale
data_lp, blockwise_scales = flashinfer_nvfp4_quantize(
```
Contributor commented:

are data_lp and blockwise_scales bitwise equivalent to the torch and triton paths?

Contributor Author commented:

these are not bitwise equivalent I think, tested in next PR
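
The scale-convention relationship in the snippet above checks out numerically; a small self-contained sketch, assuming the usual NVFP4 two-level scaling where per_tensor_scale = amax / (F8E4M3_MAX * F4_E2M1_MAX):

```python
# Worked check of the comment "global_sf = (F8E4M3_MAX * F4_E2M1_MAX) / amax,
# which is 1 / per_tensor_scale". The per_tensor_scale definition below is an
# assumption based on the standard NVFP4 convention, not code from this PR.
F8E4M3_MAX = 448.0  # largest normal float8 e4m3 value
F4_E2M1_MAX = 6.0   # largest float4 e2m1 value

amax = 12.0  # example: absolute max of the high-precision tensor
per_tensor_scale = amax / (F8E4M3_MAX * F4_E2M1_MAX)

global_sf = (F8E4M3_MAX * F4_E2M1_MAX) / amax  # flashinfer's convention

assert abs(global_sf - 1.0 / per_tensor_scale) < 1e-9
```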

Comment thread: torchao/prototype/mx_formats/inference_workflow.py (Outdated)

```python
class NVFP4QuantizeKernelChoice(str, Enum):
    """Enum for specifying the kernel used for NVFP4 quantization."""
```
Contributor commented:

nit: make this more specific to explain what exactly this kernel is doing, "nvfp4 quantization" is correct but ambiguous

Contributor Author commented:

sg. btw, I saw `block_size: Block size for quantization (must be 16)`, is this true? why do we make this an argument if it has to be fixed?

Contributor Author commented:

yeah confirmed with Vasiliy that we can remove this arg in the future

```python
aten = torch.ops.aten


class NVFP4QuantizeKernelChoice(str, Enum):
```
Contributor commented:

put in torchao/prototype/mx_formats/constants.py (or a similar file if one already exists)

```diff
     orig_dtype (torch.dtype): Original tensor dtype before quantization
     is_swizzled_scales (bool): Whether scales are stored in swizzled (blocked) format
-    use_triton_kernel (bool): Whether to use triton kernels
+    nvfp4_quantize_kernel_choice (NVFP4QuantizeKernelChoice): Kernel preference for quantization
```
Contributor commented:

quantize_kernel_choice? can be consistent for similar functionality in other workflows

or, quantize_to_nvfp4_kernel_choice? clearer name

Contributor Author commented:

OK will change to quantize_to_nvfp4_kernel_choice for now

can rename to quantize_kernel_choice later when there are similar cases I think

Comment thread: torchao/prototype/qat/nvfp4.py (Outdated)

```python
    use_triton_kernel: Optional[bool] = None

    def __post_init__(self):
        self.nvfp4_quantize_kernel_choice = _handle_use_triton_kernel(
```
Contributor commented:

I think this should throw an exception if the user specified use_triton_kernel=True and anything other than kernel preference triton, and do nothing else. Setting self.use_triton_kernel to None is confusing here, let's just enforce everything is consistent and keep it.

Contributor Author commented:

this would mean bc-breaking for current callsites, I was planning to not break bc in this PR, then refactor all OSS and internal callsites, and then break bc, does that sound OK?

Contributor commented:

the BC should not break as long as use_triton_kernel and nvfp4_quantize_kernel_choice both default to using the triton kernel

@jerryzh168 (Contributor Author) commented Feb 19, 2026:

> I think this should throw an exception if the user specified use_triton_kernel=True and anything other than kernel preference triton, and do nothing else.

should this ignore the case of use_triton_kernel=False and kernel_choice triton?

> the BC should not break as long as use_triton_kernel and nvfp4_quantize_kernel_choice both default to using the triton kernel

what about user setting use_triton_kernel=False?

Contributor commented:

you can throw an exception if the two vars do not match
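
Putting this thread together, one possible shape for the check. The function name _handle_use_triton_kernel appears in this PR, but the body below is a hypothetical sketch, not the actual implementation:

```python
from enum import Enum
from typing import Optional


class NVFP4QuantizeKernelChoice(str, Enum):
    # Member names mirror the enum in this PR; the string values are assumed.
    TRITON = "triton"
    FLASHINFER = "flashinfer"


def _handle_use_triton_kernel(
    use_triton_kernel: Optional[bool],
    nvfp4_quantize_kernel_choice: NVFP4QuantizeKernelChoice,
) -> NVFP4QuantizeKernelChoice:
    """Hypothetical sketch: resolve the deprecated flag against the enum."""
    if use_triton_kernel is None:
        # Flag not set: the enum is the single source of truth.
        return nvfp4_quantize_kernel_choice
    wants_triton = nvfp4_quantize_kernel_choice == NVFP4QuantizeKernelChoice.TRITON
    if use_triton_kernel != wants_triton:
        # Per the review discussion: raise if the two vars do not match.
        raise ValueError(
            "use_triton_kernel is deprecated and conflicts with "
            f"nvfp4_quantize_kernel_choice={nvfp4_quantize_kernel_choice!r}"
        )
    return nvfp4_quantize_kernel_choice
```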

@jerryzh168 requested a review from vkuzo March 14, 2026
@jerryzh168 (Contributor Author) commented:

seems mslk kernels can give us similar performance to the flashinfer kernel, so this is no longer needed

@jerryzh168 closed this Apr 15, 2026