
Add reduce_range to avoid overflow in int8 tensor #4266

Merged
Xia-Weiwen merged 16 commits into pytorch:main from cyxlily:reduce_range on Apr 28, 2026

Conversation

@cyxlily
Contributor

@cyxlily cyxlily commented Apr 10, 2026

Add an optional reduce_range flag (default False). If True, use the reduced int8 range [-64, 63] instead of the full range [-128, 127] to reduce overflow risk on platforms without VNNI instructions. Kept optional for backward compatibility with older call sites and serialized configs.
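For intuition, here's a minimal sketch of what the flag changes (illustrative only; variable names are not the PR's code):

import torch

x = torch.randn(128, 64)
amax = x.abs().max()
scale_full = amax / 127     # full range [-128, 127]
scale_reduced = amax / 63   # reduced range [-64, 63]: int8 values are half as
                            # large, so int8 products are far less likely to
                            # overflow the int16 accumulation on non-VNNI CPUs
q = torch.clamp(torch.round(x / scale_reduced), -64, 63).to(torch.int8)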

@pytorch-bot

pytorch-bot Bot commented Apr 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4266

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 4 New Failures

As of commit 4877828 with merge base b3e0db2:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Apr 10, 2026
@cyxlily cyxlily marked this pull request as draft April 10, 2026 03:34
Add reduce_range optional flag. Default False. If True, use reduced int8 range
[-64, 63] instead of full range [-128, 127] to reduce overflow risk on
platforms without VNNI instructions. Kept optional for backward compatibility
with older call sites and serialized configs.

Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
4 outdated comment threads on test/quantization/quantize_/workflows/int8/test_int8_tensor.py
@Xia-Weiwen
Collaborator

@claude review

@Xia-Weiwen Xia-Weiwen added the module: not user facing label Apr 10, 2026
@claude

claude Bot commented Apr 10, 2026

Claude finished @Xia-Weiwen's task in 2m 56s


Review of PR #4266: Add reduce_range to avoid overflow in int8 tensor

  • Read changed files and gather context
  • Review implementation in int8_tensor.py
  • Review config changes in quant_api.py
  • Review test changes
  • Post review feedback

Summary

This PR adds a reduce_range optional flag that narrows the int8 quantization range from [-128, 127] to [-64, 63] to avoid overflow on CPUs without VNNI instructions. The changes are well-scoped and the plumbing is consistent across both dynamic and static int8 quantization paths.

Issues

1. Optional[bool] type annotation is misleading

reduce_range is typed as Optional[bool] across all locations (QuantizeTensorToInt8Kwargs, Int8Tensor, config classes), but the validation in __post_init__ asserts it must be True or False; None is never accepted:

# quant_api.py:976
assert self.reduce_range in (True, False), (
    "`reduce_range` must be True or False"
)

Since None is disallowed, the type should be just bool, not Optional[bool]. Using Optional[bool] with a default of False suggests None is a valid sentinel value, which contradicts the validation. The docstring mentions "Kept optional for backward compatibility with older call sites and serialized configs" — if this is about deserialization of older configs that lack the field, the default value of False already handles that. Optional[bool] isn't needed for that purpose.
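A sketch of the suggested simplification (hypothetical config class, not the PR's actual code):

from dataclasses import dataclass

@dataclass
class ExampleConfig:
    reduce_range: bool = False  # plain bool; the default already covers old configs

    def __post_init__(self):
        # rejects None and any non-bool value at construction time
        assert isinstance(self.reduce_range, bool), "`reduce_range` must be True or False"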


2. Test skip logic may be too aggressive

In the test at test/quantization/quantize_/workflows/int8/test_int8_tensor.py:539:

if not reduce_range and not _cpu_is_vnni_supported():
    self.skipTest(
        "Only test reduce_range=True on CPUs without VNNI support to avoid int8 overflow."
    )

This skips the reduce_range=False test case on CPUs without VNNI. This is reasonable to avoid overflow failures in CI, but it means the reduce_range=False path is never tested on non-VNNI machines. Consider adding a comment explaining that reduce_range=False is the standard path already covered by the existing test classes (TestInt8DynamicQuant, TestInt8StaticQuant), so this skip is acceptable.

3. Test uses SQNR threshold instead of assertAllClose

Xia-Weiwen's review comment suggested using assertAllClose with a fake-quantized reference rather than SQNR. The current code at line 597 still uses:

assert compute_error(output_fp, output_quantized) > 20

This only validates that quantization doesn't catastrophically degrade quality, but doesn't verify correctness against a reference implementation. For a feature that changes quantization ranges, verifying that the output matches fake-quantized reference output would be a stronger test.
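A sketch of the fake-quantized-reference style of check being suggested (the helper and tolerances are illustrative, not the test's code):

import torch

def fake_quantize(w: torch.Tensor, qmin: int = -64, qmax: int = 63) -> torch.Tensor:
    # quantize -> dequantize round trip, used as a reference for the int8 path
    scale = w.abs().amax() / max(abs(qmin), qmax)
    q = torch.clamp(torch.round(w / scale), qmin, qmax)
    return q * scale

w, x = torch.randn(16, 8), torch.randn(4, 8)
ref = x @ fake_quantize(w).t()  # reference output with fake-quantized weight
# a real test would compare the library's int8 output against ref:
# torch.testing.assert_close(out_quantized, ref, atol=1e-2, rtol=1e-2)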

Nits / Minor Suggestions

4. (-64, 63) magic numbers appear in multiple places

The reduced range (-64, 63) is hardcoded in four locations:

  • int8_tensor.py:193 (from_hp)
  • int8_tensor.py:248 (dequantize)
  • test_int8_tensor.py:560 (test)
  • Docstrings

Consider defining these as module-level constants (e.g., _REDUCED_QUANT_MIN = -64, _REDUCED_QUANT_MAX = 63) to avoid duplication and make the reduced range easy to find and change in the future. This is a minor style nit and not blocking.
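For concreteness, the constants and a tiny selector could look like this (the constant names are from this review; _quant_bounds is a hypothetical helper):

_REDUCED_QUANT_MIN, _REDUCED_QUANT_MAX = -64, 63
_FULL_QUANT_MIN, _FULL_QUANT_MAX = -128, 127

def _quant_bounds(reduce_range: bool) -> tuple[int, int]:
    # single place that maps the flag to the int8 quantization bounds
    if reduce_range:
        return _REDUCED_QUANT_MIN, _REDUCED_QUANT_MAX
    return _FULL_QUANT_MIN, _FULL_QUANT_MAX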

5. Int8WeightOnlyConfig not updated

Int8WeightOnlyConfig (line 814 of quant_api.py) doesn't get the reduce_range parameter. This is likely intentional since weight-only quantization uses a floating-point activation path (no int8 accumulation overflow risk), but it may be worth a comment in the PR description or docstring clarifying this.

What looks good

  • The reduce_range flag is correctly threaded through all relevant code paths: from_hp, dequantize, config classes, act_quant_kwargs, and the _choose_quant_func_and_quantize_tensor dispatch.
  • Backward compatibility is preserved — the default is False and existing serialized models will work.
  • The reduce_range attribute is properly included in optional_tensor_attribute_names, __repr__, and all aten op implementations (slice, select, _pin_memory).
  • The test covers a comprehensive parameter matrix (dtype, compile, config_mode, granularity, mapping_type, reduce_range).


Cui, Lily added 2 commits April 10, 2026 06:56
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
@cyxlily
Contributor Author

cyxlily commented Apr 13, 2026

Thanks for the suggestion.

  1. We need to keep reduce_range as Optional[bool] for backward compatibility with old quantized models. Otherwise we get this error:
    [FAIL] 1 Int8Tensor(s) failed dequantize(): ['fc.weight']
  2. I updated the CPU test so we still run the setup/quantization/forward path on non-VNNI CPUs, but we skip the SQNR (compute_error) assertion specifically when reduce_range=False and VNNI is unavailable. TestInt8DynamicQuant and TestInt8StaticQuant only test GPU.
  3. In this test, I intentionally did not use a fake-quantized reference with assertAllClose. The goal here is to measure end-to-end degradation from quantization by comparing the FP output and the quantized output, rather than checking closeness to a specific reference path.
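For reference, the SQNR-style metric in point 3 works roughly like this (a sketch; torchao's actual compute_error may differ in detail):

import torch

def sqnr_db(ref: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
    # signal-to-quantization-noise ratio in dB; higher means less degradation
    return 20 * torch.log10(ref.norm() / (ref - pred).norm())

ref = torch.randn(32)
assert sqnr_db(ref, ref + 1e-3 * torch.randn(32)) > 20  # ~20 dB threshold as in the test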

Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
Comment thread test/quantization/quantize_/workflows/int8/test_int8_tensor.py Outdated
@Xia-Weiwen
Collaborator

@claude review again

@claude

claude Bot commented Apr 13, 2026

Claude finished @Xia-Weiwen's task in 2m 28s


Re-review of PR #4266: Add reduce_range to avoid overflow in int8 tensor

  • Read all changed files and gather context
  • Review implementation in int8_tensor.py
  • Review config changes in quant_api.py
  • Review kwargs changes
  • Review test changes
  • Post review feedback

Summary

The PR is in good shape. The reduce_range flag is correctly threaded through all code paths, the constants are properly extracted, and the test covers a comprehensive parameter matrix. Here are the remaining items to address:

Issues

1. Unresolved reviewer comment: move _cpu_is_vnni_supported to torchao/utils.py

@Xia-Weiwen commented that _cpu_is_vnni_supported (currently imported from torchao/kernel/intmm.py) should be moved to torchao/utils.py. This hasn't been addressed yet. The function at torchao/kernel/intmm.py:121 is a general CPU capability check, not specific to int matmul, so torchao/utils.py is arguably a better home for it.


2. __all__ cleanup in workflows/__init__.py is unrelated to reduce_range

The diff removes duplicate "Int8Tensor" and "QuantizeTensorToInt8Kwargs" entries from __all__ in torchao/quantization/quantize_/workflows/__init__.py. This is a correct cleanup (the base had duplicates), but it's unrelated to the reduce_range feature. Consider noting this in the PR description or splitting it into a separate commit for clarity. Not blocking.

3. Optional[bool] type vs validation — author's response is reasonable

My previous review flagged that Optional[bool] is misleading when None is rejected by validation. The author responded that Optional[bool] is needed for backward compatibility with deserialized old models. This is a valid concern — if an old serialized Int8Tensor lacks the reduce_range attribute, the Optional type combined with the default False ensures smooth loading. The __post_init__ validation on the config classes (which are constructed by users, not deserialized) catches misuse at config creation time. This is acceptable as-is.

Nit

4. Test SQNR assertion silently passes on non-VNNI CPUs with reduce_range=False

At test_int8_tensor.py:602-605:

if reduce_range or _cpu_is_vnni_supported():
    assert compute_error(output_fp, output_quantized) > 20, (...)

When reduce_range=False on a non-VNNI CPU, the test runs the full quantization + forward pass but skips the quality assertion entirely. This means a complete regression (e.g., NaN outputs) would go undetected on non-VNNI CI machines. Consider adding a basic sanity check for the skipped case, such as:

if reduce_range or _cpu_is_vnni_supported():
    assert compute_error(output_fp, output_quantized) > 20, (...)
else:
    # At minimum, verify no NaN/Inf in output
    assert not torch.isnan(output_quantized).any(), "Output contains NaN"
    assert not torch.isinf(output_quantized).any(), "Output contains Inf"

What looks good

  • The reduce_range flag is correctly propagated through all paths: from_hp, dequantize, both config classes, act_quant_kwargs, _choose_quant_func_and_quantize_tensor, and all aten ops (slice, select, _pin_memory).
  • Module-level constants (_REDUCED_QUANT_MIN, _REDUCED_QUANT_MAX, _FULL_QUANT_MIN, _FULL_QUANT_MAX) replace all magic numbers — addresses the nit from the previous review.
  • The _pin_memory implementation now correctly handles act_pre_scale (was previously missing, unrelated to reduce_range but a good bugfix).
  • The test covers a wide parameter matrix: dtype × compile × config_mode × granularity × mapping_type × reduce_range.
  • Backward compatibility is preserved via Optional[bool] with default False.

@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review April 13, 2026 06:56
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
Args:
granularity: the granularity for the Tensor, currently either PerRow() or PerTensor()
mapping_type: whether to use symmetric or asymmetric quant
reduce_range: optional flag. If True, use reduced int8 range [-64, 63]
Contributor

nit: we should be consistent about using numbers vs. _REDUCED_QUANT_MIN etc. Right now quant_api is using the variables but here it's using the raw numbers.

Contributor

OK to use variables but might also be good to mention what these values are in the docstring as well

Contributor

as commented in #3784 (comment), I'm wondering if we can just set it automatically and print out a warning, instead of leaving this to the user?

Contributor Author

Thanks for the comments, updated.



@common_utils.instantiate_parametrized_tests
class TestInt8TensorCPU(TorchAOIntegrationTestCase):
Contributor

nit: maybe create a new test_int8_tensor_cpu.py; it will be easier to separate tests by device for CI in the future if we want.

Contributor Author

Thanks for the comments, updated.

Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
_FULL_QUANT_MAX = 127


def _should_use_reduced_range(tensor: torch.Tensor) -> bool:
Contributor

I think the user should set this and torchao can make it easy to pick the right setting; automagically setting this based on hardware is a footgun.

Contributor

@jerryzh168 jerryzh168 Apr 14, 2026


OK, we can expose this as an arg; by default it will be auto-picked based on device, and we don't require people to set it?

# reduce_range auto-picked, but can be overwritten
# by default, auto picked based on device
config = Int8DynamicActivationInt8WeightConfig()

# explicit overwrite
config = Int8DynamicActivationInt8WeightConfig(reduce_range=True/False)

or are you talking about the user always explicitly setting this arg in the config, like changing all callsites to:

config = Int8DynamicActivationInt8WeightConfig(reduce_range=get_recommended_reduce_range())
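A sketch of what the helper named in the second option could look like (hypothetical wrapper; it assumes the _cpu_is_vnni_supported check discussed elsewhere in this thread):

import torch
from torchao.utils import _cpu_is_vnni_supported  # location per later review discussion

def get_recommended_reduce_range(device: torch.device) -> bool:
    # the reduced range only matters on CPUs lacking VNNI; other devices
    # typically accumulate int8 products safely in int32
    return device.type == "cpu" and not _cpu_is_vnni_supported()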

Collaborator

@vkuzo @jerryzh168 Thanks for the comments. To make UX better, we can probably do as @jerryzh168 suggested:

  • if users don't set it explicitly, it is decided automatically (reduce_range=True if on a CPU without VNNI, False otherwise)
  • if users set it explicitly, we just use whichever the users set and the users are responsible for the results.

How does that sound to you? Thanks.

Cui, Lily added 2 commits April 15, 2026 04:02
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
@Xia-Weiwen Xia-Weiwen requested review from jerryzh168 and vkuzo April 16, 2026 02:35
WeightOnlyInt8Quant doesn't have overflow risk; set reduce_range = False by default.

Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
Args:
granularity: the granularity for the Tensor, currently either PerRow() or PerTensor()
mapping_type: whether to use symmetric or asymmetric quant
reduce_range: If None, choose the int8 quantization range automatically.
Contributor

I'm going to quote "simple over easy" here: let's make it simple by requiring the user to specify numerics upfront. It's ok to ask users on hardware that needs reduce_range to specify the required flag manually.

Collaborator

@Xia-Weiwen Xia-Weiwen Apr 16, 2026


@vkuzo Thanks for the comments. Just to confirm, you don't have concern if reduce_range is optional and if it's None we use reduce_range=False, do you? This is for backward compatibility in case where people saved their quantized model without this flag and want to load it with a new version of torchao.

Contributor

yes, that sounds good. It's also fine to provide a helper function which returns the "correct" value based on hardware, but the default value should be simple and deterministic (reduce_range False).

Contributor Author

Thanks for the comments. I've set reduce_range default to False. @vkuzo @jerryzh168 please review.

Set reduce_range default to False. Users can use should_reduce_range() to determine the recommended setting.
@Xia-Weiwen
Collaborator

@claude review

@claude

claude Bot commented Apr 20, 2026

Claude finished @Xia-Weiwen's task in 2m 26s


Review of PR #4266: Add reduce_range to avoid overflow in int8 tensor

  • Read changed files and gather context
  • Review implementation in int8_tensor.py
  • Review config changes in quant_api.py
  • Review utility changes (utils.py, intmm.py, kwargs)
  • Review test changes
  • Post review feedback

Summary

The PR is well-structured. The reduce_range flag is consistently threaded through all code paths (configs, from_hp, QuantizeTensorToInt8Kwargs, aten ops), backward compatibility is maintained, and the resolution of the auto-detect vs. manual-set discussion aligns with the "simple over easy" principle. One issue that should be fixed before merge, and a few minor items:

Issue: Dead code reintroduced from deleted PR #4211

check_cpu_version and check_xpu_version were intentionally removed from the codebase in PR #4211 (Remove check_cpu_version and check_xpu_version helpers). This PR re-adds them in torchao/utils.py:1173-1195, but neither function is used anywhere — not in this PR or in the rest of the codebase. This appears to be a merge artifact from rebasing onto main after #4211 landed.

These should be removed to avoid contradicting the explicit cleanup in #4211.


Minor items (non-blocking)

1. is_pinned doesn't check act_pre_scale

The aten.is_pinned.default implementation (int8_tensor.py:446-455) checks qdata, scale, zero_point, act_quant_scale, and act_quant_zero_point — but not act_pre_scale. The _pin_memory implementation correctly handles act_pre_scale (good bugfix in this PR), but is_pinned is now inconsistent. If act_pre_scale is present and not pinned, is_pinned would return True while the tensor isn't fully pinned. This predates this PR but is worth fixing while you're in this code.

2. should_reduce_range is not exported from the public torchao.quantization namespace

The docstrings for both Int8DynamicActivationInt8WeightConfig and Int8StaticActivationInt8WeightConfig reference should_reduce_range() as a helper users can call, but it's only importable from the deep internal path torchao.quantization.quantize_.workflows.int8.int8_tensor. Consider re-exporting it from torchao.quantization so users can do:

from torchao.quantization import should_reduce_range

3. None handling in reduce_range

The validation accepts None (assert self.reduce_range in (True, False, None)), and the type is Optional[bool]. When reduce_range=None, it's falsy in Python so if reduce_range: in from_hp will use the full range — equivalent to False. This works correctly by accident but is worth a brief comment in the code, or normalizing None to False in __post_init__ for explicitness.
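A sketch of the explicit normalization suggested here (hypothetical dataclass shape, not the PR's class):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ExampleConfig:
    reduce_range: Optional[bool] = False

    def __post_init__(self):
        # old serialized configs may deserialize with reduce_range=None;
        # normalize early so downstream code only ever sees a real bool
        if self.reduce_range is None:
            self.reduce_range = False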

What looks good

  • _cpu_is_vnni_supported moved from intmm.py to utils.py with the import updated — clean separation of concerns.
  • reduce_range correctly propagated to all aten op implementations (slice, select, _pin_memory).
  • Removing hardcoded quant_min=-128, quant_max=127 from dequantize() is correct — dequantize_affine doesn't use these values in computation (the formula is simply (input - zero_point) * scale), so the default None → full dtype range works fine.
  • The _pin_memory bugfix for act_pre_scale is a good incidental fix.
  • Test moved to a separate test_int8_tensor_cpu.py file per @jerryzh168's suggestion — good for future CI device separation.
  • The test uses should_reduce_range() to dynamically determine the flag, ensuring it passes on both VNNI and non-VNNI CI machines.

Cui, Lily added 2 commits April 20, 2026 02:33
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
@Xia-Weiwen
Collaborator

@claude review

@claude

claude Bot commented Apr 20, 2026

Claude finished @Xia-Weiwen's task in 1m 59s


Review of PR #4266: Add reduce_range to avoid overflow in int8 tensor

  • Read changed files and gather context
  • Review implementation in int8_tensor.py
  • Review config changes in quant_api.py
  • Review utility changes (utils.py, intmm.py, kwargs)
  • Review test changes
  • Check resolution of previous review items
  • Post review feedback

Summary

The PR is in good shape and ready for merge. All major items from previous reviews have been addressed. The reduce_range flag is correctly and consistently threaded through the entire quantization pipeline — configs, from_hp, kwargs dispatch, aten ops, and tests.

Previous review items — status

  1. Dead code (check_cpu_version/check_xpu_version) — Fixed. These are no longer present in utils.py.
  2. _cpu_is_vnni_supported moved to utils.py — Done. The import in intmm.py is updated accordingly.
  3. Test moved to separate test_int8_tensor_cpu.py — Done per @jerryzh168's suggestion.
  4. reduce_range default behavior settled — Per @vkuzo's guidance, the default is False (simple and deterministic), with should_reduce_range() as a helper users can call.
  5. is_pinned now checks act_pre_scale — Fixed at int8_tensor.py:455-456.
  6. _pin_memory now handles act_pre_scale — Fixed at int8_tensor.py:477-479.

Remaining items (non-blocking)

1. should_reduce_range not re-exported from torchao.quantization

The docstrings for both config classes reference should_reduce_range() as a helper, but users must import it from torchao.utils or the deep internal path torchao.quantization.quantize_.workflows.int8.int8_tensor. Re-exporting from torchao.quantization would improve discoverability:

from torchao.quantization import should_reduce_range

The function is in int8_tensor.py's __all__ (line 39) but isn't propagated up through workflows/__init__.py or torchao/quantization/__init__.py. This can be a follow-up.

2. Duplicate entries in workflows/__init__.py:__all__

Lines 44-45 and 50-51 have duplicate "Int8Tensor" and "QuantizeTensorToInt8Kwargs" entries. This predates this PR but is worth cleaning up. Not blocking.

3. None handling in reduce_range is implicit

When reduce_range=None, __post_init__ normalizes it to False. In from_hp, if reduce_range: naturally treats None as falsy. This works correctly but relies on implicit Python truthiness. A brief inline comment in from_hp noting this would help readability, but not blocking.

What looks good

  • reduce_range correctly propagated through all paths: from_hp, dequantize, both config classes, act_quant_kwargs, _choose_quant_func_and_quantize_tensor, and all aten ops (slice, select, _pin_memory, is_pinned).
  • Removing hardcoded quant_min=-128, quant_max=127 from dequantize() is correct — _dequantize_affine_no_dtype_check doesn't use these values in computation (formula is (input - zero_point) * scale), so defaulting to None → full dtype range is fine.
  • should_reduce_range() is a clean, simple helper that follows the "simple over easy" principle — users explicitly opt in.
  • Backward compatibility preserved via Optional[bool] with default False and None → False normalization.
  • Test uses should_reduce_range() to dynamically determine the flag, ensuring CI passes on both VNNI and non-VNNI machines.
  • The act_pre_scale bugfixes in is_pinned and _pin_memory are good incidental improvements.
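Putting the merged pieces together, end-user code would look roughly like this (the should_reduce_range import path is the internal one noted above and may change if the re-export lands):

import torch
from torchao.quantization import quantize_, Int8DynamicActivationInt8WeightConfig
from torchao.quantization.quantize_.workflows.int8.int8_tensor import should_reduce_range

model = torch.nn.Linear(64, 64)
config = Int8DynamicActivationInt8WeightConfig(
    reduce_range=should_reduce_range(torch.device("cpu")),
)
quantize_(model, config)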

@cyxlily cyxlily requested a review from vkuzo April 20, 2026 03:33
Cui, Lily added 2 commits April 20, 2026 03:45
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>

granularity: Granularity
mapping_type: MappingType = MappingType.SYMMETRIC
reduce_range: Optional[bool] = False
Contributor

can you add a comment on why Optional, and whether we can remove it in the future?

I don't think we need to guarantee BC for old checkpoints btw

Contributor Author

Thanks for the comments, removed Optional as it's not needed to guarantee BC for old checkpoints.

Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
@cyxlily cyxlily requested a review from jerryzh168 April 23, 2026 08:16
@Xia-Weiwen
Collaborator

Hi @jerryzh168 Could you please review this PR again?

Comment thread torchao/quantization/quant_api.py Outdated
Comment on lines +890 to +892
assert self.reduce_range in (True, False), (
"`reduce_range` must be True or False. None is defaulted to False."
)
Contributor

probably not needed?

Comment thread torchao/quantization/quant_api.py Outdated
)
if self.reduce_range is None:
self.reduce_range = False
assert self.reduce_range in (True, False), (
Contributor

same here

reduce_range=reduce_range,
)
else:
act_granularity, _ = Int8Tensor._normalize_granularity(granularity)
Contributor

nit: add assert for config_mode == "static"

input_tensor = torch.randn(M, K, dtype=dtype, device=device)
model = ToyTwoLinearModel(K, N, K, dtype=dtype, device=device).eval()
model_q = copy.deepcopy(model)
reduce_range = should_reduce_range(input_tensor.device)
Contributor

what's the effect when reduce_range is not set correctly for cpu? should we test that as well?

Contributor Author

Thanks for the comments. On CPUs without VNNI, when using torch.compile with int8 matmul, there may be accuracy degradation compared to eager mode due to overflow in oneDNN qlinear. The reduce_range setting is designed to prevent this.
We don't need to test incorrect settings because this UT focuses on the recommended path, and I've already provided the helper function should_reduce_range() and clear UT Notes.
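For background on that overflow, the arithmetic is roughly this (illustrative numbers, not oneDNN code): pre-VNNI x86 int8 matmul multiplies u8 activations by s8 weights and sums pairs of products into int16 (vpmaddubsw semantics), which can exceed the int16 range unless the weight magnitude is halved:

worst_full    = 255 * 128 * 2   # 65280 > 32767: can overflow int16
worst_reduced = 255 * 64 * 2    # 32640 <= 32767: always fits
print(worst_full, worst_reduced)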

Contributor

@jerryzh168 jerryzh168 left a comment


looks good overall

Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
@Xia-Weiwen
Collaborator

Merge as CI failures are unrelated.

@Xia-Weiwen Xia-Weiwen merged commit e094ce3 into pytorch:main Apr 28, 2026
15 of 19 checks passed
@cyxlily cyxlily deleted the reduce_range branch April 29, 2026 05:22

Labels

CLA Signed: managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.
module: not user facing: use this tag if you don't want this PR to show up in release notes.

4 participants