Add reduce_range to avoid overflow in int8 tensor #4266
Xia-Weiwen merged 16 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4266

Note: links to docs will display an error until the docs builds have completed.

❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.

❌ 4 New Failures as of commit 4877828 with merge base b3e0db2. The following jobs have failed: …

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add an optional `reduce_range` flag, default `False`. If `True`, use the reduced int8 range [-64, 63] instead of the full range [-128, 127] to reduce overflow risk on platforms without VNNI instructions. Kept optional for backward compatibility with older call sites and serialized configs. Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
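The effect of the flag on quantization parameters can be sketched in a few lines. This is an illustrative example, not torchao's implementation: symmetric per-tensor int8 quantization where the scale is derived from whichever range is selected.

```python
import torch

# Minimal sketch (not torchao's code): symmetric per-tensor int8
# quantization with an optional reduced range.
def quantize_symmetric(x: torch.Tensor, reduce_range: bool = False):
    qmin, qmax = (-64, 63) if reduce_range else (-128, 127)
    # The scale maps the largest magnitude onto the selected range.
    scale = x.abs().max() / qmax
    q = torch.clamp(torch.round(x / scale), qmin, qmax).to(torch.int8)
    return q, scale

x = torch.linspace(-1.0, 1.0, 9)
q_full, s_full = quantize_symmetric(x, reduce_range=False)
q_red, s_red = quantize_symmetric(x, reduce_range=True)
print(q_full.max().item(), q_red.max().item())  # 127 63
```

The reduced range roughly doubles the scale (coarser quantization steps), trading about one bit of precision for accumulator headroom on hardware without VNNI.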
@claude review
Claude finished @Xia-Weiwen's task in 2m 56s — View job

**Review of PR #4266: Add reduce_range to avoid overflow in int8 tensor**

**Summary**

This PR adds a `reduce_range` flag that switches int8 quantization to the reduced range [-64, 63] to avoid overflow on CPUs without VNNI.

**Issues**

1. Runtime validation of `reduce_range`:

```python
# quant_api.py:976
assert self.reduce_range in (True, False), (
    "`reduce_range` must be True or False"
)
```

Since `reduce_range` is already annotated as `bool`, this assert is largely redundant with the type hint.

2. Test skip logic may be too aggressive. In the test:

```python
if not reduce_range and not _cpu_is_vnni_supported():
    self.skipTest(
        "Only test reduce_range=True on CPUs without VNNI support to avoid int8 overflow."
    )
```

This skips the `reduce_range=False` case on every CPU without VNNI, so that configuration is never exercised on such machines.

3. Test uses an SQNR threshold instead of a reference comparison. Xia-Weiwen's review comment suggested using:

```python
assert compute_error(output_fp, output_quantized) > 20
```

This only validates that quantization doesn't catastrophically degrade quality, but doesn't verify correctness against a reference implementation. For a feature that changes quantization ranges, verifying that the output matches fake-quantized reference output would be a stronger test.

**Nits / Minor Suggestions**

4. The reduced range bounds appear as literals. Consider defining them as module-level constants (e.g., `_REDUCED_QUANT_MIN` / `_REDUCED_QUANT_MAX`).
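The review's threshold check relies on an SQNR-style metric. A minimal sketch of such a metric (torchao's actual `compute_error` may differ in detail) shows why "> 20" is a loose bound:

```python
import torch

def sqnr_db(ref: torch.Tensor, test: torch.Tensor) -> float:
    # Signal-to-quantization-noise ratio in dB: higher means test is
    # closer to ref; 20 dB corresponds to ~10% relative error in norm.
    signal = torch.linalg.vector_norm(ref) ** 2
    noise = torch.linalg.vector_norm(ref - test) ** 2
    return (10 * torch.log10(signal / noise)).item()

torch.manual_seed(0)
ref = torch.randn(1024)
slightly_off = ref + 1e-3 * torch.randn(1024)
very_off = ref + 0.2 * torch.randn(1024)
assert sqnr_db(ref, slightly_off) > 20  # tiny perturbation: ~60 dB
assert sqnr_db(ref, very_off) < 20      # large perturbation: ~14 dB
```

A bit-exact comparison against a fake-quantized reference, as the review suggests, would catch range bugs that still clear a 20 dB bar.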
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>

Thanks for the suggestion.

Signed-off-by: Cui, Lily <yuxin.cui@intel.com>

@claude review again
Claude finished @Xia-Weiwen's task in 2m 28s — View job

**Re-review of PR #4266: Add reduce_range to avoid overflow in int8 tensor**

**Summary**

The PR is in good shape.

**Issues**

1. Unresolved reviewer comment to move …: @Xia-Weiwen commented that …
2. The diff removes duplicate …
3. My previous review flagged that …

**Nit**

4. Test SQNR assertion silently passes on non-VNNI CPUs with `reduce_range=False`. Currently:

```python
if reduce_range or _cpu_is_vnni_supported():
    assert compute_error(output_fp, output_quantized) > 20, (...)
```

When `reduce_range=False` on a CPU without VNNI, the assertion is skipped entirely and the test passes without checking anything. Consider:

```python
if reduce_range or _cpu_is_vnni_supported():
    assert compute_error(output_fp, output_quantized) > 20, (...)
else:
    # At minimum, verify no NaN/Inf in output
    assert not torch.isnan(output_quantized).any(), "Output contains NaN"
    assert not torch.isinf(output_quantized).any(), "Output contains Inf"
```
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
```python
Args:
    granularity: the granularity for the Tensor, currently either PerRow() or PerTensor()
    mapping_type: whether to use symmetric or asymmetric quant
    reduce_range: optional flag. If True, use reduced int8 range [-64, 63]
```
nit: we should be consistent about using numbers vs. `_REDUCED_QUANT_MIN` etc. Right now, quant_api is using the variables but here it's using numbers.
OK to use variables, but it might also be good to mention what these values are in the docstring as well.
As commented in #3784 (comment), I'm wondering if we can just set it automatically and print out a warning, instead of leaving this to the user?
Thanks for the comments, updated.
```python
@common_utils.instantiate_parametrized_tests
class TestInt8TensorCPU(TorchAOIntegrationTestCase):
```
nit: maybe create a new test_int8_tensor_cpu.py; it will be easier to separate tests by device for CI in the future if we want.
Thanks for the comments, updated.
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
```python
_FULL_QUANT_MAX = 127


def _should_use_reduced_range(tensor: torch.Tensor) -> bool:
```
I think the user should set this, and torchao can make it easy to pick the right setting; automagically setting this based on hardware is a footgun.
OK, we can expose this as an arg; by default it will be auto-picked based on device, so people aren't required to set it?

```python
# reduce_range auto-picked, but can be overridden
# by default, auto-picked based on device
config = Int8DynamicActivationInt8WeightConfig()
# explicit override
config = Int8DynamicActivationInt8WeightConfig(reduce_range=True/False)
```

Or are you saying the user should always explicitly set this arg in the config, like changing all call sites to:

```python
config = Int8DynamicActivationInt8WeightConfig(reduce_range=get_recommended_reduce_range())
```
@vkuzo @jerryzh168 Thanks for the comments. To make the UX better, we can probably do as @jerryzh168 suggested:

- if users don't set it explicitly, it is decided automatically (reduce_range=True if on a CPU without VNNI, False otherwise)
- if users set it explicitly, we just use whatever the users set, and the users are responsible for the results.

How does that sound to you? Thanks.
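The tri-state behavior proposed here (auto-pick when unset, user wins when explicit) can be sketched as follows. Note this reflects the proposal in this thread, not the merged behavior, which defaults to plain `False`; `_cpu_is_vnni_supported` below is a stand-in for a real hardware check:

```python
from typing import Optional

def _cpu_is_vnni_supported() -> bool:
    # Stand-in for a real CPU-feature check (e.g. parsing CPU flags).
    return False

def resolve_reduce_range(reduce_range: Optional[bool]) -> bool:
    # Explicit setting wins; the user owns the result.
    if reduce_range is not None:
        return reduce_range
    # Unset: reduce the range only where overflow is possible.
    return not _cpu_is_vnni_supported()
```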
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
```python
Args:
    granularity: the granularity for the Tensor, currently either PerRow() or PerTensor()
    mapping_type: whether to use symmetric or asymmetric quant
    reduce_range: If None, choose the int8 quantization range automatically.
```
I'm going to quote "simple over easy" here: let's make it simple by requiring the user to specify numerics upfront. It's OK to ask users on hardware that needs reduce_range to specify the required flag manually.
@vkuzo Thanks for the comments. Just to confirm: you don't have concerns if reduce_range is optional and, when it's None, we use reduce_range=False, do you? This is for backward compatibility, in case people saved their quantized model without this flag and want to load it with a new version of torchao.
Yes, that sounds good. It's also fine to provide a helper function which returns the "correct" value based on hardware, but the default value should be simple and deterministic (reduce_range=False).
Thanks for the comments. I've set the reduce_range default to False. @vkuzo @jerryzh168 please review.

Set reduce_range default to False. Users can use should_reduce_range() to determine the recommended value.
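What such a helper might check can be sketched as below. This is a hypothetical implementation, not torchao's actual `should_reduce_range`: it recommends the reduced range only for CPU tensors, using `/proc/cpuinfo` flags as a Linux-only VNNI probe.

```python
import torch

def should_reduce_range_sketch(device) -> bool:
    # Hypothetical sketch of the helper (torchao's real implementation
    # may differ): recommend the reduced int8 range only on CPUs that
    # lack VNNI, where int8 matmul intermediates can overflow.
    if torch.device(device).type != "cpu":
        return False
    try:
        # Linux-only probe: VNNI appears as e.g. "avx512_vnni" in flags.
        with open("/proc/cpuinfo") as f:
            return "vnni" not in f.read()
    except OSError:
        # Unknown platform: be conservative and reduce the range.
        return True
```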
@claude review
Claude finished @Xia-Weiwen's task in 2m 26s — View job

**Review of PR #4266: Add reduce_range to avoid overflow in int8 tensor**

**Summary**

The PR is well-structured.

**Issue: Dead code reintroduced from deleted PR #4211**

These should be removed to avoid contradicting the explicit cleanup in #4211.

**Minor items (non-blocking)**

1. The …
2. The docstrings for both config classes reference `from torchao.quantization import should_reduce_range`.
3. The validation accepts …
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>

@claude review
Claude finished @Xia-Weiwen's task in 1m 59s — View job

**Review of PR #4266: Add reduce_range to avoid overflow in int8 tensor**

**Summary**

The PR is in good shape and ready for merge. All major items from previous reviews have been addressed.

**Remaining items (non-blocking)**

1. The docstrings for both config classes reference:

```python
from torchao.quantization import should_reduce_range
```

The function is in …

2. Duplicate entries: lines 44-45 and 50-51 have duplicate …

3. When …
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
```python
granularity: Granularity
mapping_type: MappingType = MappingType.SYMMETRIC
reduce_range: Optional[bool] = False
```
Can you add a comment on why Optional, and whether we can remove it in the future? I don't think we need to guarantee BC for old checkpoints, btw.
Thanks for the comments; removed Optional, as it's not needed to guarantee BC for old checkpoints.
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
Hi @jerryzh168, could you please review this PR again?
```python
if self.reduce_range is None:
    self.reduce_range = False
assert self.reduce_range in (True, False), (
    "`reduce_range` must be True or False. None is defaulted to False."
)
```
```python
        reduce_range=reduce_range,
    )
else:
    act_granularity, _ = Int8Tensor._normalize_granularity(granularity)
```
nit: add an assert for config_mode == "static".
```python
input_tensor = torch.randn(M, K, dtype=dtype, device=device)
model = ToyTwoLinearModel(K, N, K, dtype=dtype, device=device).eval()
model_q = copy.deepcopy(model)
reduce_range = should_reduce_range(input_tensor.device)
```
What's the effect when reduce_range is not set correctly for CPU? Should we test that as well?
Thanks for the comments. On CPUs without VNNI, when using torch.compile with int8 matmul, there may be accuracy degradation compared to eager mode due to overflow in oneDNN qlinear. The reduce_range setting is designed to prevent this.

We don't need to test incorrect settings, because this unit test focuses on the recommended path, and I've already provided the helper function should_reduce_range() and clear notes in the test.
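The overflow mechanism described above can be shown with plain integer arithmetic. The operand ranges below are an assumption about the kernel (oneDNN int8 linear commonly multiplies u8 activations by s8 weights; without VNNI, the x86 path sums adjacent product pairs into int16 via `pmaddubsw`, which saturates, while VNNI's `vpdpbusd` accumulates four products directly into int32):

```python
import numpy as np

i16_max = np.iinfo(np.int16).max  # 32767

# Full range: u8 activation up to 255, s8 weight up to 127.
# One pair of products already exceeds int16, so pmaddubsw saturates
# and the accumulated result is corrupted.
full_pair = 255 * 127 + 255 * 127
assert full_pair > i16_max  # 64770

# Reduced weight range [-64, 63]: each product halves, and the pair
# sum stays inside int16, so no saturation occurs.
reduced_pair = 255 * 63 + 255 * 63
assert reduced_pair <= i16_max  # 32130
```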
Signed-off-by: Cui, Lily <yuxin.cui@intel.com>
Merging as the CI failures are unrelated.