
Fix int8 dynamic activation quantization accuracy regression from v2 tensor migration #4326

Open

haotongzou wants to merge 2 commits into pytorch:main from haotongzou:fix-int8-activation-accuracy-regression

Conversation


@haotongzou haotongzou commented Apr 24, 2026

The v2 int8 dynamic activation quantization path introduced in #4151 uses hp_tensor.dtype as scale_dtype in choose_qparams_affine. When the input tensor is bfloat16, this computes quantization scales in bfloat16 precision instead of float32, causing ~0.6 dB SQNR degradation and measurable accuracy loss on downstream tasks (e.g. MMLU on Llama-3.1-8B-Instruct).

This PR replaces hp_tensor.dtype with a scale_dtype argument that defaults to torch.float32, so the scale precision can be configured per platform.

Regression Cause

The original code path hardcoded scale_dtype=torch.float32. The refactor in #4151 changed this to hp_tensor.dtype, which for bfloat16 models means scale computation happens at reduced precision. Since bfloat16 has only 8 bits of mantissa (vs float32's 24), the quantization scales lose precision, degrading output quality.
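
To make the precision effect concrete, here is a small, self-contained illustration (not the torchao code path; it just mimics symmetric per-token int8 scaling with absmax/127) that compares SQNR when the scale is left in bfloat16 versus computed in float32. The exact gap depends on the data; the ~0.6 dB figure above comes from the real workload.

    import torch

    torch.manual_seed(0)
    # A toy bfloat16 activation row, as a bfloat16 model would produce.
    x = torch.randn(1, 4096, dtype=torch.bfloat16)

    # Symmetric per-token int8 scale: absmax / 127 (illustrative only).
    amax = x.abs().amax(dim=-1, keepdim=True)
    scale_bf16 = amax / 127                      # scale kept in bfloat16 (what hp_tensor.dtype gives)
    scale_fp32 = amax.to(torch.float32) / 127    # scale computed in float32 (the behavior this PR restores)

    def quant_dequant(t, s):
        # Quantize to int8 range with the given scale, then dequantize.
        q = torch.clamp(torch.round(t.to(torch.float32) / s), -128, 127)
        return q * s

    def sqnr_db(ref, approx):
        err = ref - approx
        return 10 * torch.log10(ref.pow(2).mean() / err.pow(2).mean())

    ref = x.to(torch.float32)
    print("SQNR, bf16 scale:", sqnr_db(ref, quant_dequant(x, scale_bf16.to(torch.float32))).item())
    print("SQNR, fp32 scale:", sqnr_db(ref, quant_dequant(x, scale_fp32)).item())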

Validation

With this change, Llama-3.1-8B-Instruct restores top-1 accuracy in our tests to the v1 level.

No built-in tests should be affected.


pytorch-bot Bot commented Apr 24, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4326

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 4 New Failures, 1 Pending

As of commit 9c98887 with merge base 67a78e5:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


meta-cla Bot commented Apr 24, 2026

Hi @haotongzou!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

The v2 int8 activation quantization path uses hp_tensor.dtype as the
scale_dtype in choose_qparams_affine. When the input is bfloat16, this
causes scale computation in bfloat16 precision, resulting in ~0.6 dB
SQNR degradation compared to the v1 code path which used float32.

Add a configurable scale_dtype field to QuantizeTensorToInt8Kwargs
(default: torch.float32) and thread it through Int8Tensor.from_hp(),
so that the default behavior preserves accuracy while other platforms
can override it if needed.
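
A rough sketch of the pattern the commit message above describes, assuming a trimmed-down QuantizeTensorToInt8Kwargs; the real dataclass and Int8Tensor.from_hp in torchao carry more fields and arguments, and resolve_scale_dtype is a hypothetical helper name used only for illustration:

    from dataclasses import dataclass
    from typing import Optional
    import torch

    @dataclass
    class QuantizeTensorToInt8Kwargs:
        # New field; existing callers that do not pass scale_dtype keep the
        # accurate float32 default, matching the pre-#4151 behavior.
        scale_dtype: Optional[torch.dtype] = torch.float32

    def resolve_scale_dtype(scale_dtype: Optional[torch.dtype]) -> torch.dtype:
        # The defensive fallback inside Int8Tensor.from_hp: an explicit None is
        # treated the same as float32 before the value is forwarded to
        # choose_qparams_affine.
        return scale_dtype if scale_dtype is not None else torch.float32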
@haotongzou haotongzou force-pushed the fix-int8-activation-accuracy-regression branch from c7e09af to 1436b42 on April 24, 2026 at 08:57
@Xia-Weiwen Xia-Weiwen added the module: inference quantize_ api inference flow label Apr 24, 2026
@Xia-Weiwen
Collaborator

@claude review


claude Bot commented Apr 24, 2026

Claude finished @Xia-Weiwen's task in 1m 46s.


Code Review for PR #4326

  • Read the changed files and understand the diff
  • Review the code for correctness, style, and potential issues
  • Post review feedback

Summary: This PR fixes a real accuracy regression where scale_dtype was inadvertently changed from torch.float32 to hp_tensor.dtype during the v2 tensor migration (#4151). For bfloat16 models, this caused quantization scales to be computed at reduced precision (8 bits of mantissa vs 24), leading to measurable SQNR degradation.

Verdict: The fix is correct and well-scoped. A few observations:

What looks good

  1. Root cause is accurately identified. The scale_dtype=hp_tensor.dtype on the old line 193 (int8_tensor.py) would use bfloat16 for bfloat16 models, degrading scale precision. Restoring torch.float32 as the default is the right fix.

  2. Defensive default. The scale_dtype if scale_dtype is not None else torch.float32 pattern at int8_tensor.py:193 ensures float32 even if None is explicitly passed, which is reasonable.

  3. Dataclass field addition is clean. Adding scale_dtype: Optional[torch.dtype] = torch.float32 to QuantizeTensorToInt8Kwargs is backwards-compatible since it has a default value, and existing callers (lines 895, 992, 1022 in quant_api.py) that don't pass scale_dtype will get the correct torch.float32 default.

  4. Plumbing through _choose_quant_func_and_quantize_tensor (quantize_tensor_kwargs.py:68) correctly reads from the kwargs and passes it to Int8Tensor.from_hp. The Float8 path is unaffected since it returns early before reaching this code.

Minor observations (non-blocking)

  1. Optional[torch.dtype] with non-None default: The scale_dtype: Optional[torch.dtype] = torch.float32 pattern (both in the dataclass and the from_hp method signature) allows None as a valid value, then falls back to torch.float32 anyway. This means None and torch.float32 are semantically identical, which is slightly redundant. Consider whether torch.dtype (non-Optional) would be cleaner, since there's no use case where None means something different from torch.float32. That said, this is minor and the defensive fallback is fine as-is.

  2. Config-level exposure: The Int8DynamicActivationInt8WeightConfig and Int8StaticActivationInt8WeightConfig don't expose scale_dtype yet — they construct QuantizeTensorToInt8Kwargs without passing it (e.g., quant_api.py:895-898). Since the PR description says "configurable per-platform," users currently have no way to override this through the public config API. If the intent is user configurability (not just restoring the correct default), a follow-up PR could add scale_dtype to these config classes. For now, restoring the correct default is the important fix.

Overall this is a clean, minimal fix for a legitimate regression. Approving.


@Xia-Weiwen Xia-Weiwen self-requested a review April 24, 2026 09:16

@Xia-Weiwen Xia-Weiwen left a comment


Users don't call from_hp directly. How do users set this flag in their script?
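
(For context on the question above: users typically reach this path through quantize_ with a config rather than calling from_hp directly, roughly as sketched below. The constructor arguments are not verified against a specific torchao release, and in the code under review the config does not accept a scale_dtype argument.)

    import torch
    from torchao.quantization import quantize_, Int8DynamicActivationInt8WeightConfig

    model = torch.nn.Sequential(torch.nn.Linear(64, 64)).to(torch.bfloat16)
    # The config builds the activation-quantization kwargs internally, so there
    # is no user-facing knob for scale_dtype here.
    quantize_(model, Int8DynamicActivationInt8WeightConfig())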


granularity: Granularity
mapping_type: MappingType = MappingType.SYMMETRIC
scale_dtype: Optional[torch.dtype] = torch.float32
Collaborator


To keep the original behavior as the default, the default value here should be None; when it is None, use scale_dtype = hp_tensor.dtype.

Contributor

@jerryzh168 jerryzh168 Apr 24, 2026


I don't think we want to add this to the API; it can just be hidden the same way as before:

https://raw.githubusercontent.com/pytorch/ao/662c670db64da7c3c0a66d626a40311125838316/torchao/quantization/quant_api.py

We can also put up a separate PR to hardcode this to float32 and check whether it's fine as well.

Author


Simplified the patch to a one-line hardcode of scale_dtype=torch.float32, matching the old behavior without adding a new API. It seems to be working: accuracy is fine and tests are passing.

Hardcode scale_dtype=torch.float32 directly in choose_qparams_affine,
matching the original v1 behavior without adding new API surface.
@meta-cla meta-cla Bot added the CLA Signed label Apr 27, 2026
@haotongzou haotongzou marked this pull request as ready for review April 27, 2026 07:15
@haotongzou haotongzou requested a review from andrewor14 as a code owner April 27, 2026 07:15
@andrewor14
Contributor

@claude review


claude Bot commented Apr 27, 2026

Claude finished @andrewor14's task in 1m 6s.


Code Review for PR #4326

  • Read the changed files and understand the diff
  • Review the code for correctness and check surrounding context
  • Post review feedback

Summary: After the review feedback from @Xia-Weiwen and @jerryzh168, the PR was simplified to a clean one-line fix. The net diff across both commits is:

-                scale_dtype=hp_tensor.dtype,
+                scale_dtype=torch.float32,

at int8_tensor.py:189. This restores the original pre-#4151 behavior without introducing any new API surface.

Verdict: LGTM. Clean, minimal, and correct fix.

What looks good

  1. Root cause correctly identified and fixed. When hp_tensor is bfloat16, hp_tensor.dtype passes bfloat16 as scale_dtype to choose_qparams_affine, which computes scales at reduced precision (8-bit mantissa vs 24-bit). Hardcoding torch.float32 restores the original behavior.

  2. Simplified per reviewer feedback. The first commit added a configurable scale_dtype parameter to QuantizeTensorToInt8Kwargs and from_hp. The second commit correctly reverted that, keeping only the one-line fix. The net result is zero API change — no new parameters, no new dataclass fields, no changes to quantize_tensor_kwargs.py.

  3. Consistent with the existing eps fix on the next line. Line 193 already hardcodes torch.finfo(torch.float32).eps with a comment explaining why bfloat16 eps would hurt accuracy. The scale_dtype=torch.float32 fix follows the same principle and is well-motivated by the same reasoning.

Minor note (non-blocking)

The commit history shows two commits (add configurable API, then simplify to hardcode). A squash merge would keep the history clean, but that's a merge-time decision, not a code issue.

Overall this is a well-scoped regression fix. No concerns.


@andrewor14
Contributor

Hi @haotongzou or @claude, can you point me to the original code before #4151 that used scale_dtype as torch.float32?


claude Bot commented Apr 27, 2026

Claude finished @andrewor14's task in 5m 0s.


Tracing scale_dtype=torch.float32 before #4151

  • Find the original code that hardcoded scale_dtype=torch.float32
  • Post findings with file paths and line references

Hi @andrewor14, here's what I found:

The buggy code introduced by #4151 was in torchao/quantization/quantize_/workflows/int8/int8_tensor.py in Int8Tensor.from_hp():

scale_dtype=hp_tensor.dtype,  # <-- this was the bug

When hp_tensor is bfloat16, this computes scales in bfloat16 instead of float32.

The original pre-#4151 v1 code lived in two places:

  1. torchao/quantization/quant_api.py at commit 662c670 — this is the link @jerryzh168 referenced in his review comment. The old v1 Int8DynamicActivationInt8WeightConfig path (now deleted, version 1 raises ValueError) used _quantize_activation_per_token_absmax for the activation quantization at inference time.

  2. torchao/quantization/utils.py:162-192 — the function _quantize_activation_per_token_absmax, which still exists today and contains:

    # utils.py line 175
    scale_dtype = torch.float32 if t.dtype == torch.float16 else None

    When scale_dtype=None, choose_qparams_affine falls back to input.dtype (see quant_primitives.py:1524-1525):

    if scale_dtype is None:
        scale_dtype = input.dtype

    So for the old v1 path, bfloat16 inputs actually also used bfloat16 for scale_dtype. The float32 override was only applied for float16 inputs.

However, there's an important nuance: the eps value also matters for accuracy. The old v1 code used eps = 1e-5 (a fixed float32-safe constant), while the new v2 code in this PR now correctly hardcodes eps=torch.finfo(torch.float32).eps with a comment at line 193 explaining why. So the combination of scale_dtype=torch.float32 and eps=torch.finfo(torch.float32).eps in the current fix provides strictly better numerical behavior than the old v1 path for bfloat16 models.

Summary: The original scale_dtype=torch.float32 that this PR restores did not exist verbatim in the old v1 code path for bfloat16 — the v1 path used None (which resolved to input.dtype). The v2 migration in #4151 made this explicit as hp_tensor.dtype, preserving the same bfloat16 behavior. This PR improves on the original v1 behavior by unconditionally using float32 for scale computation, which gives better accuracy for bfloat16 models.
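
Restated as a small sketch (hypothetical helper names; the rules are taken directly from the code quoted above), the effective scale dtype under each path is:

    import torch

    def v1_scale_dtype(input_dtype):
        # utils.py: float32 override only for float16 inputs; None then falls
        # back to input.dtype inside choose_qparams_affine.
        sd = torch.float32 if input_dtype == torch.float16 else None
        return sd if sd is not None else input_dtype

    def v2_before_fix(input_dtype):
        # The #4151 behavior: scale_dtype = hp_tensor.dtype.
        return input_dtype

    def v2_after_fix(input_dtype):
        # This PR: scale_dtype hardcoded to float32.
        return torch.float32

    for dt in (torch.float32, torch.float16, torch.bfloat16):
        print(dt, v1_scale_dtype(dt), v2_before_fix(dt), v2_after_fix(dt))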



Labels

CLA Signed · module: inference quantize_ api inference flow

Projects

None yet

4 participants