Fix int8 dynamic activation quantization accuracy regression from v2 tensor migration #4326
haotongzou wants to merge 2 commits into pytorch:main
Conversation
Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4326
Note: Links to docs will display an error until the docs builds have been completed.
There is 1 currently active SEV. If your PR is affected, please view it below.
As of commit 9c98887 with merge base 67a78e5: 4 new failures, 1 pending.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @haotongzou! Thank you for your pull request and welcome to our community.
Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations; afterwards, the pull request will be tagged. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
The v2 int8 activation quantization path uses hp_tensor.dtype as the scale_dtype in choose_qparams_affine. When the input is bfloat16, this causes scale computation in bfloat16 precision, resulting in ~0.6 dB SQNR degradation compared to the v1 code path which used float32. Add a configurable scale_dtype field to QuantizeTensorToInt8Kwargs (default: torch.float32) and thread it through Int8Tensor.from_hp(), so that the default behavior preserves accuracy while other platforms can override it if needed.
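A rough sketch of the approach described above, assuming a dataclass-style kwargs object; the surrounding fields and the exact `Int8Tensor.from_hp()` plumbing are simplified and may differ from the actual torchao code:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class QuantizeTensorToInt8Kwargs:
    # Other existing fields (granularity, mapping_type, ...) omitted here.
    # New field: dtype used when computing quantization scales. Defaulting to
    # float32 preserves the v1 accuracy; platforms that prefer scales in the
    # input dtype can override it.
    scale_dtype: Optional[torch.dtype] = torch.float32


# Inside Int8Tensor.from_hp (simplified), the field would be threaded through
# to choose_qparams_affine instead of hp_tensor.dtype:
#
#     scale, zero_point = choose_qparams_affine(
#         hp_tensor, ..., scale_dtype=kwargs.scale_dtype  # was: hp_tensor.dtype
#     )
```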
The branch was updated from c7e09af to 1436b42.
@claude review
Claude finished @Xia-Weiwen's task in 1m 46s. Code Review for PR #4326
Summary: This PR fixes a real accuracy regression where the v2 int8 activation quantization path computed scales in the input dtype (bfloat16) instead of float32.
Verdict: The fix is correct and well-scoped. A few observations:
What looks good
Minor observations (non-blocking)
Overall this is a clean, minimal fix for a legitimate regression. Approving.
Xia-Weiwen left a comment:
Users don't call from_hp directly. How do users set this flag in their script?
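For context on this question, a minimal sketch of how the int8 dynamic-activation path is typically reached from a user script, assuming torchao's `quantize_` API and `Int8DynamicActivationInt8WeightConfig`; the point is that the kwargs object is constructed internally, so any `scale_dtype` knob would also need to be exposed on the user-facing config (hypothetical, not shown):

```python
import torch
from torchao.quantization import Int8DynamicActivationInt8WeightConfig, quantize_

# A toy bfloat16 model; users never call Int8Tensor.from_hp themselves.
model = torch.nn.Sequential(torch.nn.Linear(64, 64)).to(torch.bfloat16)

# Quantization is requested through the config object, so a per-tensor
# scale_dtype setting would have to be plumbed from here down to from_hp.
quantize_(model, Int8DynamicActivationInt8WeightConfig())
```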
granularity: Granularity
mapping_type: MappingType = MappingType.SYMMETRIC
scale_dtype: Optional[torch.dtype] = torch.float32
To keep the original behavior as the default, the default value here would better be None; when it is None, use scale_dtype = hp_tensor.dtype.
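A small illustration of this suggestion; `resolve_scale_dtype` is a hypothetical helper showing the None-means-input-dtype fallback, not code from the PR:

```python
from typing import Optional

import torch


def resolve_scale_dtype(
    hp_tensor: torch.Tensor, scale_dtype: Optional[torch.dtype] = None
) -> torch.dtype:
    # None keeps the pre-fix v2 behavior (scales follow the input dtype);
    # passing torch.float32 explicitly opts back into the v1 behavior.
    return scale_dtype if scale_dtype is not None else hp_tensor.dtype
```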
I don't think we want to add this to the API; it can just be hidden the same way as before.
We can also put up a separate PR to hardcode this to float32 and check if it's fine as well.
Simplified the patch to a one-line hardcode of scale_dtype=torch.float32, matching the old behavior without a new API. It seems to be working - accuracy is fine, and tests are passing.
Hardcode scale_dtype=torch.float32 directly in choose_qparams_affine, matching the original v1 behavior without adding new API surface.
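A hedged sketch of what the fixed call might look like inside Int8Tensor.from_hp; the argument list is abbreviated and the per-row block_size shown is illustrative, so it may not match the real call exactly:

```python
import torch
from torchao.quantization.quant_primitives import MappingType, choose_qparams_affine


def compute_int8_qparams(hp_tensor: torch.Tensor):
    # Per-row symmetric int8 qparams; scale_dtype is now pinned to float32
    # (it was hp_tensor.dtype before this fix).
    return choose_qparams_affine(
        hp_tensor,
        MappingType.SYMMETRIC,
        block_size=(1, hp_tensor.shape[-1]),
        target_dtype=torch.int8,
        scale_dtype=torch.float32,
    )
```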
@claude review
Claude finished @andrewor14's task in 1m 6s. Code Review for PR #4326
Summary: After the review feedback from @Xia-Weiwen and @jerryzh168, the PR was simplified to a clean one-line fix. The net diff across both commits is:

- scale_dtype=hp_tensor.dtype,
+ scale_dtype=torch.float32,

Verdict: LGTM. Clean, minimal, and correct fix.
What looks good
Minor note (non-blocking): The commit history shows two commits (add configurable API, then simplify to hardcode). A squash merge would keep the history clean, but that's a merge-time decision, not a code issue.
Overall this is a well-scoped regression fix. No concerns.
Hi @haotongzou or @claude, can you point me to the original code before #4151 that used float32 as the scale dtype?
Claude finished @andrewor14's task in 5m 0s. Tracing:
The v2 int8 dynamic activation quantization path introduced in #4151 uses hp_tensor.dtype as scale_dtype in choose_qparams_affine. When the input tensor is bfloat16, this computes quantization scales in bfloat16 precision instead of float32, causing ~0.6 dB SQNR degradation and measurable accuracy loss on downstream tasks (e.g. MMLU on Llama-3.1-8B-Instruct). This PR replaces hp_tensor.dtype with a configurable scale_dtype so it can be overridden per platform.
Regression Cause
The original code path hardcoded scale_dtype=torch.float32. The refactor in #4151 changed this to hp_tensor.dtype, which for bfloat16 models means scale computation happens at reduced precision. Since bfloat16 has only 8 bits of mantissa (vs float32's 24), the quantization scales lose precision, degrading output quality.
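For intuition, a small self-contained experiment (not code from the PR or torchao) comparing a per-row symmetric int8 round trip with scales computed in bfloat16 versus float32:

```python
import torch

torch.manual_seed(0)
x = torch.randn(512, 512, dtype=torch.bfloat16)


def sqnr(ref: torch.Tensor, approx: torch.Tensor) -> float:
    # Signal-to-quantization-noise ratio in dB.
    err = ref - approx
    return (10 * torch.log10(ref.pow(2).sum() / err.pow(2).sum())).item()


def int8_round_trip(x: torch.Tensor, scale_dtype: torch.dtype) -> torch.Tensor:
    # Per-row symmetric int8 quantize/dequantize with scales in scale_dtype.
    amax = x.abs().amax(dim=-1, keepdim=True).to(scale_dtype)
    scale = amax / 127
    q = torch.clamp(torch.round(x.to(scale_dtype) / scale), -127, 127)
    return q.float() * scale.float()


ref = x.float()
print("fp32 scales:", sqnr(ref, int8_round_trip(x, torch.float32)))
print("bf16 scales:", sqnr(ref, int8_round_trip(x, torch.bfloat16)))
# Expected (illustratively): the bfloat16-scale variant reports a lower SQNR,
# in the same direction as the ~0.6 dB gap measured in this PR.
```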
Validation
Llama-3.1-8B-Instruct: restores top-1 accuracy in tests to the v1 level.
No built-in tests should be affected.