Commit 7467727

r3t2 authored and facebook-github-bot committed
Use in-place ops in _quantize_affine_float8 to reduce peak memory
Summary: [torchao] Use in-place ops in _quantize_affine_float8 to reduce peak memory

`_quantize_affine_float8` allocated up to 3 separate float32 copies of the input tensor (via `.to()`, `/`, and `.clamp()`). For large activations this caused unnecessary memory pressure and OOMs. Switch to in-place `div_()` and `clamp_()` so only a single float32 copy is ever live. Use `copy=True` on the `.to()` call to guarantee a fresh buffer even when the input is already float32, preventing mutation of the caller's tensor.

Differential Revision: D96350390
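The allocation pattern described above can be sketched outside of torch. A minimal NumPy stand-in (hypothetical function names, not the torchao code; FP8_MAX assumed to be the float8_e4m3fn maximum, 448.0): chaining out-of-place ops materializes a fresh full-size float32 temporary at every step, while the in-place variant reuses one buffer.

```python
import numpy as np

FP8_MAX = 448.0  # assumed: max finite value of float8_e4m3fn

def quantize_out_of_place(tensor, scale):
    """Naive version: each step allocates a new full-size float32 array."""
    t32 = tensor.astype(np.float32)               # copy 1
    scaled = t32 / scale                          # copy 2
    clamped = np.clip(scaled, -FP8_MAX, FP8_MAX)  # copy 3
    return clamped

def quantize_in_place(tensor, scale):
    """In-place version: a single float32 buffer is reused for every step."""
    t32 = tensor.astype(np.float32, copy=True)  # fresh buffer, caller untouched
    np.divide(t32, scale, out=t32)              # in-place divide
    np.clip(t32, -FP8_MAX, FP8_MAX, out=t32)    # in-place clamp
    return t32

x = np.array([1.0, -900.0, 900.0], dtype=np.float32)
out = quantize_in_place(x, 2.0)
# out-of-range values are clamped to +/-FP8_MAX; x itself is unchanged
```

Both versions compute the same result; only the number of live full-size temporaries differs, which is exactly the 3x-to-1x peak-memory reduction the commit message claims.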
1 parent 95d366c commit 7467727

1 file changed: torchao/quantization/quant_primitives.py

Lines changed: 9 additions & 4 deletions
@@ -2327,15 +2327,20 @@ def _quantize_affine_float8(
     """
     Quantizes the high precision floating point tensor to a float8 tensor, using the given scaling factor.
     """
-    tensor_fp32 = tensor.to(torch.float32)
+    # copy=True guarantees a fresh tensor even when the input is already fp32,
+    # so the in-place div_/clamp_ below never mutate the caller's tensor.
+    tensor_fp32 = tensor.to(torch.float32, copy=True)

     # Expand scale to match tensor dimensions for block-wise quantization
     scale_expanded = _maybe_expand_scale_to_tensor_shape(scale, tensor.shape)

-    tensor_scaled = tensor_fp32 / scale_expanded
+    # Use in-place ops to avoid allocating additional float32 copies of the
+    # full tensor. This reduces peak memory from 3x to 1x the float32
+    # tensor size — critical for large activations (e.g. video VAE decode).
+    tensor_fp32.div_(scale_expanded)
     max_value = torch.finfo(float8_dtype).max
-    tensor_clamped = tensor_scaled.clamp(min=-max_value, max=max_value)
-    return _RoundToFloat8.apply(tensor_clamped, float8_dtype)
+    tensor_fp32.clamp_(min=-max_value, max=max_value)
+    return _RoundToFloat8.apply(tensor_fp32, float8_dtype)


 def _dequantize_affine_float8(
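The `copy=True` detail matters because `Tensor.to` returns the input tensor itself when the dtype already matches, so the in-place ops would write straight through to the caller's data. The same hazard can be shown with NumPy's `astype` (used here as a stand-in for `Tensor.to`; `copy=False` in NumPy, like torch's default, may alias the input when no conversion is needed):

```python
import numpy as np

x = np.array([2.0, 4.0], dtype=np.float32)

# copy=False may hand back the very same buffer when the dtype already matches...
alias = x.astype(np.float32, copy=False)
alias /= 2.0             # ...so this in-place divide also changes x
mutated = x.copy()       # snapshot: x has been silently mutated

# copy=True always allocates a fresh buffer, so the caller's array survives
y = np.array([2.0, 4.0], dtype=np.float32)
fresh = y.astype(np.float32, copy=True)
fresh /= 2.0             # only the private copy changes; y is intact
```

This is why the commit pairs the switch to `div_()`/`clamp_()` with `copy=True` on the `.to()` call: in-place ops are only safe once the function owns the buffer it mutates.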
