Commit 7467727
Use in-place ops in _quantize_affine_float8 to reduce peak memory
Summary:
[torchao] Use in-place ops in _quantize_affine_float8 to reduce peak memory
`_quantize_affine_float8` allocated up to 3 separate float32 copies of
the input tensor (via `.to()`, `/`, and `.clamp()`). For large
activations this caused unnecessary memory pressure and OOM.
Switch to in-place `div_()` and `clamp_()` so only a single float32
copy is ever live. Use `copy=True` on the `.to()` call to guarantee a
fresh buffer even when the input is already float32, preventing
mutation of the caller's tensor.
Differential Revision: D963503901
parent 95d366c
commit 7467727
1 file changed
Lines changed: 9 additions & 4 deletions
@@ -2327,15 +2327,20 @@