We want to support NVFP4 training in torchao for both dense and MoE models, following the recipe described in this paper from NVIDIA.
## Support for dense models (linears)

Quantization rounding modes we need:

- Stochastic rounding (SR), via the PTX `.rs` modifier: `cvt.rs.satfinite.e2m1x4.f32 d, {a, b, e, f}, rbits; // convert 4 fp32 values to 4 packed e2m1 values, applying .rs rounding` (see https://docs.nvidia.com/cuda/parallel-thread-execution/#rounding-modifiers)
- Round-to-nearest-even, via the `.rn` modifier, same as above
- For wgrad = grad_out.t() @ input, we won't need SR for the 16x1 scaling / RHS operand

GEMMs to support:

- output = input @ weight.t()
- dgrad = grad_output @ weight
- wgrad = grad_output.t() @ input

Other requirements:

- torch.compile composability
- quantize_ model conversion API performing a module swap of nn.Linear to NVFP4Linear (wraps the autograd func); see the sketch after this list
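To make these pieces concrete, here is a minimal sketch, not the torchao implementation: a toy fake-quantizer that rounds per-16-element blocks onto the e2m1 grid (emulating the `.rn` and `.rs` PTX rounding paths in pure PyTorch rather than calling the instruction above), an autograd function wrapping the three GEMMs, and a hypothetical `swap_linears_to_nvfp4_` helper standing in for the `quantize_` module swap. All names are illustrative, and real NVFP4 kernels would operate on packed e2m1 data with fp8 block scales instead of fake-quantized floats.

```python
import torch
import torch.nn as nn

# Representable e2m1 magnitudes (FP4: 1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_nvfp4(x: torch.Tensor, block: int = 16, stochastic: bool = False) -> torch.Tensor:
    """Toy NVFP4 fake-quant: per-16-element block scales, values rounded to the
    e2m1 grid. Real NVFP4 stores packed e2m1 values plus fp8 (e4m3) block
    scales; this only round-trips through the grid to show the numerics.
    Assumes x.numel() is divisible by `block`."""
    xb = x.reshape(-1, block)
    scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 6.0  # 6.0 = e2m1 max
    mag = (xb / scale).abs().clamp(max=6.0)
    grid = E2M1_GRID.to(device=x.device, dtype=x.dtype)
    hi_idx = torch.searchsorted(grid, mag.contiguous())  # first grid point >= mag
    lo_idx = (hi_idx - 1).clamp(min=0)
    lo, hi = grid[lo_idx], grid[hi_idx]
    if stochastic:
        # Emulates the PTX .rs modifier: round up with probability (mag-lo)/(hi-lo).
        p_up = torch.where(hi > lo, (mag - lo) / (hi - lo), torch.zeros_like(mag))
        q = torch.where(torch.rand_like(mag) < p_up, hi, lo)
    else:
        # Emulates the .rn modifier: round to the nearest grid point.
        q = torch.where(mag - lo <= hi - mag, lo, hi)
    return (xb.sign() * q * scale).reshape(x.shape)


class NVFP4LinearFn(torch.autograd.Function):
    """The three dense GEMMs with fake-quantized operands (assumes 2D input)."""

    @staticmethod
    def forward(ctx, input, weight):
        ctx.save_for_backward(input, weight)
        # output = input @ weight.t()
        return fake_quant_nvfp4(input) @ fake_quant_nvfp4(weight).t()

    @staticmethod
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        # SR on the gradient operand only; per the note above, the RHS
        # (non-gradient) operands don't need SR.
        g = fake_quant_nvfp4(grad_output, stochastic=True)
        grad_input = g @ fake_quant_nvfp4(weight)      # dgrad = grad_output @ weight
        grad_weight = g.t() @ fake_quant_nvfp4(input)  # wgrad = grad_output.t() @ input
        return grad_input, grad_weight


class NVFP4Linear(nn.Linear):
    """nn.Linear whose matmuls run through the NVFP4 autograd function."""

    def forward(self, x):
        y = NVFP4LinearFn.apply(x, self.weight)
        return y if self.bias is None else y + self.bias


def swap_linears_to_nvfp4_(model: nn.Module) -> nn.Module:
    """Hypothetical module-swap pass in the spirit of torchao's quantize_ API."""
    for name, child in model.named_children():
        if type(child) is nn.Linear:
            new = NVFP4Linear(child.in_features, child.out_features,
                              bias=child.bias is not None)
            new.weight, new.bias = child.weight, child.bias
            setattr(model, name, new)
        else:
            swap_linears_to_nvfp4_(child)
    return model
```

Calling `swap_linears_to_nvfp4_(model)` would then route every nn.Linear matmul through the NVFP4 autograd function while keeping the original parameters.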
## Support for MoE layers (grouped GEMMs)

We can extend the low precision MoE training code here to support NVFP4 by doing the following (a grouped sketch follows the list):
GEMMs to support:

- output = input @ weight.transpose(-2,-1)
- dgrad = grad_output @ weight
- wgrad = grad_output.transpose(-2,-1) @ input

Other requirements:

- torch.compile composability
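A minimal sketch of the grouped variant under the same assumptions, reusing the toy `fake_quant_nvfp4` from the dense sketch: experts' weights are stacked as (E, out, in), tokens are pre-grouped per expert as (E, tokens, in), and `torch.bmm` stands in for a real grouped GEMM kernel (routing and ragged per-expert offsets are omitted).

```python
import torch


class NVFP4GroupedLinearFn(torch.autograd.Function):
    """Grouped (per-expert) version: input is (E, tokens, in), weight is
    (E, out, in); torch.bmm stands in for a real grouped GEMM kernel."""

    @staticmethod
    def forward(ctx, input, weight):
        ctx.save_for_backward(input, weight)
        # output = input @ weight.transpose(-2, -1)
        return torch.bmm(fake_quant_nvfp4(input),
                         fake_quant_nvfp4(weight).transpose(-2, -1))

    @staticmethod
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        g = fake_quant_nvfp4(grad_output, stochastic=True)  # SR on gradients only
        # dgrad = grad_output @ weight
        grad_input = torch.bmm(g, fake_quant_nvfp4(weight))
        # wgrad = grad_output.transpose(-2, -1) @ input
        grad_weight = torch.bmm(g.transpose(-2, -1), fake_quant_nvfp4(input))
        return grad_input, grad_weight
```

Each expert here gets an equal slab of tokens; a real implementation would consume the per-expert offsets produced by the router instead of a fixed (E, tokens, in) layout.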