Hi! We have recently released the DASH (Distributed Accelerated SHampoo) optimizer, and I would like to suggest integrating it into pytorch_optimizer.
DASH achieves 5x faster optimizer step time and better validation perplexity than Distributed Shampoo on LLM pretraining.
Paper or Code
Paper: https://arxiv.org/pdf/2602.02016
Training Code: https://github.com/IST-DASLab/DASH
Optimizer Code: DashGpu
Brief details of the optimizer
Our implementation builds on top of the Distributed Shampoo optimizer, which we observed to be quite slow because the preconditioner blocks are processed sequentially.
Engineering contribution
We leverage the key observation that the preconditioner blocks all have the same shape, so we can stack them into a 3D tensor and use PyTorch's built-in batched matrix-multiplication routines (even eigenvalue decomposition works on batches of matrices) instead of processing the blocks sequentially.
This greatly speeds up the iterative procedures for computing inverse matrix roots, which consist mainly of matrix multiplications.
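To illustrate the idea (a minimal sketch, not the DASH implementation itself; block count and size are made up), same-shape blocks stacked into one 3D tensor let a single batched call replace a sequential Python loop:

```python
import torch

torch.manual_seed(0)

# Hypothetical setup: 8 preconditioner blocks that all share the same shape.
num_blocks, block_size = 8, 32
blocks = [torch.randn(block_size, block_size) for _ in range(num_blocks)]
stacked = torch.stack(blocks)                    # shape: (8, 32, 32)

# Make the blocks symmetric positive definite, as Shampoo preconditioners are.
spd = stacked @ stacked.transpose(-1, -2) + 1e-3 * torch.eye(block_size)

# One batched matrix multiplication instead of 8 sequential ones.
products = torch.bmm(spd, spd)

# Eigenvalue decomposition also accepts a batch of matrices directly.
eigvals, eigvecs = torch.linalg.eigh(spd)
print(products.shape, eigvals.shape)  # torch.Size([8, 32, 32]) torch.Size([8, 32])
```

On GPUs in particular, one batched kernel launch over the stacked tensor amortizes launch overhead that a per-block loop pays repeatedly.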
Research contribution
We apply an existing advanced linear-algebra technique, the Newton-Denman-Beavers (NewtonDB) iteration, to compute the square root and inverse square root of matrices in the context of deep-learning optimization.
The Coupled Newton method already implemented in Distributed Shampoo has convergence issues, which is why eigenvalue decomposition is still the default method there for computing inverse roots.
These iterative procedures (NewtonDB and Coupled Newton) require the largest eigenvalue of the input matrix to be bounded above by 1 in order to converge.
The input matrix is scaled by its Frobenius norm to achieve this, but unfortunately the gap between the Frobenius norm and the true largest eigenvalue is large, so the eigenvalues are pushed towards zero, a regime in which both NewtonDB and Coupled Newton require more iterations to converge.
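As a rough sketch of this family of methods (the inverse-free Denman-Beavers iteration built purely from matrix multiplications; the exact NewtonDB variant used in DASH may differ), after Frobenius-norm scaling the coupled iterates converge to the square root and inverse square root:

```python
import torch

torch.manual_seed(0)

def db_sqrt(a, num_iters=30):
    """Inverse-free Denman-Beavers iteration (illustrative sketch).

    For an SPD matrix `a` with largest eigenvalue at most 1, the iterates
    converge as y -> a^{1/2} and z -> a^{-1/2}, using only matmuls.
    """
    n = a.shape[-1]
    eye = torch.eye(n, dtype=a.dtype)
    y, z = a.clone(), eye.clone()
    for _ in range(num_iters):
        t = 0.5 * (3.0 * eye - z @ y)  # correction factor; matmuls only
        y = y @ t
        z = t @ z
    return y, z

# Small SPD test matrix, scaled by its Frobenius norm so that the largest
# eigenvalue is bounded by 1 (the convergence requirement from the text).
g = torch.randn(16, 16, dtype=torch.float64)
a = g @ g.T + 0.1 * torch.eye(16, dtype=torch.float64)
a_scaled = a / torch.linalg.norm(a)  # matrix Frobenius norm by default

y, z = db_sqrt(a_scaled)
print(torch.allclose(y @ y, a_scaled, atol=1e-8))
print(torch.allclose(y @ z, torch.eye(16, dtype=torch.float64), atol=1e-8))
```

Note how the Frobenius scaling pushes the smallest eigenvalues of `a_scaled` far below 1, which is exactly the regime where the early iterations make slow progress.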
Since our preconditioner blocks are stacked, we can estimate the largest eigenvalue of all blocks in one shot using power iteration in half precision (bfloat16), making this important procedure cheap in the context of deep-learning optimization.
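A minimal sketch of the batched estimate (illustrative only, not DASH's implementation; block count, size, and iteration count are made up): one power-iteration loop advances a vector for every stacked block simultaneously in bfloat16.

```python
import torch

torch.manual_seed(0)

# Hypothetical batch of SPD preconditioner blocks, all the same shape.
num_blocks, n = 8, 64
g = torch.randn(num_blocks, n, n)
blocks = g @ g.transpose(-1, -2)

def batched_power_iteration(a, num_iters=50):
    """Estimate the largest eigenvalue of every block at once, in bfloat16."""
    a = a.to(torch.bfloat16)
    # One random starting vector per block; all blocks advance together.
    v = torch.randn(a.shape[0], a.shape[-1], 1, dtype=torch.bfloat16)
    for _ in range(num_iters):
        v = a @ v                                 # one batched matvec per step
        v = v / v.norm(dim=-2, keepdim=True)      # renormalize each block's vector
    # Rayleigh quotient v^T A v gives a per-block eigenvalue estimate.
    return (v.transpose(-1, -2) @ a @ v).squeeze(-1).squeeze(-1)

est = batched_power_iteration(blocks).float()
exact = torch.linalg.eigvalsh(blocks)[:, -1]      # reference largest eigenvalues
print((est - exact).abs() / exact)                # small relative error per block
```

Scaling by this estimate instead of the Frobenius norm keeps the spectrum close to 1, which is what lets the iterative root solvers converge in fewer steps.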
We show that the number of steps needed for convergence decreases as the largest eigenvalue increases for NewtonDB, but not for Coupled Newton.
Therefore, using NewtonDB in our DASH optimizer achieves superior results compared to both Coupled Newton and eigenvalue decomposition.