Hi! We have recently released the DASH (Distributed Accelerated SHampoo) optimizer, and I would like to suggest integrating it into pytorch_optimizer.
DASH achieves 5x faster optimizer step time and better validation perplexity than Distributed Shampoo on LLM pretraining.
Paper or Code
Paper: https://arxiv.org/pdf/2602.02016
Training Code: https://github.com/IST-DASLab/DASH
Optimizer Code: DashGpu
Brief details of the optimizer
Our implementation builds on top of the Distributed Shampoo optimizer, which we observed to be quite slow because the preconditioner blocks are processed sequentially.
Engineering contribution
We leverage the key observation that the preconditioner blocks all have the same shape, so we can stack them into a 3D tensor and use PyTorch's built-in batched matrix-multiplication routines (even eigenvalue decomposition works on batches of matrices) instead of processing the blocks sequentially.
This greatly speeds up the iterative procedures for computing inverse matrix roots, which consist mainly of matrix multiplications.
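To illustrate the idea (a minimal sketch, not the DASH implementation itself; block count and size are made up), same-shape blocks stacked into one 3D tensor let a single batched call replace a sequential Python loop:

```python
import torch

torch.manual_seed(0)

# Hypothetical setup: 8 preconditioner blocks that all share the same shape.
num_blocks, block_size = 8, 32
blocks = [torch.randn(block_size, block_size) for _ in range(num_blocks)]
stacked = torch.stack(blocks)                    # shape: (8, 32, 32)

# Make the blocks symmetric positive definite, as Shampoo preconditioners are.
spd = stacked @ stacked.transpose(-1, -2) + 1e-3 * torch.eye(block_size)

# One batched matrix multiplication instead of 8 sequential ones.
products = torch.bmm(spd, spd)

# Eigenvalue decomposition also accepts a batch of matrices directly.
eigvals, eigvecs = torch.linalg.eigh(spd)
print(products.shape, eigvals.shape)  # torch.Size([8, 32, 32]) torch.Size([8, 32])
```

On GPUs in particular, one batched kernel launch over the stacked tensor amortizes launch overhead that a per-block loop pays repeatedly.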
Research contribution
We apply an existing advanced linear-algebra technique, the Newton-Denman-Beavers (NewtonDB) iteration, to compute the square root and inverse square root of matrices in the context of deep-learning optimization.
The Coupled Newton method already implemented in Distributed Shampoo has convergence issues, which is why eigenvalue decomposition is still the default method there for computing inverse roots.
These iterative procedures (NewtonDB and Coupled Newton) require the largest eigenvalue of the input matrix to be bounded above by 1 in order to converge.
The input matrix is scaled by its Frobenius norm to achieve this, but unfortunately the gap between the Frobenius norm and the true largest eigenvalue is large, so the eigenvalues are pushed towards zero, a regime in which both NewtonDB and Coupled Newton require more iterations to converge.
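As a rough sketch of this family of methods (the inverse-free Denman-Beavers iteration built purely from matrix multiplications; the exact NewtonDB variant used in DASH may differ), after Frobenius-norm scaling the coupled iterates converge to the square root and inverse square root:

```python
import torch

torch.manual_seed(0)

def db_sqrt(a, num_iters=30):
    """Inverse-free Denman-Beavers iteration (illustrative sketch).

    For an SPD matrix `a` with largest eigenvalue at most 1, the iterates
    converge as y -> a^{1/2} and z -> a^{-1/2}, using only matmuls.
    """
    n = a.shape[-1]
    eye = torch.eye(n, dtype=a.dtype)
    y, z = a.clone(), eye.clone()
    for _ in range(num_iters):
        t = 0.5 * (3.0 * eye - z @ y)  # correction factor; matmuls only
        y = y @ t
        z = t @ z
    return y, z

# Small SPD test matrix, scaled by its Frobenius norm so that the largest
# eigenvalue is bounded by 1 (the convergence requirement from the text).
g = torch.randn(16, 16, dtype=torch.float64)
a = g @ g.T + 0.1 * torch.eye(16, dtype=torch.float64)
a_scaled = a / torch.linalg.norm(a)  # matrix Frobenius norm by default

y, z = db_sqrt(a_scaled)
print(torch.allclose(y @ y, a_scaled, atol=1e-8))
print(torch.allclose(y @ z, torch.eye(16, dtype=torch.float64), atol=1e-8))
```

Note how the Frobenius scaling pushes the smallest eigenvalues of `a_scaled` far below 1, which is exactly the regime where the early iterations make slow progress.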
Since our preconditioner blocks are stacked, we can estimate the largest eigenvalue of all blocks in one shot using power iteration in half precision (bfloat16), making this important procedure cheap in the context of deep-learning optimization.
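A minimal sketch of the batched estimate (illustrative only, not DASH's implementation; block count, size, and iteration count are made up): one power-iteration loop advances a vector for every stacked block simultaneously in bfloat16.

```python
import torch

torch.manual_seed(0)

# Hypothetical batch of SPD preconditioner blocks, all the same shape.
num_blocks, n = 8, 64
g = torch.randn(num_blocks, n, n)
blocks = g @ g.transpose(-1, -2)

def batched_power_iteration(a, num_iters=50):
    """Estimate the largest eigenvalue of every block at once, in bfloat16."""
    a = a.to(torch.bfloat16)
    # One random starting vector per block; all blocks advance together.
    v = torch.randn(a.shape[0], a.shape[-1], 1, dtype=torch.bfloat16)
    for _ in range(num_iters):
        v = a @ v                                 # one batched matvec per step
        v = v / v.norm(dim=-2, keepdim=True)      # renormalize each block's vector
    # Rayleigh quotient v^T A v gives a per-block eigenvalue estimate.
    return (v.transpose(-1, -2) @ a @ v).squeeze(-1).squeeze(-1)

est = batched_power_iteration(blocks).float()
exact = torch.linalg.eigvalsh(blocks)[:, -1]      # reference largest eigenvalues
print((est - exact).abs() / exact)                # small relative error per block
```

Scaling by this estimate instead of the Frobenius norm keeps the spectrum close to 1, which is what lets the iterative root solvers converge in fewer steps.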
We show that the number of steps needed for convergence decreases as the largest eigenvalue increases for NewtonDB, but not for Coupled Newton.
Therefore, using NewtonDB in our DASH optimizer achieves superior results compared to both Coupled Newton and eigenvalue decomposition.