Based on the algorithms identified in #8361, we should design an API that can express each of the use cases mentioned in #8358 as a single kernel.
It would also be good to study the existing solutions in other frameworks mentioned in #6410.
This issue can be closed once we have a PoC implementation that shows a performance improvement over the existing custom CUDA C++ kernels mentioned in #8358.