Design API for hierarchical parallelism

Based on the algorithms identified in #8361, we should design an API that can express each of the use cases mentioned in #8358 as a single kernel.

It would also be good to study the existing solutions in the other frameworks mentioned in #6410.

This issue can be closed once we have a proof-of-concept (PoC) implementation that shows a performance improvement over the existing custom CUDA C++ kernels mentioned in #8358.