Collect use-cases requiring hierarchical parallelism #8358

@NaderAlAwar

Description

We are planning to add an API for hierarchical parallelism (#6410) to CUB, and we would like your help identifying use cases that would benefit from it so that they can inform the design. A good use case shows where the existing CUB (or Thrust) API lacks abstractions for cooperative processing of a single logical element by multiple threads, synchronization between phases of work on related data, and composition of algorithmic stages with mismatched granularities, such as group-level mask generation followed by element-level transformation.

Because these abstractions are missing, it should be possible to show, for each use case we are looking for, that a simple custom CUDA C++ kernel currently outperforms the equivalent CUB implementation.
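To make the pattern concrete, here is a minimal sketch of the kind of hand-written kernel we have in mind. It uses RMS norm as the example and assigns one thread block to each row: the block cooperatively reduces the row, synchronizes, and then every thread rescales its own elements. The names `rms_norm_rows` and `rms_norm` are hypothetical, the learned per-column weight used in real RMS norm layers is left out, and error checking is omitted, so treat this as an illustration rather than a reference implementation. With the current device-wide algorithms, the same computation would typically be written as a segmented reduction followed by a separate transform.

```cuda
#include <cstddef>
#include <vector>

#include <cub/block/block_reduce.cuh>
#include <cuda_runtime.h>

// Hypothetical example kernel: one thread block cooperatively normalizes one row.
// Assumes it is launched with exactly BlockThreads threads per block.
template <int BlockThreads>
__global__ void rms_norm_rows(const float* in, float* out, int cols, float eps)
{
  using BlockReduce = cub::BlockReduce<float, BlockThreads>;
  __shared__ typename BlockReduce::TempStorage temp_storage;
  __shared__ float inv_rms;

  const float* row_in = in + static_cast<std::size_t>(blockIdx.x) * cols;
  float* row_out      = out + static_cast<std::size_t>(blockIdx.x) * cols;

  // Phase 1: each thread accumulates a partial sum of squares over a strided slice.
  float partial = 0.0f;
  for (int i = threadIdx.x; i < cols; i += BlockThreads)
  {
    partial += row_in[i] * row_in[i];
  }

  // Block-wide reduction; the result is only valid in thread 0.
  const float sum_sq = BlockReduce(temp_storage).Sum(partial);
  if (threadIdx.x == 0)
  {
    inv_rms = rsqrtf(sum_sq / cols + eps);
  }

  // Synchronize between phases so every thread sees the row-level scale.
  __syncthreads();

  // Phase 2: element-level transformation using the group-level result.
  for (int i = threadIdx.x; i < cols; i += BlockThreads)
  {
    row_out[i] = row_in[i] * inv_rms;
  }
}

// Hypothetical host wrapper (no error checking) for a row-major rows x cols matrix.
std::vector<float> rms_norm(const std::vector<float>& host_in, int rows, int cols, float eps = 1e-6f)
{
  constexpr int block_threads = 256;
  const std::size_t bytes     = host_in.size() * sizeof(float);
  std::vector<float> host_out(host_in.size());

  float* d_in  = nullptr;
  float* d_out = nullptr;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMemcpy(d_in, host_in.data(), bytes, cudaMemcpyHostToDevice);

  rms_norm_rows<block_threads><<<rows, block_threads>>>(d_in, d_out, cols, eps);

  // The blocking copy on the default stream waits for the kernel to finish.
  cudaMemcpy(host_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return host_out;
}
```

The block-per-row mapping is only one possible choice; for short rows a warp per row may be a better fit, which is exactly the kind of granularity decision a hierarchical API would need to express.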

Use cases

Each use case listed should follow this pattern:

Domain: this can be either a library name, e.g., "cuDF", or a more general domain like "LLM Inference"
Kernel name and description: If the kernel/workload is well known and has a name, list it here, e.g., RMS norm. Also include a short description of what the kernel does.
Existing implementations: If there is an existing implementation in a library, link it so that we can compare it to an equivalent CUB implementation and measure the performance difference between the two.
Performance measurements (optional): If you can collect performance numbers comparing the CUB implementation to the custom CUDA C++ kernel, that would help a lot.

This issue can be closed once we have collected enough use cases to proceed with a PoC implementation of hierarchical parallelism.
