The goal is to extend the existing kernel, which processes a batch of small segments, so that large segments that cannot be processed by a single thread block are pushed onto a global queue. A subsequent kernel then picks up that queue and assigns a (presumably load-balanced) number of CTAs to each of those segments.
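To make the intended flow concrete, here is a minimal host-side C++ sketch of the two passes. It is not the CUB implementation: `LargeSegment`, `LargeSegmentQueue`, `first_pass`, `assign_ctas`, `max_block_segment_size`, and `items_per_cta` are all hypothetical names, the sequential loop stands in for thread blocks, and `std::atomic::fetch_add` stands in for a device-side `atomicAdd` on the queue counter. The ceil-division CTA count is only one possible load-balancing heuristic.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor for a segment too large for one thread block.
struct LargeSegment {
  std::size_t segment_id;
  std::size_t size;
};

// Hypothetical global queue: a preallocated buffer plus an atomic tail counter.
// On the device, the counter would live in global memory and be bumped with atomicAdd.
struct LargeSegmentQueue {
  std::vector<LargeSegment> slots;
  std::atomic<std::size_t> count{0};

  explicit LargeSegmentQueue(std::size_t capacity) : slots(capacity) {}

  void push(LargeSegment seg) {
    // Reserve a unique slot, then write the descriptor into it.
    const std::size_t idx = count.fetch_add(1, std::memory_order_relaxed);
    assert(idx < slots.size());
    slots[idx] = seg;
  }
};

// Stand-in for the existing kernel: one loop iteration plays the role of one thread block.
// Small segments are handled in place; large ones are deferred to the queue.
// Returns the number of segments handled in place.
std::size_t first_pass(const std::vector<std::size_t>& segment_sizes,
                       std::size_t max_block_segment_size,
                       LargeSegmentQueue& queue) {
  std::size_t processed_in_place = 0;
  for (std::size_t id = 0; id < segment_sizes.size(); ++id) {
    if (segment_sizes[id] <= max_block_segment_size) {
      ++processed_in_place;  // the per-block top-k would run here
    } else {
      queue.push({id, segment_sizes[id]});
    }
  }
  return processed_in_place;
}

// Stand-in for the planning step of the subsequent kernel: give each queued
// segment ceil(size / items_per_cta) CTAs (an assumed heuristic).
std::vector<std::size_t> assign_ctas(const LargeSegmentQueue& queue,
                                     std::size_t items_per_cta) {
  assert(items_per_cta > 0);
  const std::size_t n = queue.count.load();
  std::vector<std::size_t> ctas(n);
  for (std::size_t i = 0; i < n; ++i) {
    ctas[i] = (queue.slots[i].size + items_per_cta - 1) / items_per_cta;
  }
  return ctas;
}
```

In this sketch the queue capacity is assumed to equal the number of segments, so a push can never overflow. On a real device the order of queue entries would depend on block scheduling rather than segment index.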
Tasks:
Depends on:
cub::DeviceSegmentedTopK for arbitrary segment sizes #8360