After #8342 is merged, we can finally tune the warpspeed scan implementation. There is currently only one benchmark, `cub/benchmarks/bench/scan/exclusive/sum.warpspeed.cu`. We should tune for at least the value types `I8`, `I16`, `I32`, `I64`, and `I128`. No offset type needs to be provided (it is always `I64`). The problem size should be at least 2^28.
Example using the random search:
$ CUDA_VISIBLE_DEVICES=0 PYTHONPATH=../benchmarks/scripts ../benchmarks/scripts/search.py -R '.*scan.exclusive.sum.warpspeed' -a 'T{ct}=I8' -a 'Elements{io}[pow2]=32'
ctk: 13.1.115
cccl: v3.4.0.dev-433-ge76addccbd
cub.bench.scan.exclusive.sum.warpspeed.wrps_8.lbi_5.ipt_176 0.6979214053331619
cub.bench.scan.exclusive.sum.warpspeed.wrps_5.lbi_4.ipt_48 0.9386672015634218
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_5.ipt_208 0.8706167012214349
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_2.ipt_112 0.8389938131208066
...
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_8.ipt_160 1.2642544877327193
The last run already shows a 1.26x speedup with 4 warps, 8 look-back items, and 160 items per thread.
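To cover all of the listed value types, the same search can be run in a loop over the `T{ct}` axis. The sketch below only prints the resulting invocations (a dry run, assuming the same working directory and flags as the example above); remove the `echo` to actually launch the searches on a GPU:

```shell
# Dry run: print one random-search invocation per value type to tune.
# Drop `echo` to execute for real (requires a GPU and the benchmark build).
for T in I8 I16 I32 I64 I128; do
  echo CUDA_VISIBLE_DEVICES=0 PYTHONPATH=../benchmarks/scripts \
    ../benchmarks/scripts/search.py \
    -R "'.*scan.exclusive.sum.warpspeed'" \
    -a "T{ct}=$T" \
    -a "'Elements{io}[pow2]=32'"
done
```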