After #8342 is merged, we can finally tune the warpspeed scan implementation. There is currently only one benchmark, `cub/benchmarks/bench/scan/exclusive/sum.warpspeed.cu`. We should tune for at least the value types `I8`, `I16`, `I32`, `I64`, and `I128`. No offset type needs to be provided (it is always `I64`). The problem size should be at least 2^28.
Example using the random search:
$ CUDA_VISIBLE_DEVICES=0 PYTHONPATH=../benchmarks/scripts ../benchmarks/scripts/search.py -R '.*scan.exclusive.sum.warpspeed' -a 'T{ct}=I8' -a 'Elements{io}[pow2]=32'
ctk: 13.1.115
cccl: v3.4.0.dev-433-ge76addccbd
cub.bench.scan.exclusive.sum.warpspeed.wrps_8.lbi_5.ipt_176 0.6979214053331619
cub.bench.scan.exclusive.sum.warpspeed.wrps_5.lbi_4.ipt_48 0.9386672015634218
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_5.ipt_208 0.8706167012214349
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_2.ipt_112 0.8389938131208066
...
cub.bench.scan.exclusive.sum.warpspeed.wrps_4.lbi_8.ipt_160 1.2642544877327193
The last run already shows a 1.26x speedup with 4 warps, 8 look-back items, and 160 items per thread.
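To cover all of the listed value types, the same search can be run in a loop over the `T{ct}` axis. The sketch below only prints the resulting invocations (a dry run, assuming the same working directory and flags as the example above); remove the `echo` to actually launch the searches on a GPU:

```shell
# Dry run: print one random-search invocation per value type to tune.
# Drop `echo` to execute for real (requires a GPU and the benchmark build).
for T in I8 I16 I32 I64 I128; do
  echo CUDA_VISIBLE_DEVICES=0 PYTHONPATH=../benchmarks/scripts \
    ../benchmarks/scripts/search.py \
    -R "'.*scan.exclusive.sum.warpspeed'" \
    -a "T{ct}=$T" \
    -a "'Elements{io}[pow2]=32'"
done
```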