Modify LTimes to match Kripke LTimes#684
Conversation
… inner loop. Grid/block size changes to match Kripke Ltimessdom
|
Yes you would have to make LTIMES and LTIMES_NO_VIEW match. |
Done |
| RAJA::statement::CudaKernelAsync< | ||
| RAJA::statement::For<1, RAJA::cuda_block_x_loop, // z | ||
| RAJA::statement::For<2, RAJA::cuda_block_y_loop, // g | ||
| RAJA::statement::For<3, RAJA::cuda_thread_x_loop, // m |
There was a problem hiding this comment.
I assume this is a non-size loop policy because it is in ltimes. Here we know the block size at compile time, is that also true in kripke?
There was a problem hiding this comment.
In Kripke, the block sizes are determined by parameters passed in at runtime. For example, like in this version of LTimes, it will be blocked on zones and groups. The exact parameters which will be blocked are not always the same for each Kripke run. For instance, if we use the DZG layout at runtime, then the loops will be blocked with directions and zones (while groups are threaded).
| static const size_t default_gpu_block_size = 256; | ||
| using gpu_block_sizes_type = integer::make_gpu_block_size_list_type<default_gpu_block_size, | ||
| integer::MultipleOf<32>>; | ||
| static const size_t default_gpu_block_size = 25; |
There was a problem hiding this comment.
How was a block size of 25 chosen? Would it be better for both GPU platforms to set this to 32?
There was a problem hiding this comment.
It is the square of legendre order + 1. In kripke this is also the case, the default legendre is 4, so (4+1)^2 means the kernel in kripke will be (5,5,1)
In RAJAPerf, we set m=25 directly (and this is the default). So m=36 would then be equivalent to legendre=5 in kripke.
| RAJA::statement::For<1, RAJA::cuda_global_size_z_direct<z_block_sz>, //z | ||
| RAJA::statement::For<2, RAJA::cuda_global_size_y_direct<g_block_sz>, //g | ||
| RAJA::statement::For<3, RAJA::cuda_global_size_x_direct<m_block_sz>, //m | ||
| RAJA::statement::CudaKernelAsync< |
There was a problem hiding this comment.
Note that I've reverted LTimes to launch synchronously in Kripke, for correctness. This is fine though because the direction loop is inner-most, which should avoid race conditions.
There was a problem hiding this comment.
It has been async in RAJAPerf, I just changed it from CudaKernelFixedAsync to CudaKernelAsync, but for completeness I can make it CudaKernel. I don't this would matter for performance in RAJAPerf.
|
Do we want to only have this kripke conforming tuning, or should does it make sense to keep the current tuning as well? |
|
@michaelmckinsey1 take a look here: https://github.com/llnl/RAJA/blob/develop/benchmark/ltimes.cpp, it would be cool to also have a GPU shared memory version as a tuning! |
Matching parameters like zones, groups, moments, and directions, the runtime of RAJAPerf LTimes is equivalent (<1%) to Kripke LTimes on CPU. But I have noticed on GPU, runtime varies by ~15% on CUDA and ~60% on ROCm. After these changes CUDA runtime is within 1% and ROCm within 6% (CUDA is now faster, ROCm actually slower with block 25). This is because AMD wavefront size 64 on block 25 wastes significantly more threads than NVIDIA warp size 32.
Summary
elllayout fromdcontiguous tomcontiguous to make accesses coalesced. This matches Kripke ZGD, whereField_Ellis layout orderingmomentas stride-1. Doesn't apply to psi because does not depend on m (threads access same value), phi because no d (already contiguous inm).minstead of 256 and remapped the kernel so z -> blockIdx.x, g -> blockIdx.y, m -> threadIdx.x. This changes the launch togrid=(num_z, num_g, 1), block=(m, 1, 1), which matches the kripke launch. However, running at different legendre orders (m) will result in different block sizes.Would need to make similar changes fordoneLTIMES-NOVIEW?