[WIP] MASS3DPA element batching #693
| // linear | ||
| // constexpr RAJA::Index_type D1D = 2; | ||
| // constexpr RAJA::Index_type Q1D = 2; | ||
| // } | ||
| // quadratic | ||
| // constexpr RAJA::Index_type D1D = 4; | ||
| // constexpr RAJA::Index_type Q1D = 4; | ||
| // cubic | ||
| // constexpr RAJA::Index_type D1D = 6; | ||
| // constexpr RAJA::Index_type Q1D = 6; | ||
| constexpr RAJA::Index_type D1D = 2; | ||
| constexpr RAJA::Index_type Q1D = 2; | ||
|
|
||
| constexpr RAJA::Index_type TBATCH = 16; // linear | ||
| // constexpr RAJA::Index_type TBATCH = 2; // quadratic | ||
| // constexpr RAJA::Index_type TBATCH = 1; // cubic |
These are the values I observed via Laghos. I'm looking for feedback here: should we try to have different configurations of MASS3DPA for different orders?
First, what do we want to cover? If we want to do more, I have thought about templating entire kernels to handle things like different orders, and maybe using explicit template instantiation to keep the instantiations in separate files in a similar way.
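A minimal sketch of the templating idea described above, written as host-side C++ so it compiles on its own. `Mass3DPAConfig`, `threads_per_block`, and `threads_per_block_for` are hypothetical names, not RAJAPerf or MFEM API. The per-order values are copied from the constants in this diff, and the `Q1D x Q1D x TBATCH` thread-block shape is an assumption.

```cpp
#include <cstddef>

// Hypothetical per-order configuration; values copied from the diff above.
template <int ORDER> struct Mass3DPAConfig;

template <> struct Mass3DPAConfig<1> {  // linear
  static constexpr int D1D = 2, Q1D = 2, TBATCH = 16;
};
template <> struct Mass3DPAConfig<2> {  // quadratic
  static constexpr int D1D = 4, Q1D = 4, TBATCH = 2;
};
template <> struct Mass3DPAConfig<3> {  // cubic
  static constexpr int D1D = 6, Q1D = 6, TBATCH = 1;
};

// Stand-in for a kernel templated on order. Here it only reports the
// thread-block size, assuming a Q1D x Q1D x TBATCH block (an assumption).
template <int ORDER>
std::size_t threads_per_block() {
  using C = Mass3DPAConfig<ORDER>;
  return static_cast<std::size_t>(C::Q1D) * C::Q1D * C::TBATCH;
}

// Explicit instantiations. In a real build, each line could live in its own
// .cpp file so the orders compile separately.
template std::size_t threads_per_block<1>();
template std::size_t threads_per_block<2>();
template std::size_t threads_per_block<3>();

// Runtime dispatch to the compile-time variants; returns 0 for an order
// that has no configuration.
std::size_t threads_per_block_for(int order) {
  switch (order) {
    case 1: return threads_per_block<1>();
    case 2: return threads_per_block<2>();
    case 3: return threads_per_block<3>();
    default: return 0;
  }
}
```

With these values the linear, quadratic, and cubic cases use 64, 32, and 36 threads per block respectively.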
It would also be helpful to see how the performance of quadratic compares with and without RAJA since that's what MARBL mainly uses.
Nice, @michaelmckinsey1! It would be neat to add a batching parameter and allow for different batch sizes; maybe 4 is good for GPU X and 7 is good for GPU Y, that type of thing.
I think he means different batch sizes as tunings.
Yes, sounds good to me!
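One way to picture batch sizes as tunings, as a rough host-side sketch. `num_blocks`, `run_tuning`, and `run_with_batch` are hypothetical names, and the set of batch sizes (1, 2, 4, 16) is only an example, not a measured recommendation for any GPU.

```cpp
#include <cstddef>

// Number of thread blocks needed when each block processes `batch` elements
// (ceiling division, so a final partial batch still gets a block).
constexpr std::size_t num_blocks(std::size_t ne, std::size_t batch) {
  return (ne + batch - 1) / batch;
}

// Stand-in for one compiled tuning of the kernel. The batch size is a
// template parameter so it is a compile-time constant inside the kernel.
template <std::size_t BATCH>
std::size_t run_tuning(std::size_t ne) {
  return num_blocks(ne, BATCH);  // a real tuning would launch this many blocks
}

// Select a tuning at runtime; returns 0 if `batch` is not a compiled tuning.
std::size_t run_with_batch(std::size_t ne, std::size_t batch) {
  switch (batch) {
    case 1:  return run_tuning<1>(ne);
    case 2:  return run_tuning<2>(ne);
    case 4:  return run_tuning<4>(ne);
    case 16: return run_tuning<16>(ne);
    default: return 0;
  }
}
```

Each `case` corresponds to one tuning that could be timed separately on a given GPU.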
| if (valid_e) { | ||
| MASS3DPA_1 | ||
| } |
Instead of checking valid_e inside every loop, what happens if you do a single check, if (!valid_e) return; at the beginning? That's how we have it implemented in MFEM.
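The two variants can be compared in a small host-side model. This is a sketch under stated assumptions: the element index is taken to be `block * TBATCH + tz`, `NUM_STAGES` is a made-up count of guarded stages, and the function names are hypothetical.

```cpp
constexpr int TBATCH = 16;     // batch size from this diff (linear case)
constexpr int NUM_STAGES = 5;  // hypothetical number of guarded kernel stages

// Variant A: every stage is wrapped in `if (valid_e)`, as in this PR.
int stages_run_guarded(int block, int tz, int ne) {
  const int e = block * TBATCH + tz;  // assumed element index mapping
  const bool valid_e = e < ne;
  int done = 0;
  for (int s = 0; s < NUM_STAGES; ++s) {
    if (valid_e) {
      ++done;  // stands in for MASS3DPA_1 and the other stages
    }
    // On the GPU, a barrier such as __syncthreads() would sit here and
    // would be reached by every thread, valid or not.
  }
  return done;
}

// Variant B: a single early return for out-of-range elements.
int stages_run_early_return(int block, int tz, int ne) {
  const int e = block * TBATCH + tz;  // assumed element index mapping
  if (!(e < ne)) return 0;            // equivalent to `if (!valid_e) return;`
  int done = 0;
  for (int s = 0; s < NUM_STAGES; ++s) {
    ++done;  // no per-stage check needed
  }
  return done;
}
```

Both variants do the same work per thread. The difference is that in variant B the out-of-range threads never reach the barriers, so it relies on the barrier only waiting for threads that have not exited, which should be verified for each backend.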
| } | ||
| } | ||
| __syncthreads(); | ||
| GPU_FOREACH_THREAD_INC(dy, y, mpa::D1D, mpa::Q1D) { |
It would be interesting to compare having this be a for loop vs. doing a simple if statement check (blockDim.x and blockDim.y should always be >= Q1D).
This seems to help with the HIP compiler.
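The equivalence behind this suggestion can be checked on the host. The sketch below assumes the macro expands to a strided loop starting at the thread index (the exact expansion of `GPU_FOREACH_THREAD_INC` is not shown in this thread), and `visits_loop` and `visits_if` are hypothetical helper names.

```cpp
// Strided loop: the thread with index `tid` visits tid, tid + block_dim, ...
// while the index stays below n. Returns how many iterations it executes.
int visits_loop(int tid, int block_dim, int n) {
  int count = 0;
  for (int i = tid; i < n; i += block_dim) {
    ++count;
  }
  return count;
}

// Simple check: the thread does the work once if its index is in range.
int visits_if(int tid, int n) {
  return tid < n ? 1 : 0;
}
```

When `block_dim >= n` the loop body runs at most once per thread, so the two forms agree. If `block_dim < n`, for example `block_dim = 2` and `n = 4`, thread 0 runs the loop twice and the `if` form would skip work, which is why the `blockDim >= Q1D` condition matters.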
Summary
- Update the MASS3DPA implementation to use element batching, as was done in the reference implementation a few months ago (Add element batching capabilities to 3D MassIntegrator, mfem/mfem#5299).
- Comparing RAJA_CUDA vs. Laghos with mfem+raja, the runtimes for MASS3DPA and SmemPAMassApply3D_Element are within 0.8% of each other; however, MASS3DPA executes 10% fewer instructions. Maybe it is my configuration of Q1D/D1D?