[WIP] MASS3DPA element batching #693

Draft
michaelmckinsey1 wants to merge 2 commits into develop from mass3dpa-batching

michaelmckinsey1 commented Jun 19, 2026

Contributor

Summary

  • This PR updates the MASS3DPA implementation to use element batching, as was done in the reference implementation a few months ago (mfem/mfem#5299, "Add element batching capabilities to 3D MassIntegrator").
  • Testing RAJA_CUDA against Laghos built with mfem+raja, the runtimes of MASS3DPA and SmemPAMassApply3D_Element are within 0.8% of each other; however, MASS3DPA executes 10% fewer instructions. Maybe this comes from my configuration of Q1D/D1D?
  • Need to update and test the HIP and CPU versions
  • Test Base_CUDA?
  • Update MASSVEC3DPA?
  • Batch size tunings

michaelmckinsey1 self-assigned this Jun 19, 2026
Comment thread src/apps/MASS3DPA.hpp
Comment on lines +175 to +190
// linear
// constexpr RAJA::Index_type D1D = 2;
// constexpr RAJA::Index_type Q1D = 2;
// }
// quadratic
// constexpr RAJA::Index_type D1D = 4;
// constexpr RAJA::Index_type Q1D = 4;
// cubic
// constexpr RAJA::Index_type D1D = 6;
// constexpr RAJA::Index_type Q1D = 6;
constexpr RAJA::Index_type D1D = 2;
constexpr RAJA::Index_type Q1D = 2;

constexpr RAJA::Index_type TBATCH = 16; // linear
// constexpr RAJA::Index_type TBATCH = 2; // quadratic
// constexpr RAJA::Index_type TBATCH = 1; // cubic

Contributor Author

These are the values I observed via Laghos. I'm looking for feedback here, including on whether we should try to have different configurations of MASS3DPA for different orders.

Member

First, what do we want to cover? If we want to do more, I have thought about templating entire kernels to handle things like different orders, and maybe using explicit template instantiation to keep the instantiations in separate files.

It would also be helpful to see how the performance of quadratic compares with and without RAJA since that's what MARBL mainly uses.

artv3 commented Jun 19, 2026

Member

Nice @michaelmckinsey1! What would be neat is to add a batching parameter and allow for different batch sizes; maybe 4 is good for GPU X and 7 is good for GPU Y, that type of thing.

@MrBurmark

Member

> Nice @michaelmckinsey1! What would be neat is to add a batching parameter and allow for different batch sizes; maybe 4 is good for GPU X and 7 is good for GPU Y, that type of thing.

I think he means different batch sizes as tunings.

artv3 commented Jun 19, 2026

Member

> > Nice @michaelmckinsey1! What would be neat is to add a batching parameter and allow for different batch sizes; maybe 4 is good for GPU X and 7 is good for GPU Y, that type of thing.
>
> I think he means different batch sizes as tunings.

yes, sounds good to me!

Comment thread src/apps/MASS3DPA-Cuda.cpp Outdated
Comment on lines +41 to +43
if (valid_e) {
MASS3DPA_1
}

Instead of checking valid_e inside every loop, what happens if you do a single check, if (!valid_e) return;, at the beginning? That's how we have it implemented in MFEM.

Contributor Author

See efaa790

}
}
__syncthreads();
GPU_FOREACH_THREAD_INC(dy, y, mpa::D1D, mpa::Q1D) {

It would be interesting to compare having this be a for loop vs. doing a simple if statement check (blockDim.x and blockDim.y should always be >= Q1D).

This seems to help with the HIP compiler.

4 participants