3x Speedup on GPUs: Checklist #1131

@efaulhaber

Description

With the help of @vchuravy at the TRUDI 2026 hackathon, I managed to get a ~3x speedup on GPUs.
Here is a benchmark of the fluid-fluid interaction with these optimizations, using 4M particles in 2D and 1M in 3D, all with basic WCSPH without density diffusion. The particles use the randomized positions from the benchmark in PointNeighbors.jl and are not located on an unrealistic, perfectly ordered grid. They are sorted by cell index, as obtained after applying the SortingCallback.
"without div_fast" shows the results without #1128.
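To illustrate what "sorted by cell index" means for the benchmark setup, here is a minimal hypothetical sketch: particle columns are reordered by the linear index of the background-grid cell they fall in. The names `sort_by_cell!`, `cell_size`, and `cells_per_row` are illustrative and not TrixiParticles.jl API.

```julia
# Hypothetical sketch of sorting particles by cell index (2D).
# Not TrixiParticles.jl code; names are illustrative.
function sort_by_cell!(coords::Matrix{Float64}, cell_size::Float64, cells_per_row::Int)
    n = size(coords, 2)
    # Linear index of the background-grid cell containing particle i.
    cell(i) = floor(Int, coords[2, i] / cell_size) * cells_per_row +
              floor(Int, coords[1, i] / cell_size) + 1
    perm = sortperm([cell(i) for i in 1:n])
    coords .= coords[:, perm]
    return perm
end

# Two particles in reversed cell order get swapped:
coords = [0.9 0.1; 0.1 0.1]  # columns are particles (x, y)
sort_by_cell!(coords, 0.5, 2)
# coords is now [0.1 0.9; 0.1 0.1]
```

Sorting like this makes neighboring particles in memory likely to share grid cells, which improves cache and memory-coalescing behavior on GPUs.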

| Machine | main 2D | main 3D | all optimizations 2D | all optimizations 3D | without div_fast 2D | without div_fast 3D |
|---|---|---|---|---|---|---|
| AMD EPYC 9654 (96-core CPU) | 31.404 ms | 42.491 ms | 23.118 ms (1.36×) | 32.707 ms (1.30×) | 22.744 ms (1.38×) | 32.943 ms (1.29×) |
| Nvidia H100 FP32 | 4.441 ms | 9.066 ms | 1.652 ms (2.69×) | 2.906 ms (3.12×) | 2.347 ms (1.89×) | 4.651 ms (1.95×) |
| Nvidia H100 FP64 | 8.574 ms | 18.202 ms | 2.884 ms (2.97×) | 5.087 ms (3.58×) | 4.376 ms (1.96×) | 8.486 ms (2.15×) |
| Nvidia RTX A4500 FP32 | 11.840 ms | 23.439 ms | 4.168 ms (2.84×) | 7.699 ms (3.04×) | 5.892 ms (2.01×) | 10.641 ms (2.20×) |
| AMD Instinct MI300A FP32 | 4.414 ms | 9.430 ms | 1.983 ms (2.23×) | 4.397 ms (2.15×) | 2.303 ms (1.92×) | 5.015 ms (1.88×) |
| AMD Instinct MI300A FP64 | 6.430 ms | 15.047 ms | 2.961 ms (2.17×) | 6.531 ms (2.30×) | 3.272 ms (1.97×) | 7.174 ms (2.10×) |
| AMD Instinct MI300A FP32 | 4.414 ms | 9.430 ms | 1.966 ms (2.25×) | 4.403 ms (2.14×) | 2.302 ms (1.92×) | 4.990 ms (1.89×) |
| AMD Instinct MI300A FP64 | 6.430 ms | 15.047 ms | 2.942 ms (2.19×) | 6.379 ms (2.36×) | 3.223 ms (2.00×) | 7.032 ms (2.14×) |

This branch contains all optimizations: https://github.com/efaulhaber/TrixiParticles.jl/tree/performance-fluid-tmp
To make this reviewable, I split these optimizations into several smaller PRs:

I also did the same for the tlsph-tlsph RHS and the TLSPH deformation gradient.

Here are the benchmarks for that with the same problem sizes on an H100. The RHS uses the penalty force and artificial viscosity with a smoothing length factor of 1.5 (instead of a more realistic sqrt(2), because I didn't want to rerun the benchmarks). The vloada optimization combines the 2x2 matrix load in 2D into a single wide load (#1147). In the last two columns, I tried to apply this to 3D as well. For this to work, I had to pad the 3x3 matrix to a length of 16 (only powers of 2 are allowed) and then load either the whole padded matrix at once or the first 8 entries plus a single load for the last one. This is slightly faster with FP32, but significantly slower with FP64.

| | main 2D | main 3D | all optimizations 2D | all optimizations 3D | without vloada 2D | vloada 16 3D | vloada 8 + 1 3D |
|---|---|---|---|---|---|---|---|
| RHS FP32 | 1.986 ms | 4.591 ms | 1.209 ms (1.64×) | 3.012 ms (1.52×) | 1.472 ms (1.35×) | 2.907 ms (1.58×) | 2.885 ms (1.59×) |
| RHS FP64 | 4.178 ms | 7.595 ms | 2.337 ms (1.79×) | 5.061 ms (1.50×) | 2.736 ms (1.53×) | 7.140 ms (1.06×) | 6.017 ms (1.26×) |
| PK1 FP32 | 1.384 ms | 7.948 ms | 0.632 ms (2.19×) | 3.128 ms (2.54×) | | | |
| PK1 FP64 | 2.604 ms | 16.499 ms | 1.048 ms (2.48×) | 5.904 ms (2.79×) | | | |
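The two 3D padding variants can be sketched as follows with SIMD.jl. This is a hedged illustration, not the PR code; it uses `vload` rather than `vloada` because the aligned variant additionally requires suitably aligned storage.

```julia
# Hedged sketch of the 3D padding idea; not the actual PR code.
using SIMD

# Pad each 3x3 matrix (9 entries) to 16 floats so it can be fetched with
# power-of-two-wide vector loads (SIMD vector lengths must be powers of 2).
buf = zeros(Float32, 16)
buf[1:9] .= Float32.(1:9)  # column-major 3x3 matrix; entries 10-16 are padding

# Variant "vloada 16": one 16-wide load of the whole padded matrix
v16 = vload(Vec{16, Float32}, buf, 1)

# Variant "vloada 8 + 1": an 8-wide load plus a scalar load for entry 9
v8 = vload(Vec{8, Float32}, buf, 1)
a33 = buf[9]
```

The trade-off in the table follows from the vector width: with FP64, the 16-wide padded load is twice as many bytes as with FP32, which is presumably why the 16-wide variant degrades much more in FP64.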
