3x Speedup on GPUs: Checklist #1131

@efaulhaber

Description

With the help of @vchuravy at the TRUDI 2026 hackathon, I managed to get a ~3x speedup on GPUs.
Here is a benchmark of the fluid-fluid interaction with these optimizations, using 4M particles in 2D and 1M in 3D, all with basic WCSPH without density diffusion. The particles use the randomized positions from the benchmark in PointNeighbors.jl and are not located on an unrealistic, perfectly ordered grid. They are sorted by cell index, as obtained after applying the SortingCallback.
"without div_fast" shows the results without #1128.
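To illustrate what "sorted by cell index" means for the benchmark setup, here is a minimal hypothetical sketch: particle columns are reordered by the linear index of the background-grid cell they fall in. The names `sort_by_cell!`, `cell_size`, and `cells_per_row` are illustrative and not TrixiParticles.jl API.

```julia
# Hypothetical sketch of sorting particles by cell index (2D).
# Not TrixiParticles.jl code; names are illustrative.
function sort_by_cell!(coords::Matrix{Float64}, cell_size::Float64, cells_per_row::Int)
    n = size(coords, 2)
    # Linear index of the background-grid cell containing particle i.
    cell(i) = floor(Int, coords[2, i] / cell_size) * cells_per_row +
              floor(Int, coords[1, i] / cell_size) + 1
    perm = sortperm([cell(i) for i in 1:n])
    coords .= coords[:, perm]
    return perm
end

# Two particles in reversed cell order get swapped:
coords = [0.9 0.1; 0.1 0.1]  # columns are particles (x, y)
sort_by_cell!(coords, 0.5, 2)
# coords is now [0.1 0.9; 0.1 0.1]
```

Sorting like this makes neighboring particles in memory likely to share grid cells, which improves cache and memory-coalescing behavior on GPUs.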

| Machine | main 2D | main 3D | all optimizations 2D | all optimizations 3D | without div_fast 2D | without div_fast 3D |
|---|---|---|---|---|---|---|
| AMD EPYC 9654 (96-core CPU) | 31.404 ms | 42.491 ms | 23.118 ms (1.36×) | 32.707 ms (1.30×) | 22.744 ms (1.38×) | 32.943 ms (1.29×) |
| Nvidia H100 FP32 | 4.441 ms | 9.066 ms | 1.652 ms (2.69×) | 2.906 ms (3.12×) | 2.347 ms (1.89×) | 4.651 ms (1.95×) |
| Nvidia H100 FP64 | 8.574 ms | 18.202 ms | 2.884 ms (2.97×) | 5.087 ms (3.58×) | 4.376 ms (1.96×) | 8.486 ms (2.15×) |
| Nvidia RTX A4500 FP32 | 11.840 ms | 23.439 ms | 4.168 ms (2.84×) | 7.699 ms (3.04×) | 5.892 ms (2.01×) | 10.641 ms (2.20×) |
| AMD Instinct MI300A FP32 | 4.414 ms | 9.430 ms | 1.983 ms (2.23×) | 4.397 ms (2.15×) | 2.303 ms (1.92×) | 5.015 ms (1.88×) |
| AMD Instinct MI300A FP64 | 6.430 ms | 15.047 ms | 2.961 ms (2.17×) | 6.531 ms (2.30×) | 3.272 ms (1.97×) | 7.174 ms (2.10×) |
| AMD Instinct MI300A FP32 | 4.414 ms | 9.430 ms | 1.966 ms (2.25×) | 4.403 ms (2.14×) | 2.302 ms (1.92×) | 4.990 ms (1.89×) |
| AMD Instinct MI300A FP64 | 6.430 ms | 15.047 ms | 2.942 ms (2.19×) | 6.379 ms (2.36×) | 3.223 ms (2.00×) | 7.032 ms (2.14×) |

This branch contains all optimizations: https://github.com/efaulhaber/TrixiParticles.jl/tree/performance-fluid-tmp
To make this reviewable, I split these optimizations into several smaller PRs:

I also did the same for the tlsph-tlsph RHS and the TLSPH deformation gradient.

Here are the benchmarks for that with the same problem sizes on an H100. The RHS uses the penalty force and artificial viscosity with a smoothing length factor of 1.5 (instead of a more realistic sqrt(2), because I didn't want to rerun the benchmarks). The vloada optimization combines the 2x2 matrix load in 2D into a single wide load (#1147). In the last two columns, I tried to apply this to 3D as well. For this to work, I had to pad the 3x3 matrix to a length of 16 (only powers of 2 are allowed) and then load either the whole padded matrix at once or the first 8 entries plus a single load for the last one. This is slightly faster with FP32, but significantly slower with FP64.

| | main 2D | main 3D | all optimizations 2D | all optimizations 3D | without vloada 2D | vloada 16 3D | vloada 8 + 1 3D |
|---|---|---|---|---|---|---|---|
| RHS FP32 | 1.986 ms | 4.591 ms | 1.209 ms (1.64×) | 3.012 ms (1.52×) | 1.472 ms (1.35×) | 2.907 ms (1.58×) | 2.885 ms (1.59×) |
| RHS FP64 | 4.178 ms | 7.595 ms | 2.337 ms (1.79×) | 5.061 ms (1.50×) | 2.736 ms (1.53×) | 7.140 ms (1.06×) | 6.017 ms (1.26×) |
| PK1 FP32 | 1.384 ms | 7.948 ms | 0.632 ms (2.19×) | 3.128 ms (2.54×) | | | |
| PK1 FP64 | 2.604 ms | 16.499 ms | 1.048 ms (2.48×) | 5.904 ms (2.79×) | | | |
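The two 3D padding variants can be sketched as follows with SIMD.jl. This is a hedged illustration, not the PR code; it uses `vload` rather than `vloada` because the aligned variant additionally requires suitably aligned storage.

```julia
# Hedged sketch of the 3D padding idea; not the actual PR code.
using SIMD

# Pad each 3x3 matrix (9 entries) to 16 floats so it can be fetched with
# power-of-two-wide vector loads (SIMD vector lengths must be powers of 2).
buf = zeros(Float32, 16)
buf[1:9] .= Float32.(1:9)  # column-major 3x3 matrix; entries 10-16 are padding

# Variant "vloada 16": one 16-wide load of the whole padded matrix
v16 = vload(Vec{16, Float32}, buf, 1)

# Variant "vloada 8 + 1": an 8-wide load plus a scalar load for entry 9
v8 = vload(Vec{8, Float32}, buf, 1)
a33 = buf[9]
```

The trade-off in the table follows from the vector width: with FP64, the 16-wide padded load is twice as many bytes as with FP32, which is presumably why the 16-wide variant degrades much more in FP64.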
