Conversation
I made this table to see what is happening with FP32 and FP64:
CUDA.jl Benchmarks
Details
| Benchmark suite | Current: 0c041fa | Previous: 6ccd4b4 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 100964 ns | 101723 ns | 0.99 |
| array/accumulate/Float32/dims=1 | 76402 ns | 76608 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1583852 ns | 1585294 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143663.5 ns | 143948 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 657387 ns | 657945 ns | 1.00 |
| array/accumulate/Int64/1d | 118535 ns | 118967 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79793.5 ns | 79956 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1694430 ns | 1694445 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 155648 ns | 156040 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 961725 ns | 961840 ns | 1.00 |
| array/broadcast | 20554 ns | 20347 ns | 1.01 |
| array/construct | 1290.5 ns | 1311.9 ns | 0.98 |
| array/copy | 19018 ns | 18931 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 218646 ns | 215113 ns | 1.02 |
| array/copyto!/gpu_to_cpu | 286175.5 ns | 283517 ns | 1.01 |
| array/copyto!/gpu_to_gpu | 11459 ns | 11647 ns | 0.98 |
| array/iteration/findall/bool | 132313 ns | 132615 ns | 1.00 |
| array/iteration/findall/int | 149520 ns | 149623 ns | 1.00 |
| array/iteration/findfirst/bool | 82180 ns | 82175 ns | 1.00 |
| array/iteration/findfirst/int | 84533 ns | 84437 ns | 1.00 |
| array/iteration/findmin/1d | 86113 ns | 87647 ns | 0.98 |
| array/iteration/findmin/2d | 117447 ns | 117309 ns | 1.00 |
| array/iteration/logical | 200576.5 ns | 203627.5 ns | 0.99 |
| array/iteration/scalar | 67222 ns | 68729 ns | 0.98 |
| array/permutedims/2d | 52451 ns | 52820 ns | 0.99 |
| array/permutedims/3d | 52927 ns | 52914 ns | 1.00 |
| array/permutedims/4d | 51947 ns | 51983 ns | 1.00 |
| array/random/rand/Float32 | 13137 ns | 13104 ns | 1.00 |
| array/random/rand/Int64 | 30364 ns | 37312 ns | 0.81 |
| array/random/rand!/Float32 | 8540.666666666666 ns | 8603.333333333334 ns | 0.99 |
| array/random/rand!/Int64 | 34445.5 ns | 34156 ns | 1.01 |
| array/random/randn/Float32 | 38562 ns | 38723.5 ns | 1.00 |
| array/random/randn!/Float32 | 31437 ns | 31520 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 34900.5 ns | 35427 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1 | 44447 ns | 49562 ns | 0.90 |
| array/reductions/mapreduce/Float32/dims=1L | 51989 ns | 51766 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 56946 ns | 56838 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 69751.5 ns | 69604 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 43038 ns | 43423 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1 | 42447.5 ns | 44694.5 ns | 0.95 |
| array/reductions/mapreduce/Int64/dims=1L | 87748 ns | 87805 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 59719 ns | 60051.5 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=2L | 84965.5 ns | 85186 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 35216 ns | 35458 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1 | 45397.5 ns | 46307.5 ns | 0.98 |
| array/reductions/reduce/Float32/dims=1L | 52095 ns | 52046 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 57168 ns | 57117 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 70192 ns | 70127.5 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 42924 ns | 41220 ns | 1.04 |
| array/reductions/reduce/Int64/dims=1 | 42493 ns | 51895 ns | 0.82 |
| array/reductions/reduce/Int64/dims=1L | 87747 ns | 87739 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 59737 ns | 59630 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84867 ns | 84743.5 ns | 1.00 |
| array/reverse/1d | 18371 ns | 18349 ns | 1.00 |
| array/reverse/1dL | 68939.5 ns | 68960 ns | 1.00 |
| array/reverse/1dL_inplace | 65976.5 ns | 65909 ns | 1.00 |
| array/reverse/1d_inplace | 10268.666666666666 ns | 8540.833333333332 ns | 1.20 |
| array/reverse/2d | 20918 ns | 20881 ns | 1.00 |
| array/reverse/2dL | 73031 ns | 72996 ns | 1.00 |
| array/reverse/2dL_inplace | 66004 ns | 65926 ns | 1.00 |
| array/reverse/2d_inplace | 11200 ns | 10076 ns | 1.11 |
| array/sorting/1d | 2735406.5 ns | 2735188.5 ns | 1.00 |
| array/sorting/2d | 1068948 ns | 1069206 ns | 1.00 |
| array/sorting/by | 3304547 ns | 3304125.5 ns | 1.00 |
| cuda/synchronization/context/auto | 1187.9 ns | 1176.2 ns | 1.01 |
| cuda/synchronization/context/blocking | 943.6333333333333 ns | 924.5869565217391 ns | 1.02 |
| cuda/synchronization/context/nonblocking | 7155.6 ns | 6942.1 ns | 1.03 |
| cuda/synchronization/stream/auto | 1042.3 ns | 999.9375 ns | 1.04 |
| cuda/synchronization/stream/blocking | 846.5555555555555 ns | 787.7961165048544 ns | 1.07 |
| cuda/synchronization/stream/nonblocking | 8277.4 ns | 7168.6 ns | 1.15 |
| integration/byval/reference | 144027.5 ns | 143982 ns | 1.00 |
| integration/byval/slices=1 | 145912 ns | 145868 ns | 1.00 |
| integration/byval/slices=2 | 284580 ns | 284528 ns | 1.00 |
| integration/byval/slices=3 | 423020 ns | 422970 ns | 1.00 |
| integration/cudadevrt | 102567 ns | 102612 ns | 1.00 |
| integration/volumerhs | 23411584 ns | 9440461 ns | 2.48 |
| kernel/indexing | 13408 ns | 13181 ns | 1.02 |
| kernel/indexing_checked | 14088 ns | 14081 ns | 1.00 |
| kernel/launch | 2213.6666666666665 ns | 2150.777777777778 ns | 1.03 |
| kernel/occupancy | 715.4710144927536 ns | 672 ns | 1.06 |
| kernel/rand | 14495 ns | 14396 ns | 1.01 |
| latency/import | 3831674700.5 ns | 3814290062.5 ns | 1.00 |
| latency/precompile | 4590038439 ns | 4590207670.5 ns | 1.00 |
| latency/ttfp | 4407587085.5 ns | 4409319020 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
@efaulhaber that was on an H100, right?
This is great. How did you get the benchmark numbers? |
This is benchmarking the main kernel of TrixiParticles.jl, which is an SPH neighbor loop computing the forces on particles. There are two divisions in the hot loop, for which I then used the different fast division implementations. |
Co-authored-by: M. A. Kowalski <mak60@cam.ac.uk>
Co-authored-by: Erik Faulhaber <erik.faulhaber@web.de>
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##           master   #3077   +/-  ##
=======================================
  Coverage   90.43%   90.43%
=======================================
  Files         141      141
  Lines       12025    12025
=======================================
  Hits        10875    10875
  Misses       1150     1150
```

View full report in Codecov by Sentry.
Working with @efaulhaber two weeks ago, I was reminded of the slowness of division on Nvidia GPUs.
On top of that, `@fastmath a/b` for Float64 currently just becomes an `fdiv fast`, which then lowers to a normal NVPTX division, and in SASS helpfully turns into a function call.
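The lowering can be checked on the host side with `code_llvm` (a CPU-side sketch with an illustrative function name; the GPU path goes through GPUCompiler, but the fast-math flag on the division is the same):

```julia
# Minimal sketch: @fastmath division carries the LLVM `fast` flag.
fast_divide(a, b) = @fastmath a / b

# Inspecting the IR shows the flagged division, e.g.:
#   julia> code_llvm(fast_divide, Tuple{Float64,Float64})
#   ...
#       ... = fdiv fast double %0, %1
#   ...
```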
@efaulhaber has some numbers for his hot kernel:
Using the simple implementation of `a/b = a * (1/b)`:
did speed his code up, but that might have more to do with the additional code-motion opportunity this affords.
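A minimal sketch of that transformation, in plain CPU Julia with illustrative names (the real kernel is the TrixiParticles.jl neighbor loop):

```julia
# Naive version: one division per iteration of the hot loop.
function scale_div!(out, xs, b)
    @inbounds for i in eachindex(out, xs)
        out[i] = xs[i] / b
    end
    return out
end

# Rewritten as a/b = a * (1/b): the reciprocal of the loop-invariant
# denominator is computed once and hoisted out of the loop, so the hot
# loop contains only multiplications. This hoisting is the extra
# code-motion opportunity the rewrite affords.
function scale_recip!(out, xs, b)
    invb = one(b) / b
    @inbounds for i in eachindex(out, xs)
        out[i] = xs[i] * invb
    end
    return out
end
```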
As an example, NVIDIA Warp uses the `approx.ftz` instruction to obtain a `fast_div` implementation, which, using @efaulhaber's measurements:
But what is the loss of accuracy we are incurring here?
Pretty bad. Meanwhile, Oceananigans is facing a similar problem:
CliMA/Oceananigans.jl#5140, where @Mikolaj-A-Kowalski
is improving the accuracy of
`inv_fast` by performing an additional iteration. @efaulhaber tested this as:
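A sketch of the kind of refinement involved (one Newton-Raphson step on an approximate reciprocal; `inv_approx` here is a toy stand-in for a hardware approximation like PTX `rcp.approx.ftz.f32`, not the actual Oceananigans code):

```julia
# Toy stand-in for a hardware approximate reciprocal: the exact value
# perturbed by a relative error of about 1e-3.
inv_approx(b::Float32) = (1f0 / b) * (1f0 + 1f-3)

# One Newton-Raphson step: given x0 ≈ 1/b, the update
#   x1 = x0 * (2 - b*x0)
# roughly squares the relative error (~1e-3 -> ~1e-6), at the cost of
# only two multiplies and a subtraction.
function inv_refined(b::Float32)
    x0 = inv_approx(b)
    return x0 * (2f0 - b * x0)
end
```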
So the extra iteration comes at a very small additional cost.
But the gain in accuracy is significant: