Add override for muladd and use LLVM intrinsic for fma #3078
Merged
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##           master    #3078   +/-   ##
=======================================
  Coverage   90.31%   90.32%
=======================================
  Files         141      141
  Lines       12165    12165
=======================================
+ Hits        10987    10988       +1
+ Misses       1178     1177       -1
```
Contributor
CUDA.jl Benchmarks
| Benchmark suite | Current: 814c861 | Previous: 5065018 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 100847 ns | 101779 ns | 0.99 |
| array/accumulate/Float32/dims=1 | 75951 ns | 76943 ns | 0.99 |
| array/accumulate/Float32/dims=1L | 1583533.5 ns | 1585563 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143143.5 ns | 144109 ns | 0.99 |
| array/accumulate/Float32/dims=2L | 657140 ns | 658119 ns | 1.00 |
| array/accumulate/Int64/1d | 117891 ns | 118854 ns | 0.99 |
| array/accumulate/Int64/dims=1 | 79528 ns | 80389 ns | 0.99 |
| array/accumulate/Int64/dims=1L | 1694092 ns | 1694708 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 155382.5 ns | 156166 ns | 0.99 |
| array/accumulate/Int64/dims=2L | 961061 ns | 961594 ns | 1.00 |
| array/broadcast | 20462 ns | 20512 ns | 1.00 |
| array/construct | 1352.1 ns | 1320.1 ns | 1.02 |
| array/copy | 18731 ns | 18914 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 213171 ns | 214662 ns | 0.99 |
| array/copyto!/gpu_to_cpu | 282003.5 ns | 285022 ns | 0.99 |
| array/copyto!/gpu_to_gpu | 11363 ns | 11357 ns | 1.00 |
| array/iteration/findall/bool | 131691 ns | 132106 ns | 1.00 |
| array/iteration/findall/int | 148303.5 ns | 148958 ns | 1.00 |
| array/iteration/findfirst/bool | 81780.5 ns | 82321 ns | 0.99 |
| array/iteration/findfirst/int | 84032 ns | 84247.5 ns | 1.00 |
| array/iteration/findmin/1d | 87368.5 ns | 85453 ns | 1.02 |
| array/iteration/findmin/2d | 117082 ns | 117211 ns | 1.00 |
| array/iteration/logical | 201176 ns | 200370 ns | 1.00 |
| array/iteration/scalar | 67299 ns | 69501 ns | 0.97 |
| array/permutedims/2d | 52326 ns | 52561 ns | 1.00 |
| array/permutedims/3d | 52645 ns | 53080 ns | 0.99 |
| array/permutedims/4d | 51785.5 ns | 51880.5 ns | 1.00 |
| array/random/rand/Float32 | 13124 ns | 12980 ns | 1.01 |
| array/random/rand/Int64 | 30189.5 ns | 29766 ns | 1.01 |
| array/random/rand!/Float32 | 8508.666666666666 ns | 8371 ns | 1.02 |
| array/random/rand!/Int64 | 34205 ns | 33996 ns | 1.01 |
| array/random/randn/Float32 | 43914 ns | 38221 ns | 1.15 |
| array/random/randn!/Float32 | 31513 ns | 31486 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 35291 ns | 35499.5 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1 | 39969 ns | 39750 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 52035 ns | 51923 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 56962.5 ns | 56902 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 69588 ns | 70119.5 ns | 0.99 |
| array/reductions/mapreduce/Int64/1d | 43686 ns | 43301 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1 | 47165 ns | 53296 ns | 0.88 |
| array/reductions/mapreduce/Int64/dims=1L | 87717 ns | 87758 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 60012 ns | 60300 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 85240 ns | 85288 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 35267 ns | 35511 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1 | 44657.5 ns | 49667 ns | 0.90 |
| array/reductions/reduce/Float32/dims=1L | 52122 ns | 52100 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 57115 ns | 56986 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 70579 ns | 70384 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 43446 ns | 43476 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1 | 45066 ns | 42765.5 ns | 1.05 |
| array/reductions/reduce/Int64/dims=1L | 87840 ns | 87783 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 60124 ns | 59896 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 85034 ns | 85038 ns | 1.00 |
| array/reverse/1d | 18336 ns | 18408 ns | 1.00 |
| array/reverse/1dL | 68941 ns | 69079 ns | 1.00 |
| array/reverse/1dL_inplace | 65876 ns | 65864 ns | 1.00 |
| array/reverse/1d_inplace | 10287.666666666666 ns | 10281.333333333334 ns | 1.00 |
| array/reverse/2d | 20736 ns | 20739 ns | 1.00 |
| array/reverse/2dL | 72768 ns | 72716 ns | 1.00 |
| array/reverse/2dL_inplace | 66247 ns | 65932 ns | 1.00 |
| array/reverse/2d_inplace | 10422 ns | 10105.5 ns | 1.03 |
| array/sorting/1d | 2734748 ns | 2734769 ns | 1.00 |
| array/sorting/2d | 1068705 ns | 1076074 ns | 0.99 |
| array/sorting/by | 3303594 ns | 3327052 ns | 0.99 |
| cuda/synchronization/context/auto | 1136.8 ns | 1184.7 ns | 0.96 |
| cuda/synchronization/context/blocking | 889.76 ns | 940.969696969697 ns | 0.95 |
| cuda/synchronization/context/nonblocking | 7224.1 ns | 8469.4 ns | 0.85 |
| cuda/synchronization/stream/auto | 984.8125 ns | 1021.1 ns | 0.96 |
| cuda/synchronization/stream/blocking | 788.1730769230769 ns | 832.7631578947369 ns | 0.95 |
| cuda/synchronization/stream/nonblocking | 8193.9 ns | 7582.5 ns | 1.08 |
| integration/byval/reference | 144022 ns | 144056 ns | 1.00 |
| integration/byval/slices=1 | 146092 ns | 145820 ns | 1.00 |
| integration/byval/slices=2 | 284821 ns | 284683 ns | 1.00 |
| integration/byval/slices=3 | 423374 ns | 423445 ns | 1.00 |
| integration/cudadevrt | 102584 ns | 102525.5 ns | 1.00 |
| integration/volumerhs | 23421182.5 ns | 23421325.5 ns | 1.00 |
| kernel/indexing | 13352 ns | 13267 ns | 1.01 |
| kernel/indexing_checked | 14038 ns | 14039 ns | 1.00 |
| kernel/launch | 2088.7 ns | 2141.4444444444443 ns | 0.98 |
| kernel/occupancy | 711.4423076923077 ns | 695.4267515923567 ns | 1.02 |
| kernel/rand | 17657.5 ns | 15508 ns | 1.14 |
| latency/import | 3803701756.5 ns | 3810464780.5 ns | 1.00 |
| latency/precompile | 4590644729 ns | 4591174243 ns | 1.00 |
| latency/ttfp | 4388245288 ns | 4390561943.5 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Member
Sorry, no. This was probably an artifact of parsing what's in libdevice and wrapping everything, instead of a specific choice.
maleadt reviewed Apr 14, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
maleadt approved these changes Apr 15, 2026
Since Julia 0.7 (JuliaLang/julia#22262) we are emitting `muladd(a, b, c)` not as `llvm.fmuladd`, but rather as a sequence of contract-flagged multiply and add instructions. The reason for that is vectorization of potential reductions (which is something we ought to revisit in Base, to see whether it is still worthwhile).
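For context, here is a scalar sketch of why the distinction matters (plain Julia, nothing CUDA-specific; the constants are chosen purely for illustration):

```julia
a, b, c = 0.1, 10.0, -1.0

# Separate multiply and add: the product 0.1 * 10.0 rounds to exactly 1.0,
# so the residual of the rounded 0.1 is lost entirely.
naive = a * b + c            # == 0.0

# A true fma computes a*b + c with a single rounding, preserving it.
single = fma(a, b, c)        # == 2.0^-54

# `muladd` only *permits* contraction: the compiler may produce either of
# the two results above, which is why a torn contract pair is possible.
maybe = muladd(a, b, c)

println((naive, single, maybe))
```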
@efaulhaber had an example where LLVM helpfully performed some code motion, leading to a torn `contract` pair. Manually using an fma yields a performance improvement from 2.908 to 2.765, so ~5% faster. I believe the motivation in Base is not valid on GPUs, since we benefit much more from the emission of fma than from reduction vectorization.
@maleadt, do you recall why we are using `__nv_fma`? LLVM should be able to perform better optimization over the `llvm.fma` intrinsic.
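To illustrate the difference, a minimal sketch of calling the LLVM intrinsics directly (hypothetical free functions, not the actual PR diff; on the device these would be installed via CUDA.jl's override mechanism rather than defined standalone). Unlike an opaque `__nv_fma` call, the intrinsic stays visible to LLVM's optimizer:

```julia
# Hypothetical sketch: route fma through the LLVM intrinsic instead of
# libdevice's __nv_fma.  Julia's `llvmcall` calling convention lets us
# name an LLVM intrinsic directly in a ccall.
llvm_fma(a::Float64, b::Float64, c::Float64) =
    ccall("llvm.fma.f64", llvmcall, Float64, (Float64, Float64, Float64), a, b, c)

# llvm.fmuladd additionally permits the backend to split the operation
# back into a multiply and an add where that is cheaper.
llvm_fmuladd(a::Float64, b::Float64, c::Float64) =
    ccall("llvm.fmuladd.f64", llvmcall, Float64, (Float64, Float64, Float64), a, b, c)

# The intrinsic agrees with Base.fma on the same inputs.
println(llvm_fma(0.1, 10.0, -1.0) == fma(0.1, 10.0, -1.0))
```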