
Add override for muladd and use LLVM intrinsic for fma#3078

Merged
maleadt merged 3 commits into master from vc/faster_muladd on Apr 15, 2026

Conversation

@vchuravy (Member) commented Apr 2, 2026

Since Julia 0.7 (JuliaLang/julia#22262) we have been emitting `muladd(a, b, c)` not as `llvm.fmuladd`, but rather as the sequence:

%t = fmul contract %a %b
%r = fadd contract %t %c
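A minimal CPU-side sketch (not CUDA-specific) of the semantics at play: `muladd` permits, but does not require, contraction into a fused multiply-add, so its result equals the plain product-plus-sum for exactly representable inputs.

```julia
# `muladd` may be contracted into an fma, or lowered as the
# `contract`-flagged fmul/fadd pair shown above.
f(a, b, c) = muladd(a, b, c)

# The emitted IR can be inspected with:
#   using InteractiveUtils; code_llvm(f, (Float32, Float32, Float32))
```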

The reason is to enable vectorization of a potential reduction
(something we ought to re-investigate in Base, to see whether it is still worthwhile).

@efaulhaber had an example where

h = ...
for ...
  distance ...
  muladd(epsilon, h^2, distance^2)

and LLVM helpfully performed some code motion:

h = ...
t = mul(epsilon, h^2)
for ...
   add(t, distance^2)
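The shape of that loop can be mimicked on the CPU (a hypothetical sketch; the names `epsilon`, `h`, and `distances` follow the pseudocode above, not the actual kernel): because `epsilon * h^2` is loop-invariant, LLVM is free to hoist the `fmul` out of the loop, leaving a lone `fadd` inside.

```julia
function sum_kernel(epsilon::Float32, h::Float32, distances)
    s = 0.0f0
    for d in distances
        # epsilon * h^2 is loop-invariant: LLVM may hoist the fmul out
        # of the loop, leaving only the fadd here and tearing the
        # contract pair so no fma can be formed.
        s += muladd(epsilon, h^2, d^2)
    end
    return s
end
```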

This leads to a torn contract pair:

%69 = fmul contract float %"f::#parallel_foreach##10#parallel_foreach##11.fca.0.4.5.2.extract", %68
br label %L619, !dbg !197
;...
 %141 = fadd contract float %69, %140, !dbg !459

Manually using an fma improves performance from 2.908 to 2.765, i.e. ~5% faster.

I believe the motivation in Base does not apply on GPUs, since we benefit much more from emitting fma
than from reduction vectorization.

@maleadt do you recall why we are using __nv_fma? LLVM should be able to perform better optimization over the llvm.fma intrinsic.
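For reference, one way such an override could look (a hypothetical sketch, not the PR's actual code: the real change goes through CUDA.jl's device-override machinery, and `llvm.fma.f32` below is simply one route to the LLVM intrinsic from Julia):

```julia
# Hypothetical sketch: reach the LLVM fma intrinsic directly via Julia's
# intrinsic-ccall form, rather than libdevice's __nv_fma.
fma_f32(a::Float32, b::Float32, c::Float32) =
    ccall("llvm.fma.f32", llvmcall, Float32, (Float32, Float32, Float32), a, b, c)
```

Going through the intrinsic keeps the operation visible to LLVM's own optimizations, whereas an opaque libdevice call is only simplified after inlining.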

@codecov codecov bot commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.32%. Comparing base (5065018) to head (814c861).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3078   +/-   ##
=======================================
  Coverage   90.31%   90.32%           
=======================================
  Files         141      141           
  Lines       12165    12165           
=======================================
+ Hits        10987    10988    +1     
+ Misses       1178     1177    -1     

☔ View full report in Codecov by Sentry.

@github-actions bot commented

CUDA.jl Benchmarks

Details
Benchmark suite Current: 814c861 Previous: 5065018 Ratio
array/accumulate/Float32/1d 100847 ns 101779 ns 0.99
array/accumulate/Float32/dims=1 75951 ns 76943 ns 0.99
array/accumulate/Float32/dims=1L 1583533.5 ns 1585563 ns 1.00
array/accumulate/Float32/dims=2 143143.5 ns 144109 ns 0.99
array/accumulate/Float32/dims=2L 657140 ns 658119 ns 1.00
array/accumulate/Int64/1d 117891 ns 118854 ns 0.99
array/accumulate/Int64/dims=1 79528 ns 80389 ns 0.99
array/accumulate/Int64/dims=1L 1694092 ns 1694708 ns 1.00
array/accumulate/Int64/dims=2 155382.5 ns 156166 ns 0.99
array/accumulate/Int64/dims=2L 961061 ns 961594 ns 1.00
array/broadcast 20462 ns 20512 ns 1.00
array/construct 1352.1 ns 1320.1 ns 1.02
array/copy 18731 ns 18914 ns 0.99
array/copyto!/cpu_to_gpu 213171 ns 214662 ns 0.99
array/copyto!/gpu_to_cpu 282003.5 ns 285022 ns 0.99
array/copyto!/gpu_to_gpu 11363 ns 11357 ns 1.00
array/iteration/findall/bool 131691 ns 132106 ns 1.00
array/iteration/findall/int 148303.5 ns 148958 ns 1.00
array/iteration/findfirst/bool 81780.5 ns 82321 ns 0.99
array/iteration/findfirst/int 84032 ns 84247.5 ns 1.00
array/iteration/findmin/1d 87368.5 ns 85453 ns 1.02
array/iteration/findmin/2d 117082 ns 117211 ns 1.00
array/iteration/logical 201176 ns 200370 ns 1.00
array/iteration/scalar 67299 ns 69501 ns 0.97
array/permutedims/2d 52326 ns 52561 ns 1.00
array/permutedims/3d 52645 ns 53080 ns 0.99
array/permutedims/4d 51785.5 ns 51880.5 ns 1.00
array/random/rand/Float32 13124 ns 12980 ns 1.01
array/random/rand/Int64 30189.5 ns 29766 ns 1.01
array/random/rand!/Float32 8508.666666666666 ns 8371 ns 1.02
array/random/rand!/Int64 34205 ns 33996 ns 1.01
array/random/randn/Float32 43914 ns 38221 ns 1.15
array/random/randn!/Float32 31513 ns 31486 ns 1.00
array/reductions/mapreduce/Float32/1d 35291 ns 35499.5 ns 0.99
array/reductions/mapreduce/Float32/dims=1 39969 ns 39750 ns 1.01
array/reductions/mapreduce/Float32/dims=1L 52035 ns 51923 ns 1.00
array/reductions/mapreduce/Float32/dims=2 56962.5 ns 56902 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 69588 ns 70119.5 ns 0.99
array/reductions/mapreduce/Int64/1d 43686 ns 43301 ns 1.01
array/reductions/mapreduce/Int64/dims=1 47165 ns 53296 ns 0.88
array/reductions/mapreduce/Int64/dims=1L 87717 ns 87758 ns 1.00
array/reductions/mapreduce/Int64/dims=2 60012 ns 60300 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 85240 ns 85288 ns 1.00
array/reductions/reduce/Float32/1d 35267 ns 35511 ns 0.99
array/reductions/reduce/Float32/dims=1 44657.5 ns 49667 ns 0.90
array/reductions/reduce/Float32/dims=1L 52122 ns 52100 ns 1.00
array/reductions/reduce/Float32/dims=2 57115 ns 56986 ns 1.00
array/reductions/reduce/Float32/dims=2L 70579 ns 70384 ns 1.00
array/reductions/reduce/Int64/1d 43446 ns 43476 ns 1.00
array/reductions/reduce/Int64/dims=1 45066 ns 42765.5 ns 1.05
array/reductions/reduce/Int64/dims=1L 87840 ns 87783 ns 1.00
array/reductions/reduce/Int64/dims=2 60124 ns 59896 ns 1.00
array/reductions/reduce/Int64/dims=2L 85034 ns 85038 ns 1.00
array/reverse/1d 18336 ns 18408 ns 1.00
array/reverse/1dL 68941 ns 69079 ns 1.00
array/reverse/1dL_inplace 65876 ns 65864 ns 1.00
array/reverse/1d_inplace 10287.666666666666 ns 10281.333333333334 ns 1.00
array/reverse/2d 20736 ns 20739 ns 1.00
array/reverse/2dL 72768 ns 72716 ns 1.00
array/reverse/2dL_inplace 66247 ns 65932 ns 1.00
array/reverse/2d_inplace 10422 ns 10105.5 ns 1.03
array/sorting/1d 2734748 ns 2734769 ns 1.00
array/sorting/2d 1068705 ns 1076074 ns 0.99
array/sorting/by 3303594 ns 3327052 ns 0.99
cuda/synchronization/context/auto 1136.8 ns 1184.7 ns 0.96
cuda/synchronization/context/blocking 889.76 ns 940.969696969697 ns 0.95
cuda/synchronization/context/nonblocking 7224.1 ns 8469.4 ns 0.85
cuda/synchronization/stream/auto 984.8125 ns 1021.1 ns 0.96
cuda/synchronization/stream/blocking 788.1730769230769 ns 832.7631578947369 ns 0.95
cuda/synchronization/stream/nonblocking 8193.9 ns 7582.5 ns 1.08
integration/byval/reference 144022 ns 144056 ns 1.00
integration/byval/slices=1 146092 ns 145820 ns 1.00
integration/byval/slices=2 284821 ns 284683 ns 1.00
integration/byval/slices=3 423374 ns 423445 ns 1.00
integration/cudadevrt 102584 ns 102525.5 ns 1.00
integration/volumerhs 23421182.5 ns 23421325.5 ns 1.00
kernel/indexing 13352 ns 13267 ns 1.01
kernel/indexing_checked 14038 ns 14039 ns 1.00
kernel/launch 2088.7 ns 2141.4444444444443 ns 0.98
kernel/occupancy 711.4423076923077 ns 695.4267515923567 ns 1.02
kernel/rand 17657.5 ns 15508 ns 1.14
latency/import 3803701756.5 ns 3810464780.5 ns 1.00
latency/precompile 4590644729 ns 4591174243 ns 1.00
latency/ttfp 4388245288 ns 4390561943.5 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@maleadt (Member) commented Apr 9, 2026

> @maleadt do you recall why we are using __nv_fma? LLVM should be able to perform better optimization over the llvm.fma intrinsic.

Sorry, no. This was probably an artifact of parsing what's in libdevice and wrapping everything, instead of a specific choice.
That said, doesn't __nv_fma expand to LLVM's FMA after libdevice inlining?

@vchuravy vchuravy requested a review from gbaraldi April 13, 2026 14:08
Comment thread on test/core/codegen.jl
@maleadt maleadt merged commit 5e1edf1 into master Apr 15, 2026
2 checks passed
@maleadt maleadt deleted the vc/faster_muladd branch April 15, 2026 04:46