
Add override for muladd and use LLVM intrinsic for fma#3078

Merged
maleadt merged 3 commits into master from vc/faster_muladd on Apr 15, 2026

Conversation

@vchuravy (Member) commented Apr 2, 2026

Since Julia 0.7 (JuliaLang/julia#22262) we have been emitting `muladd(a, b, c)` not as `llvm.fmuladd`, but rather as the sequence:

%t = fmul contract %a %b
%r = fadd contract %t %c
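A minimal CPU-side sketch (not CUDA-specific) of the semantics at play: `muladd` permits, but does not require, contraction into a fused multiply-add, so its result equals the plain product-plus-sum for exactly representable inputs.

```julia
# `muladd` may be contracted into an fma, or lowered as the
# `contract`-flagged fmul/fadd pair shown above.
f(a, b, c) = muladd(a, b, c)

# The emitted IR can be inspected with:
#   using InteractiveUtils; code_llvm(f, (Float32, Float32, Float32))
```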

The reason is to enable vectorization of a potential reduction
(something we ought to re-investigate in Base, to see whether it is still worthwhile).

@efaulhaber had an example where

h = ...
for ...
  distance ...
  muladd(epsilon, h^2, distance^2)

and LLVM helpfully performed some code motion:

h = ...
t = mul(epsilon, h^2)
for ...
   add(t, distance^2)
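The shape of that loop can be mimicked on the CPU (a hypothetical sketch; the names `epsilon`, `h`, and `distances` follow the pseudocode above, not the actual kernel): because `epsilon * h^2` is loop-invariant, LLVM is free to hoist the `fmul` out of the loop, leaving a lone `fadd` inside.

```julia
function sum_kernel(epsilon::Float32, h::Float32, distances)
    s = 0.0f0
    for d in distances
        # epsilon * h^2 is loop-invariant: LLVM may hoist the fmul out
        # of the loop, leaving only the fadd here and tearing the
        # contract pair so no fma can be formed.
        s += muladd(epsilon, h^2, d^2)
    end
    return s
end
```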

This leads to a torn contract pair:

%69 = fmul contract float %"f::#parallel_foreach##10#parallel_foreach##11.fca.0.4.5.2.extract", %68
br label %L619, !dbg !197
;...
 %141 = fadd contract float %69, %140, !dbg !459

Manually using an fma improves performance from 2.908 to 2.765, i.e. ~5% faster.

I believe the motivation in Base does not apply on GPUs, since we benefit much more from emitting fma
than from reduction vectorization.

@maleadt do you recall why we are using __nv_fma? LLVM should be able to perform better optimization over the llvm.fma intrinsic.
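For reference, one way such an override could look (a hypothetical sketch, not the PR's actual code: the real change goes through CUDA.jl's device-override machinery, and `llvm.fma.f32` below is simply one route to the LLVM intrinsic from Julia):

```julia
# Hypothetical sketch: reach the LLVM fma intrinsic directly via Julia's
# intrinsic-ccall form, rather than libdevice's __nv_fma.
fma_f32(a::Float32, b::Float32, c::Float32) =
    ccall("llvm.fma.f32", llvmcall, Float32, (Float32, Float32, Float32), a, b, c)
```

Going through the intrinsic keeps the operation visible to LLVM's own optimizations, whereas an opaque libdevice call is only simplified after inlining.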

@codecov codecov bot commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.32%. Comparing base (5065018) to head (814c861).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3078   +/-   ##
=======================================
  Coverage   90.31%   90.32%           
=======================================
  Files         141      141           
  Lines       12165    12165           
=======================================
+ Hits        10987    10988    +1     
+ Misses       1178     1177    -1     

☔ View full report in Codecov by Sentry.

@github-actions bot commented

CUDA.jl Benchmarks

Details
Benchmark suite Current: 814c861 Previous: 5065018 Ratio
array/accumulate/Float32/1d 100847 ns 101779 ns 0.99
array/accumulate/Float32/dims=1 75951 ns 76943 ns 0.99
array/accumulate/Float32/dims=1L 1583533.5 ns 1585563 ns 1.00
array/accumulate/Float32/dims=2 143143.5 ns 144109 ns 0.99
array/accumulate/Float32/dims=2L 657140 ns 658119 ns 1.00
array/accumulate/Int64/1d 117891 ns 118854 ns 0.99
array/accumulate/Int64/dims=1 79528 ns 80389 ns 0.99
array/accumulate/Int64/dims=1L 1694092 ns 1694708 ns 1.00
array/accumulate/Int64/dims=2 155382.5 ns 156166 ns 0.99
array/accumulate/Int64/dims=2L 961061 ns 961594 ns 1.00
array/broadcast 20462 ns 20512 ns 1.00
array/construct 1352.1 ns 1320.1 ns 1.02
array/copy 18731 ns 18914 ns 0.99
array/copyto!/cpu_to_gpu 213171 ns 214662 ns 0.99
array/copyto!/gpu_to_cpu 282003.5 ns 285022 ns 0.99
array/copyto!/gpu_to_gpu 11363 ns 11357 ns 1.00
array/iteration/findall/bool 131691 ns 132106 ns 1.00
array/iteration/findall/int 148303.5 ns 148958 ns 1.00
array/iteration/findfirst/bool 81780.5 ns 82321 ns 0.99
array/iteration/findfirst/int 84032 ns 84247.5 ns 1.00
array/iteration/findmin/1d 87368.5 ns 85453 ns 1.02
array/iteration/findmin/2d 117082 ns 117211 ns 1.00
array/iteration/logical 201176 ns 200370 ns 1.00
array/iteration/scalar 67299 ns 69501 ns 0.97
array/permutedims/2d 52326 ns 52561 ns 1.00
array/permutedims/3d 52645 ns 53080 ns 0.99
array/permutedims/4d 51785.5 ns 51880.5 ns 1.00
array/random/rand/Float32 13124 ns 12980 ns 1.01
array/random/rand/Int64 30189.5 ns 29766 ns 1.01
array/random/rand!/Float32 8508.666666666666 ns 8371 ns 1.02
array/random/rand!/Int64 34205 ns 33996 ns 1.01
array/random/randn/Float32 43914 ns 38221 ns 1.15
array/random/randn!/Float32 31513 ns 31486 ns 1.00
array/reductions/mapreduce/Float32/1d 35291 ns 35499.5 ns 0.99
array/reductions/mapreduce/Float32/dims=1 39969 ns 39750 ns 1.01
array/reductions/mapreduce/Float32/dims=1L 52035 ns 51923 ns 1.00
array/reductions/mapreduce/Float32/dims=2 56962.5 ns 56902 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 69588 ns 70119.5 ns 0.99
array/reductions/mapreduce/Int64/1d 43686 ns 43301 ns 1.01
array/reductions/mapreduce/Int64/dims=1 47165 ns 53296 ns 0.88
array/reductions/mapreduce/Int64/dims=1L 87717 ns 87758 ns 1.00
array/reductions/mapreduce/Int64/dims=2 60012 ns 60300 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 85240 ns 85288 ns 1.00
array/reductions/reduce/Float32/1d 35267 ns 35511 ns 0.99
array/reductions/reduce/Float32/dims=1 44657.5 ns 49667 ns 0.90
array/reductions/reduce/Float32/dims=1L 52122 ns 52100 ns 1.00
array/reductions/reduce/Float32/dims=2 57115 ns 56986 ns 1.00
array/reductions/reduce/Float32/dims=2L 70579 ns 70384 ns 1.00
array/reductions/reduce/Int64/1d 43446 ns 43476 ns 1.00
array/reductions/reduce/Int64/dims=1 45066 ns 42765.5 ns 1.05
array/reductions/reduce/Int64/dims=1L 87840 ns 87783 ns 1.00
array/reductions/reduce/Int64/dims=2 60124 ns 59896 ns 1.00
array/reductions/reduce/Int64/dims=2L 85034 ns 85038 ns 1.00
array/reverse/1d 18336 ns 18408 ns 1.00
array/reverse/1dL 68941 ns 69079 ns 1.00
array/reverse/1dL_inplace 65876 ns 65864 ns 1.00
array/reverse/1d_inplace 10287.666666666666 ns 10281.333333333334 ns 1.00
array/reverse/2d 20736 ns 20739 ns 1.00
array/reverse/2dL 72768 ns 72716 ns 1.00
array/reverse/2dL_inplace 66247 ns 65932 ns 1.00
array/reverse/2d_inplace 10422 ns 10105.5 ns 1.03
array/sorting/1d 2734748 ns 2734769 ns 1.00
array/sorting/2d 1068705 ns 1076074 ns 0.99
array/sorting/by 3303594 ns 3327052 ns 0.99
cuda/synchronization/context/auto 1136.8 ns 1184.7 ns 0.96
cuda/synchronization/context/blocking 889.76 ns 940.969696969697 ns 0.95
cuda/synchronization/context/nonblocking 7224.1 ns 8469.4 ns 0.85
cuda/synchronization/stream/auto 984.8125 ns 1021.1 ns 0.96
cuda/synchronization/stream/blocking 788.1730769230769 ns 832.7631578947369 ns 0.95
cuda/synchronization/stream/nonblocking 8193.9 ns 7582.5 ns 1.08
integration/byval/reference 144022 ns 144056 ns 1.00
integration/byval/slices=1 146092 ns 145820 ns 1.00
integration/byval/slices=2 284821 ns 284683 ns 1.00
integration/byval/slices=3 423374 ns 423445 ns 1.00
integration/cudadevrt 102584 ns 102525.5 ns 1.00
integration/volumerhs 23421182.5 ns 23421325.5 ns 1.00
kernel/indexing 13352 ns 13267 ns 1.01
kernel/indexing_checked 14038 ns 14039 ns 1.00
kernel/launch 2088.7 ns 2141.4444444444443 ns 0.98
kernel/occupancy 711.4423076923077 ns 695.4267515923567 ns 1.02
kernel/rand 17657.5 ns 15508 ns 1.14
latency/import 3803701756.5 ns 3810464780.5 ns 1.00
latency/precompile 4590644729 ns 4591174243 ns 1.00
latency/ttfp 4388245288 ns 4390561943.5 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@maleadt (Member) commented Apr 9, 2026

> @maleadt do you recall why we are using __nv_fma? LLVM should be able to perform better optimization over the llvm.fma intrinsic.

Sorry, no. This was probably an artifact of parsing what's in libdevice and wrapping everything, instead of a specific choice.
That said, doesn't __nv_fma expand to LLVM's FMA after libdevice inlining?

@vchuravy vchuravy requested a review from gbaraldi April 13, 2026 14:08
Comment thread on test/core/codegen.jl
@maleadt maleadt merged commit 5e1edf1 into master Apr 15, 2026
2 checks passed
@maleadt maleadt deleted the vc/faster_muladd branch April 15, 2026 04:46