
Implement fast_div using fast rcp #3077

Merged
maleadt merged 2 commits into master from vc/faster_div
Apr 14, 2026

Conversation

@vchuravy (Member) commented Apr 2, 2026

Working with @efaulhaber two weeks ago, I was reminded of the slowness of division on Nvidia GPUs.

On top of that, @fastmath a/b for Float64 currently just becomes an fdiv fast, which then lowers to a normal NVPTX division, which SASS helpfully turns into a function call.

@efaulhaber has some numbers for his hot kernel:

# fdiv double %143, %144, !dbg !528
# 10.159 ms
# return x / y

# fdiv fast double %143, %144, !dbg !530
# 10.114 ms
# return Base.FastMath.div_fast(x, y)

Using the simple implementation of a/b = a * 1/b:

# fdiv double 1.000000e+00, %145, !dbg !530
# fmul double %144, %146, !dbg !533
# 6.878 ms
# return x * (1 / y)

# fdiv double 1.000000e+00, %145, !dbg !532
# fmul double %144, %146, !dbg !535
# 6.852 ms
# return x * inv(y)

did speed his code up, but that might have more to do with the additional code-motion opportunities this affords.
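
To illustrate the code-motion point, here is a minimal Python sketch (not code from this PR, with made-up data): once the division is rewritten as a multiplication by a reciprocal, the reciprocal of a loop-invariant divisor can be hoisted out of the loop, leaving only a multiply per iteration.

```python
# Minimal sketch (not from this PR): rewriting a/b as a * (1/b)
# lets the compiler (or the programmer) hoist the expensive
# reciprocal out of a loop when the divisor is loop-invariant.

def divide_each(xs, y):
    # naive: one division per element
    return [x / y for x in xs]

def divide_each_hoisted(xs, y):
    inv_y = 1.0 / y  # computed once, outside the loop
    return [x * inv_y for x in xs]

xs = [1.0, 2.0, 3.0, 4.0]
a = divide_each(xs, 8.0)
b = divide_each_hoisted(xs, 8.0)
# for a power-of-two divisor both agree exactly: [0.125, 0.25, 0.375, 0.5]
```

For non-power-of-two divisors the two forms can differ by a rounding error in the last bit, which is why LLVM only performs this rewrite under the `arcp`/`fast` flags.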

As an example, NVIDIA Warp uses the approx.ftz instruction to obtain a fast_div implementation.

Using @efaulhaber's measurements:

# call double @llvm.nvvm.rcp.approx.ftz.d(double %118), !dbg !413
# fmul double %117, %119, !dbg !418
# 4.758 ms
# return x * fast_inv_cuda_nofma(y)

But what is the loss of accuracy we are incurring here?

julia> y2 = CUDA.rand(Float64, 100_000);
julia> maximum(inv.(y2) .- fast_inv_cuda_nofma.(y2))
0.007475190221157391

# Without numbers close to zero
julia> y2 = CUDA.rand(Float64, 100_000) .+ 0.1;
julia> maximum(inv.(y2) .- fast_inv_cuda_nofma.(y2))
7.105216770497691e-6

Pretty bad. Meanwhile, Oceananigans is facing a similar problem in CliMA/Oceananigans.jl#5140, where @Mikolaj-A-Kowalski is improving the accuracy of inv_fast by performing an additional iteration.

@efaulhaber tested this as:

# call double @llvm.nvvm.rcp.approx.ftz.d(double %118), !dbg !413
# fneg double %118, !dbg !418
# call double @llvm.fma.f64(double %119, double %120, double 1.000000e+00), !dbg !420
# call double @llvm.fma.f64(double %121, double %121, double %121), !dbg !422
# call double @llvm.fma.f64(double %122, double %119, double %119), !dbg !424
# fmul double %117, %123, !dbg !426
# 4.844 ms
# return x * fast_inv_cuda(y)

So a very small additional cost.

But the gain in accuracy is significant:

julia> maximum(inv.(y2) .- fast_inv_cuda.(y2))
7.105427357601002e-15
julia> maximum(inv.(y2) .- fast_inv_cuda.(y2))
8.881784197001252e-16
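
The three-fma sequence above is a polynomial correction of the approximate reciprocal: with r ≈ 1/y and e = 1 − y·r, it computes r·(1 + e + e²), which turns a relative error of e into e³. A Python sketch of the same arithmetic (a hypothetical illustration, using plain floats rather than hardware fma, which is enough to show the error reduction; the actual implementation relies on fused fma to reach full double precision):

```python
def refine_reciprocal(y, r):
    """One correction step mirroring the fma sequence above:
         e = 1 - y*r       # fma(-y, r, 1.0)
         c = e + e*e       # fma(e, e, e)
         r' = r + r*c      # fma(c, r, r)
       giving r' = r*(1 + e + e^2), so the relative error e becomes e^3."""
    e = 1.0 - y * r
    c = e + e * e
    return r + r * c

# demo with a deliberately crude initial guess for 1/3
y = 3.0
r0 = 0.3
r1 = refine_reciprocal(y, r0)
# |r0 - 1/3| ~ 3.3e-2, |r1 - 1/3| ~ 3.3e-4: the e^3 behavior in action
```

Starting from the much more accurate rcp.approx.ftz.d result instead of a crude guess, one such step is what closes most of the gap to a correctly rounded inv, as the measurements above show.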

@vchuravy (Member, Author) commented Apr 2, 2026

@lcw this might interest you, since we were trying things along these lines in the volumerhs benchmark:

let (jlf, f) = (:div_arcp, :div)
    for (T, llvmT) in ((:Float32, "float"), (:Float64, "double"))
        ir = """
            %x = f$f fast $llvmT %0, %1
            ret $llvmT %x
            """
        @eval begin
            # the @pure is necessary so that we can constant propagate.
            @inline Base.@pure function $jlf(a::$T, b::$T)
                Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
            end
        end
    end
    @eval function $jlf(args...)
        Base.$jlf(args...)
    end
end
rcp(x) = div_arcp(one(x), x) # still leads to rcp.rn, which is also a function call
# div_fast(x::Float32, y::Float32) = ccall("extern __nv_fast_fdividef", llvmcall, Cfloat, (Cfloat, Cfloat), x, y)
# rcp(x) = div_fast(one(x), x)

@efaulhaber (Contributor) commented:

I made this table to see what is happening with FP32 and FP64:

| Variant | LLVM operation (FP32) | Runtime (FP32) | LLVM operation (FP64) | Runtime (FP64) |
| --- | --- | --- | --- | --- |
| `x / y` | `fdiv float %143, %144` | 4.894 ms | `fdiv double %143, %144` | 10.159 ms |
| `x * (1 / y)` | `fdiv float 1.000000e+00, %145`<br>`fmul float %144, %146` | 4.227 ms | `fdiv double 1.000000e+00, %145`<br>`fmul double %144, %146` | 6.878 ms |
| `x * inv(y)` | `call float @llvm.nvvm.rcp.rn.f(float %118)`<br>`fmul float %117, %119` | 3.620 ms | `fdiv double 1.000000e+00, %145`<br>`fmul double %144, %146` | 6.852 ms |
| `x * fast_inv_cuda(y)` | `call float @llvm.nvvm.rcp.approx.ftz.f(float %118)`<br>`fmul float %117, %119` | 3.281 ms | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118)`<br>`fneg double %118`<br>`call double @llvm.fma.f64(double %119, double %120, double 1.000000e+00)`<br>`call double @llvm.fma.f64(double %121, double %121, double %121)`<br>`call double @llvm.fma.f64(double %122, double %119, double %119)`<br>`fmul double %117, %123` | 4.844 ms |
| `Base.FastMath.div_fast(x, y)` | `call float @llvm.nvvm.div.approx.f(float %118, float %115)` | 3.114 ms | `fdiv fast double %143, %144` | 10.114 ms |
| `x * fast_inv_cuda_nofma(y)` | | | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118)`<br>`fmul double %117, %119` | 4.758 ms |

@github-actions Bot left a comment

CUDA.jl Benchmarks

Details
Benchmark suite Current: 0c041fa Previous: 6ccd4b4 Ratio
array/accumulate/Float32/1d 100964 ns 101723 ns 0.99
array/accumulate/Float32/dims=1 76402 ns 76608 ns 1.00
array/accumulate/Float32/dims=1L 1583852 ns 1585294 ns 1.00
array/accumulate/Float32/dims=2 143663.5 ns 143948 ns 1.00
array/accumulate/Float32/dims=2L 657387 ns 657945 ns 1.00
array/accumulate/Int64/1d 118535 ns 118967 ns 1.00
array/accumulate/Int64/dims=1 79793.5 ns 79956 ns 1.00
array/accumulate/Int64/dims=1L 1694430 ns 1694445 ns 1.00
array/accumulate/Int64/dims=2 155648 ns 156040 ns 1.00
array/accumulate/Int64/dims=2L 961725 ns 961840 ns 1.00
array/broadcast 20554 ns 20347 ns 1.01
array/construct 1290.5 ns 1311.9 ns 0.98
array/copy 19018 ns 18931 ns 1.00
array/copyto!/cpu_to_gpu 218646 ns 215113 ns 1.02
array/copyto!/gpu_to_cpu 286175.5 ns 283517 ns 1.01
array/copyto!/gpu_to_gpu 11459 ns 11647 ns 0.98
array/iteration/findall/bool 132313 ns 132615 ns 1.00
array/iteration/findall/int 149520 ns 149623 ns 1.00
array/iteration/findfirst/bool 82180 ns 82175 ns 1.00
array/iteration/findfirst/int 84533 ns 84437 ns 1.00
array/iteration/findmin/1d 86113 ns 87647 ns 0.98
array/iteration/findmin/2d 117447 ns 117309 ns 1.00
array/iteration/logical 200576.5 ns 203627.5 ns 0.99
array/iteration/scalar 67222 ns 68729 ns 0.98
array/permutedims/2d 52451 ns 52820 ns 0.99
array/permutedims/3d 52927 ns 52914 ns 1.00
array/permutedims/4d 51947 ns 51983 ns 1.00
array/random/rand/Float32 13137 ns 13104 ns 1.00
array/random/rand/Int64 30364 ns 37312 ns 0.81
array/random/rand!/Float32 8540.666666666666 ns 8603.333333333334 ns 0.99
array/random/rand!/Int64 34445.5 ns 34156 ns 1.01
array/random/randn/Float32 38562 ns 38723.5 ns 1.00
array/random/randn!/Float32 31437 ns 31520 ns 1.00
array/reductions/mapreduce/Float32/1d 34900.5 ns 35427 ns 0.99
array/reductions/mapreduce/Float32/dims=1 44447 ns 49562 ns 0.90
array/reductions/mapreduce/Float32/dims=1L 51989 ns 51766 ns 1.00
array/reductions/mapreduce/Float32/dims=2 56946 ns 56838 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 69751.5 ns 69604 ns 1.00
array/reductions/mapreduce/Int64/1d 43038 ns 43423 ns 0.99
array/reductions/mapreduce/Int64/dims=1 42447.5 ns 44694.5 ns 0.95
array/reductions/mapreduce/Int64/dims=1L 87748 ns 87805 ns 1.00
array/reductions/mapreduce/Int64/dims=2 59719 ns 60051.5 ns 0.99
array/reductions/mapreduce/Int64/dims=2L 84965.5 ns 85186 ns 1.00
array/reductions/reduce/Float32/1d 35216 ns 35458 ns 0.99
array/reductions/reduce/Float32/dims=1 45397.5 ns 46307.5 ns 0.98
array/reductions/reduce/Float32/dims=1L 52095 ns 52046 ns 1.00
array/reductions/reduce/Float32/dims=2 57168 ns 57117 ns 1.00
array/reductions/reduce/Float32/dims=2L 70192 ns 70127.5 ns 1.00
array/reductions/reduce/Int64/1d 42924 ns 41220 ns 1.04
array/reductions/reduce/Int64/dims=1 42493 ns 51895 ns 0.82
array/reductions/reduce/Int64/dims=1L 87747 ns 87739 ns 1.00
array/reductions/reduce/Int64/dims=2 59737 ns 59630 ns 1.00
array/reductions/reduce/Int64/dims=2L 84867 ns 84743.5 ns 1.00
array/reverse/1d 18371 ns 18349 ns 1.00
array/reverse/1dL 68939.5 ns 68960 ns 1.00
array/reverse/1dL_inplace 65976.5 ns 65909 ns 1.00
array/reverse/1d_inplace 10268.666666666666 ns 8540.833333333332 ns 1.20
array/reverse/2d 20918 ns 20881 ns 1.00
array/reverse/2dL 73031 ns 72996 ns 1.00
array/reverse/2dL_inplace 66004 ns 65926 ns 1.00
array/reverse/2d_inplace 11200 ns 10076 ns 1.11
array/sorting/1d 2735406.5 ns 2735188.5 ns 1.00
array/sorting/2d 1068948 ns 1069206 ns 1.00
array/sorting/by 3304547 ns 3304125.5 ns 1.00
cuda/synchronization/context/auto 1187.9 ns 1176.2 ns 1.01
cuda/synchronization/context/blocking 943.6333333333333 ns 924.5869565217391 ns 1.02
cuda/synchronization/context/nonblocking 7155.6 ns 6942.1 ns 1.03
cuda/synchronization/stream/auto 1042.3 ns 999.9375 ns 1.04
cuda/synchronization/stream/blocking 846.5555555555555 ns 787.7961165048544 ns 1.07
cuda/synchronization/stream/nonblocking 8277.4 ns 7168.6 ns 1.15
integration/byval/reference 144027.5 ns 143982 ns 1.00
integration/byval/slices=1 145912 ns 145868 ns 1.00
integration/byval/slices=2 284580 ns 284528 ns 1.00
integration/byval/slices=3 423020 ns 422970 ns 1.00
integration/cudadevrt 102567 ns 102612 ns 1.00
integration/volumerhs 23411584 ns 9440461 ns 2.48
kernel/indexing 13408 ns 13181 ns 1.02
kernel/indexing_checked 14088 ns 14081 ns 1.00
kernel/launch 2213.6666666666665 ns 2150.777777777778 ns 1.03
kernel/occupancy 715.4710144927536 ns 672 ns 1.06
kernel/rand 14495 ns 14396 ns 1.01
latency/import 3831674700.5 ns 3814290062.5 ns 1.00
latency/precompile 4590038439 ns 4590207670.5 ns 1.00
latency/ttfp 4407587085.5 ns 4409319020 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@vchuravy (Member, Author) commented Apr 2, 2026

@efaulhaber that was on an H100, right?

@lcw (Contributor) commented Apr 2, 2026

This is great. How did you get the benchmark numbers?

@efaulhaber (Contributor) commented:

> How did you get the benchmark numbers?

This is benchmarking the main kernel of TrixiParticles.jl, which is an SPH neighbor loop computing the forces on particles. There are two divisions in the hot loop, for which I then used the different fast division implementations.

Co-authored-by: M. A. Kowalski <mak60@cam.ac.uk>
Co-authored-by: Erik Faulhaber <erik.faulhaber@web.de>
@codecov Bot commented Apr 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.43%. Comparing base (6ccd4b4) to head (0c041fa).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3077   +/-   ##
=======================================
  Coverage   90.43%   90.43%           
=======================================
  Files         141      141           
  Lines       12025    12025           
=======================================
  Hits        10875    10875           
  Misses       1150     1150           


@maleadt maleadt merged commit 01a0795 into master Apr 14, 2026
2 checks passed
@maleadt maleadt deleted the vc/faster_div branch April 14, 2026 07:05
4 participants