
Implement fast_div using fast rcp #3077

Merged
maleadt merged 2 commits into master from vc/faster_div
Apr 14, 2026

Conversation

@vchuravy (Member) commented Apr 2, 2026

Working with @efaulhaber two weeks ago, I was reminded of the slowness of division on Nvidia GPUs.

On top of that, @fastmath a/b for Float64 currently just becomes an fdiv fast, which then lowers to a normal NVPTX division, which SASS helpfully turns into a function call.

@efaulhaber has some numbers for his hot kernel:

# fdiv double %143, %144, !dbg !528
# 10.159 ms
# return x / y

# fdiv fast double %143, %144, !dbg !530
# 10.114 ms
# return Base.FastMath.div_fast(x, y)

Using the simple implementation of a/b = a * 1/b:

# fdiv double 1.000000e+00, %145, !dbg !530
# fmul double %144, %146, !dbg !533
# 6.878 ms
# return x * (1 / y)

# fdiv double 1.000000e+00, %145, !dbg !532
# fmul double %144, %146, !dbg !535
# 6.852 ms
# return x * inv(y)

did speed his code up, but that might have more to do with the additional code-motion opportunities this affords.
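
To illustrate the code-motion point, here is a minimal Python sketch (not code from this PR, with made-up data): once the division is rewritten as a multiplication by a reciprocal, the reciprocal of a loop-invariant divisor can be hoisted out of the loop, leaving only a multiply per iteration.

```python
# Minimal sketch (not from this PR): rewriting a/b as a * (1/b)
# lets the compiler (or the programmer) hoist the expensive
# reciprocal out of a loop when the divisor is loop-invariant.

def divide_each(xs, y):
    # naive: one division per element
    return [x / y for x in xs]

def divide_each_hoisted(xs, y):
    inv_y = 1.0 / y  # computed once, outside the loop
    return [x * inv_y for x in xs]

xs = [1.0, 2.0, 3.0, 4.0]
a = divide_each(xs, 8.0)
b = divide_each_hoisted(xs, 8.0)
# for a power-of-two divisor both agree exactly: [0.125, 0.25, 0.375, 0.5]
```

For non-power-of-two divisors the two forms can differ by a rounding error in the last bit, which is why LLVM only performs this rewrite under the `arcp`/`fast` flags.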

As an example, NVIDIA Warp uses the approx.ftz instruction to obtain a fast_div implementation.

Using @efaulhaber's measurements:

# call double @llvm.nvvm.rcp.approx.ftz.d(double %118), !dbg !413
# fmul double %117, %119, !dbg !418
# 4.758 ms
# return x * fast_inv_cuda_nofma(y)

But what is the loss of accuracy we are incurring here?

julia> y2 = CUDA.rand(Float64, 100_000);
julia> maximum(inv.(y2) .- fast_inv_cuda_nofma.(y2))
0.007475190221157391

# Without numbers close to zero
julia> y2 = CUDA.rand(Float64, 100_000) .+ 0.1;
julia> maximum(inv.(y2) .- fast_inv_cuda_nofma.(y2))
7.105216770497691e-6

Pretty bad. Meanwhile, Oceananigans is facing a similar problem in CliMA/Oceananigans.jl#5140, where @Mikolaj-A-Kowalski is improving the accuracy of inv_fast by performing an additional iteration.

@efaulhaber tested this as:

# call double @llvm.nvvm.rcp.approx.ftz.d(double %118), !dbg !413
# fneg double %118, !dbg !418
# call double @llvm.fma.f64(double %119, double %120, double 1.000000e+00), !dbg !420
# call double @llvm.fma.f64(double %121, double %121, double %121), !dbg !422
# call double @llvm.fma.f64(double %122, double %119, double %119), !dbg !424
# fmul double %117, %123, !dbg !426
# 4.844 ms
# return x * fast_inv_cuda(y)

So a very small additional cost.

But the gain in accuracy is significant:

julia> maximum(inv.(y2) .- fast_inv_cuda.(y2))
7.105427357601002e-15
julia> maximum(inv.(y2) .- fast_inv_cuda.(y2))
8.881784197001252e-16
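
The three-fma sequence above is a polynomial correction of the approximate reciprocal: with r ≈ 1/y and e = 1 − y·r, it computes r·(1 + e + e²), which turns a relative error of e into e³. A Python sketch of the same arithmetic (a hypothetical illustration, using plain floats rather than hardware fma, which is enough to show the error reduction; the actual implementation relies on fused fma to reach full double precision):

```python
def refine_reciprocal(y, r):
    """One correction step mirroring the fma sequence above:
         e = 1 - y*r       # fma(-y, r, 1.0)
         c = e + e*e       # fma(e, e, e)
         r' = r + r*c      # fma(c, r, r)
       giving r' = r*(1 + e + e^2), so the relative error e becomes e^3."""
    e = 1.0 - y * r
    c = e + e * e
    return r + r * c

# demo with a deliberately crude initial guess for 1/3
y = 3.0
r0 = 0.3
r1 = refine_reciprocal(y, r0)
# |r0 - 1/3| ~ 3.3e-2, |r1 - 1/3| ~ 3.3e-4: the e^3 behavior in action
```

Starting from the much more accurate rcp.approx.ftz.d result instead of a crude guess, one such step is what closes most of the gap to a correctly rounded inv, as the measurements above show.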

@vchuravy (Member, Author) commented Apr 2, 2026

@lcw this might interest you, since we were trying things along these lines in the volumerhs benchmark:

let (jlf, f) = (:div_arcp, :div)
    for (T, llvmT) in ((:Float32, "float"), (:Float64, "double"))
        ir = """
            %x = f$f fast $llvmT %0, %1
            ret $llvmT %x
            """
        @eval begin
            # the @pure is necessary so that we can constant propagate.
            @inline Base.@pure function $jlf(a::$T, b::$T)
                Base.llvmcall($ir, $T, Tuple{$T, $T}, a, b)
            end
        end
    end
    @eval function $jlf(args...)
        Base.$jlf(args...)
    end
end
rcp(x) = div_arcp(one(x), x) # still leads to rcp.rn, which is also a function call
# div_fast(x::Float32, y::Float32) = ccall("extern __nv_fast_fdividef", llvmcall, Cfloat, (Cfloat, Cfloat), x, y)
# rcp(x) = div_fast(one(x), x)

@efaulhaber (Contributor) commented:

I made this table to see what is happening with FP32 and FP64:

| Variant | LLVM operation (FP32) | Runtime (FP32) | LLVM operation (FP64) | Runtime (FP64) |
| --- | --- | --- | --- | --- |
| `x / y` | `fdiv float %143, %144` | 4.894 ms | `fdiv double %143, %144` | 10.159 ms |
| `x * (1 / y)` | `fdiv float 1.000000e+00, %145`<br>`fmul float %144, %146` | 4.227 ms | `fdiv double 1.000000e+00, %145`<br>`fmul double %144, %146` | 6.878 ms |
| `x * inv(y)` | `call float @llvm.nvvm.rcp.rn.f(float %118)`<br>`fmul float %117, %119` | 3.620 ms | `fdiv double 1.000000e+00, %145`<br>`fmul double %144, %146` | 6.852 ms |
| `x * fast_inv_cuda(y)` | `call float @llvm.nvvm.rcp.approx.ftz.f(float %118)`<br>`fmul float %117, %119` | 3.281 ms | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118)`<br>`fneg double %118`<br>`call double @llvm.fma.f64(double %119, double %120, double 1.000000e+00)`<br>`call double @llvm.fma.f64(double %121, double %121, double %121)`<br>`call double @llvm.fma.f64(double %122, double %119, double %119)`<br>`fmul double %117, %123` | 4.844 ms |
| `Base.FastMath.div_fast(x, y)` | `call float @llvm.nvvm.div.approx.f(float %118, float %115)` | 3.114 ms | `fdiv fast double %143, %144` | 10.114 ms |
| `x * fast_inv_cuda_nofma(y)` | | | `call double @llvm.nvvm.rcp.approx.ftz.d(double %118)`<br>`fmul double %117, %119` | 4.758 ms |

@github-actions Bot left a comment

CUDA.jl Benchmarks

Details
Benchmark suite Current: 0c041fa Previous: 6ccd4b4 Ratio
array/accumulate/Float32/1d 100964 ns 101723 ns 0.99
array/accumulate/Float32/dims=1 76402 ns 76608 ns 1.00
array/accumulate/Float32/dims=1L 1583852 ns 1585294 ns 1.00
array/accumulate/Float32/dims=2 143663.5 ns 143948 ns 1.00
array/accumulate/Float32/dims=2L 657387 ns 657945 ns 1.00
array/accumulate/Int64/1d 118535 ns 118967 ns 1.00
array/accumulate/Int64/dims=1 79793.5 ns 79956 ns 1.00
array/accumulate/Int64/dims=1L 1694430 ns 1694445 ns 1.00
array/accumulate/Int64/dims=2 155648 ns 156040 ns 1.00
array/accumulate/Int64/dims=2L 961725 ns 961840 ns 1.00
array/broadcast 20554 ns 20347 ns 1.01
array/construct 1290.5 ns 1311.9 ns 0.98
array/copy 19018 ns 18931 ns 1.00
array/copyto!/cpu_to_gpu 218646 ns 215113 ns 1.02
array/copyto!/gpu_to_cpu 286175.5 ns 283517 ns 1.01
array/copyto!/gpu_to_gpu 11459 ns 11647 ns 0.98
array/iteration/findall/bool 132313 ns 132615 ns 1.00
array/iteration/findall/int 149520 ns 149623 ns 1.00
array/iteration/findfirst/bool 82180 ns 82175 ns 1.00
array/iteration/findfirst/int 84533 ns 84437 ns 1.00
array/iteration/findmin/1d 86113 ns 87647 ns 0.98
array/iteration/findmin/2d 117447 ns 117309 ns 1.00
array/iteration/logical 200576.5 ns 203627.5 ns 0.99
array/iteration/scalar 67222 ns 68729 ns 0.98
array/permutedims/2d 52451 ns 52820 ns 0.99
array/permutedims/3d 52927 ns 52914 ns 1.00
array/permutedims/4d 51947 ns 51983 ns 1.00
array/random/rand/Float32 13137 ns 13104 ns 1.00
array/random/rand/Int64 30364 ns 37312 ns 0.81
array/random/rand!/Float32 8540.666666666666 ns 8603.333333333334 ns 0.99
array/random/rand!/Int64 34445.5 ns 34156 ns 1.01
array/random/randn/Float32 38562 ns 38723.5 ns 1.00
array/random/randn!/Float32 31437 ns 31520 ns 1.00
array/reductions/mapreduce/Float32/1d 34900.5 ns 35427 ns 0.99
array/reductions/mapreduce/Float32/dims=1 44447 ns 49562 ns 0.90
array/reductions/mapreduce/Float32/dims=1L 51989 ns 51766 ns 1.00
array/reductions/mapreduce/Float32/dims=2 56946 ns 56838 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 69751.5 ns 69604 ns 1.00
array/reductions/mapreduce/Int64/1d 43038 ns 43423 ns 0.99
array/reductions/mapreduce/Int64/dims=1 42447.5 ns 44694.5 ns 0.95
array/reductions/mapreduce/Int64/dims=1L 87748 ns 87805 ns 1.00
array/reductions/mapreduce/Int64/dims=2 59719 ns 60051.5 ns 0.99
array/reductions/mapreduce/Int64/dims=2L 84965.5 ns 85186 ns 1.00
array/reductions/reduce/Float32/1d 35216 ns 35458 ns 0.99
array/reductions/reduce/Float32/dims=1 45397.5 ns 46307.5 ns 0.98
array/reductions/reduce/Float32/dims=1L 52095 ns 52046 ns 1.00
array/reductions/reduce/Float32/dims=2 57168 ns 57117 ns 1.00
array/reductions/reduce/Float32/dims=2L 70192 ns 70127.5 ns 1.00
array/reductions/reduce/Int64/1d 42924 ns 41220 ns 1.04
array/reductions/reduce/Int64/dims=1 42493 ns 51895 ns 0.82
array/reductions/reduce/Int64/dims=1L 87747 ns 87739 ns 1.00
array/reductions/reduce/Int64/dims=2 59737 ns 59630 ns 1.00
array/reductions/reduce/Int64/dims=2L 84867 ns 84743.5 ns 1.00
array/reverse/1d 18371 ns 18349 ns 1.00
array/reverse/1dL 68939.5 ns 68960 ns 1.00
array/reverse/1dL_inplace 65976.5 ns 65909 ns 1.00
array/reverse/1d_inplace 10268.666666666666 ns 8540.833333333332 ns 1.20
array/reverse/2d 20918 ns 20881 ns 1.00
array/reverse/2dL 73031 ns 72996 ns 1.00
array/reverse/2dL_inplace 66004 ns 65926 ns 1.00
array/reverse/2d_inplace 11200 ns 10076 ns 1.11
array/sorting/1d 2735406.5 ns 2735188.5 ns 1.00
array/sorting/2d 1068948 ns 1069206 ns 1.00
array/sorting/by 3304547 ns 3304125.5 ns 1.00
cuda/synchronization/context/auto 1187.9 ns 1176.2 ns 1.01
cuda/synchronization/context/blocking 943.6333333333333 ns 924.5869565217391 ns 1.02
cuda/synchronization/context/nonblocking 7155.6 ns 6942.1 ns 1.03
cuda/synchronization/stream/auto 1042.3 ns 999.9375 ns 1.04
cuda/synchronization/stream/blocking 846.5555555555555 ns 787.7961165048544 ns 1.07
cuda/synchronization/stream/nonblocking 8277.4 ns 7168.6 ns 1.15
integration/byval/reference 144027.5 ns 143982 ns 1.00
integration/byval/slices=1 145912 ns 145868 ns 1.00
integration/byval/slices=2 284580 ns 284528 ns 1.00
integration/byval/slices=3 423020 ns 422970 ns 1.00
integration/cudadevrt 102567 ns 102612 ns 1.00
integration/volumerhs 23411584 ns 9440461 ns 2.48
kernel/indexing 13408 ns 13181 ns 1.02
kernel/indexing_checked 14088 ns 14081 ns 1.00
kernel/launch 2213.6666666666665 ns 2150.777777777778 ns 1.03
kernel/occupancy 715.4710144927536 ns 672 ns 1.06
kernel/rand 14495 ns 14396 ns 1.01
latency/import 3831674700.5 ns 3814290062.5 ns 1.00
latency/precompile 4590038439 ns 4590207670.5 ns 1.00
latency/ttfp 4407587085.5 ns 4409319020 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@vchuravy (Member, Author) commented Apr 2, 2026

@efaulhaber that was on an H100, right?

@lcw (Contributor) commented Apr 2, 2026

This is great. How did you get the benchmark numbers?

@efaulhaber (Contributor) commented:

> How did you get the benchmark numbers?

This is benchmarking the main kernel of TrixiParticles.jl, which is an SPH neighbor loop computing the forces on particles. There are two divisions in the hot loop, for which I then used the different fast division implementations.

Co-authored-by: M. A. Kowalski <mak60@cam.ac.uk>
Co-authored-by: Erik Faulhaber <erik.faulhaber@web.de>
@codecov Bot commented Apr 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.43%. Comparing base (6ccd4b4) to head (0c041fa).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3077   +/-   ##
=======================================
  Coverage   90.43%   90.43%           
=======================================
  Files         141      141           
  Lines       12025    12025           
=======================================
  Hits        10875    10875           
  Misses       1150     1150           


@maleadt maleadt merged commit 01a0795 into master Apr 14, 2026
2 checks passed
@maleadt maleadt deleted the vc/faster_div branch April 14, 2026 07:05
4 participants