Use specialised implementation of newton_div on CUDA#5140

Merged
simone-silvestri merged 29 commits into main from mak/fast-cuda-div
Apr 15, 2026

Conversation

@Mikolaj-A-Kowalski
Collaborator

@jackdfranklin @TomMelt

Replaces newton_div on Nvidia GPUs with a custom implementation based on the fast path of the SASS implementation of double-precision division.

It uses rcp.approx.ftz to get an initial guess for the reciprocal of the divisor (error $\epsilon = \mathcal{O}(2^{-20})$) and refines it with a single cubic (Halley) iteration.

In the range [2.2250738585072014e-308, 4.494237123190219e307) (i.e. from floatmin(Float64) to the first number where rcp.approx.ftz underflows) it should give quite accurate results. The main sources of error are the substitution a/b => a * (1/b) and a few ULPs in the evaluation of the reciprocal.
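The refinement step can be sketched in plain Julia, with no GPU required. This is an illustrative sketch, not the PR's actual code: the coarse initial guess from rcp.approx.ftz is emulated by rounding the reciprocal through Float32 (comparable relative error), and halley_reciprocal / newton_div_sketch are hypothetical names.

```julia
# Sketch of the cubic (Halley) refinement, assuming a coarse initial guess
# with small relative error (emulated here via Float32; on the GPU it would
# come from rcp.approx.ftz).
function halley_reciprocal(b::Float64)
    x = Float64(inv(Float32(b)))   # coarse guess for 1/b
    e = fma(-b, x, 1.0)            # residual: e = 1 - b*x
    return fma(x, e * e + e, x)    # cubic update: x * (1 + e + e^2)
end

newton_div_sketch(a, b) = a * halley_reciprocal(b)
```

One cubic step roughly cubes the relative error of the guess, so after refinement the remaining error comes essentially from the a * (1/b) substitution itself.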

I have not checked the performance benefit on the current main... but when run on the branch with the benchmark case from #4882 we get a ~7% runtime improvement in the Gu advection kernel. I attach the reports: reports.tar.gz

The biggest danger with the implementation is its poor underflow behaviour. For values x < floatmin(Float64) it evaluates to NaN (as opposed to Inf). This should not be an issue in the contexts where newton_div is used, though, as the denominator in these cases seems to include a regularisation constant $\epsilon$.

One open question and todo is adding a test for the implementation. Would test_cuda.jl be an appropriate place for that?

@glwagner
Member

One open question and todo is adding a test for the implementation. Will test_cuda.jl be appropriate place for that?

Yes, test_cuda.jl looks like the perfect place!

Member

@glwagner left a comment

Clean and simple, so if this gives performance improvements and tests pass, I approve.

@Mikolaj-A-Kowalski
Collaborator Author

We may need to pause merging for a bit since I ran into a small problem when adding a test.

An expression of this type:

output_via_f32 .= Oceananigans.Utils.newton_div.(Float32, π, test_input)

seems to ignore the override from OceananigansCUDAExt.

Will need to have a look into why that is.

@glwagner
Member

output_via_f32 .= Oceananigans.Utils.newton_div.(Float32, π, test_input)

it might be because π is not Float32 or Float64 (it is Irrational). The override is only defined for f32 and f64.
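A minimal, CPU-only illustration of the dispatch issue (newton_div_demo and its methods are hypothetical stand-ins, not the actual Oceananigans definitions): a method restricted to Float64 numerators is simply never selected when the numerator is the Irrational π.

```julia
# Hypothetical stand-ins for the generic function and the CUDA override
newton_div_demo(FT, a, b) = "generic"
newton_div_demo(FT, a::Float64, b) = "override"

typeof(π)                                  # Irrational{:π}: neither Float32 nor Float64
newton_div_demo(Float32, π, 2.0)           # hits the generic method
newton_div_demo(Float32, Float64(π), 2.0)  # hits the override
```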

Comment thread ext/OceananigansCUDAExt.jl Outdated
@Mikolaj-A-Kowalski
Collaborator Author

Mikolaj-A-Kowalski commented Jan 15, 2026

it might be because π is not Float32 or Float64 (it is Irrational). The override is only defined for f32 and f64.

Thank you! That was exactly the problem!

I have pushed the test and removed the type annotation for the numerator in the CUDA override.

Comment thread test/test_cuda.jl Outdated
@testset "CUDA newton_div" begin
# Test that error is small for random denominators from a single binade
Random.seed!(44)
test_input = CuArray(rand(1024)) .+ 1.0
Member

it could make sense also to add a test that Float32 input behaves as expected

Collaborator Author

I have added an extra @testset for Float32, just checking that it gives a reasonable approximation (i.e. within $\sqrt{\epsilon}$, the default for isapprox).
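For reference, the default isapprox relative tolerance is the square root of eps of the (promoted) argument type, which is quite permissive for Float32:

```julia
# Default rtol of isapprox is sqrt(eps) of the type, about 3.45f-4 for Float32
sqrt(eps(Float32))
isapprox(1.0f0, 1.0f0 + 1f-4)   # true: within sqrt(eps(Float32))
isapprox(1.0f0, 1.0f0 + 1f-3)   # false: outside the default tolerance
```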

Although I keep feeling that it is an odd place for it, since we are not providing any special implementation for Float32 on CUDA (at the moment; to be fair, we could, but for Float32 there is currently still a slow path and the extra complexity associated with handling subnormal numbers).

Perhaps I should also add a test_newton_div.jl file with similar tests for the CPU, using the same permissive tolerance of $\sqrt{\epsilon}$?

@navidcy added the numerics 🧮, GPU 👾, and extensions 🧬 labels on Jan 15, 2026
# while treating all subnormal numbers as 0.0
# If reciprocal would be subnormal, underflows to 0.0
# 32 least significant bits of the result are filled with 0s
inv_a = ccall("llvm.nvvm.rcp.approx.ftz.d", llvmcall, Float64, (Float64,), a)
Collaborator

@wsmoses @avik-pal Reactant tests on CPU are understandably failing with https://buildkite.com/clima/oceananigans/builds/28492#019bc195-ae24-48aa-82c6-1e534be018b3/L709

LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.rcp.approx.ftz.d

Is there a way to tell Reactant to ignore this?

Collaborator

So Reactant currently faithfully represents the CUDA kernel in full. There is no mechanism for determining whether you are in Reactant from within a kernel -- we get the code out of CUDA.jl. We could add a mechanism for doing so here: https://github.com/EnzymeAD/Reactant.jl/blob/730804beb548017e819b3589b9f6f793a4f19792/ext/ReactantCUDAExt.jl#L1332. We'd need to make a new GPUCompiler job/config/state that looks identical to the CUDA one, except with a second method overlay table that itself overlays the CUDA overlay table.

Collaborator

Or, alternatively, in the Reactant extension can we use Reactant.@reactant_overlay for these methods?

Collaborator

Yes, but you have to overlay the whole kernel. @reactant_overlay does not apply within a CUDA kernel [my comment above was describing what would be required for something within a kernel to be overlaid].

Collaborator

Wouldn't it be sufficient to define overlaid methods for UT.newton_div (what's done here for CUDA) for Reactant? Wouldn't that override the overlay here?

Collaborator

CUDA kernels are still compiled with normal CUDA.jl; the Reactant overlay method table only applies to generic code outside of kernels.

Collaborator

Ok ok, got it.

Can't Reactant/Polygeist/whatever raises this to MLIR just handle that intrinsic as a fast 1/x, and then backends can emit fast reciprocals if they want to? (I blame LLVM for having 50 rcp intrinsics that are all different.)

Comment thread src/Advection/weno_reconstruction.jl Outdated
@glwagner
Member

glwagner commented Feb 5, 2026

@Mikolaj-A-Kowalski looks like there are some conflicts you may have to resolve 😭

Also a doctest is failing -- I'm happy to take care of that (let me know and I will push fixes to this branch)

@Mikolaj-A-Kowalski
Collaborator Author

No need! I will resolve the conflicts and update the failing doctests tomorrow morning. Sadly I didn't have time this afternoon to look at the CI 😅 I suspected I might have left something out.

@giordano added the breaking change 💔 label on Feb 6, 2026
@codecov

codecov Bot commented Feb 13, 2026

Codecov Report

❌ Patch coverage is 73.91304% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.14%. Comparing base (b8012d5) to head (3c90552).

Files with missing lines Patch % Lines
ext/OceananigansCUDAExt.jl 0.00% 10 Missing ⚠️
src/Advection/materialize_advection.jl 90.90% 1 Missing ⚠️
src/Advection/weno_reconstruction.jl 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5140      +/-   ##
==========================================
- Coverage   74.16%   74.14%   -0.02%     
==========================================
  Files         402      403       +1     
  Lines       23140    23170      +30     
==========================================
+ Hits        17161    17180      +19     
- Misses       5979     5990      +11     
Flag Coverage Δ
buildkite 68.66% <71.73%> (+0.01%) ⬆️
julia 68.66% <71.73%> (+0.01%) ⬆️
reactant_1 6.46% <0.00%> (-0.01%) ⬇️
reactant_2 11.36% <13.04%> (+0.01%) ⬆️
reactant_3 9.76% <36.95%> (+<0.01%) ⬆️


@Mikolaj-A-Kowalski
Collaborator Author

Hopefully after this recent push the CUDA tests should actually run, which should remove the test-coverage problem.

Another issue is that currently none of the benchmark runs exercises newton_div; should I add or modify some?

@Mikolaj-A-Kowalski force-pushed the mak/fast-cuda-div branch 3 times, most recently from 478a761 to 18e7134, on February 27, 2026 11:46
@Mikolaj-A-Kowalski
Collaborator Author

I have cleaned the history and rebased on current main.

@glwagner I think the only potential remaining issues are:

  • Failing coverage check
  • Whether to use the backend-specific scheme as the default, since currently it needs to be explicitly specified by the user

For the failing check, my feeling is that it is mostly because there is no coverage data from the GPU, hence the new CUDA-specific division functions are reported as untested despite being covered by unit tests. However, I am not 100% sure I am interpreting this correctly. Perhaps I messed up something in the setup and the GPU newton_div unit tests do not run?

As for changing the default behaviour: is this something we want to do? I imagine it will take defining an extra function that dispatches on the architecture and returns a BackendOptimizedDivision or ConvertingDivision.
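One possible shape for such a default, as a sketch only: BackendOptimizedDivision and ConvertingDivision are the types discussed in this PR, while default_division and the bare GPU/CPU singletons below are hypothetical stand-ins for the Oceananigans architecture types.

```julia
struct BackendOptimizedDivision end
struct ConvertingDivision end
struct GPU end   # stand-in for Oceananigans' GPU architecture type
struct CPU end   # stand-in for Oceananigans' CPU architecture type

# Architecture-dependent default: pick the specialised division on GPU,
# fall back to the converting scheme on CPU
default_division(::GPU) = BackendOptimizedDivision()
default_division(::CPU) = ConvertingDivision()
```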

giordano added a commit to NumericalEarth/Breeze.jl that referenced this pull request Feb 27, 2026
@simone-silvestri
Collaborator

I think there is an issue with FP32

----------------------------------------------------------------------
--
Running: EarthOcean_tripolar_360x180x50_F32_WENOVectorInvariantDefault_WENO7_CATKE_2tr
----------------------------------------------------------------------
[ Info: Benchmark: EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
[ Info:   Architecture: GPU{CUDA.CUDAKernels.CUDABackend}(CUDA.CUDAKernels.CUDABackend(false, true))
[ Info:   Float type: Float64
[ Info:   Grid size: 360 × 180 × 50 (3240000 points)
[ Info:   Time step: 60.0 s
[ Info:   Warmup steps: 5
[ Info:   Benchmark steps: 100
[ Info:   Running warmup...
ERROR: a error was thrown during kernel execution on thread (225, 1, 1) in block (64, 1, 1).
Stacktrace not available, run Julia on debug level 2 for more details (by passing -g2 to the executable).
 
[ Info:   Running benchmark...
ERROR: LoadError: KernelException: exception thrown during kernel execution on device NVIDIA TITAN V
Stacktrace:
[1] check_exceptions()
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/src/compiler/exceptions.jl:39
[2] device_synchronize(; blocking::Bool, spin::Bool)
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/synchronization.jl:191
[3] device_synchronize
@ /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/synchronization.jl:178 [inlined]
[4] checked_cuModuleLoadDataEx(_module::Base.RefValue{Ptr{CUDA.CUmod_st}}, image::Ptr{UInt8}, numOptions::Int64, options::Vector{CUDA.CUjit_option_enum}, optionValues::Vector{Ptr{Nothing}})
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/module.jl:18
[5] CUDA.CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/module.jl:60
[6] CuModule
@ /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/module.jl:49 [inlined]
[7] link(job::GPUCompiler.CompilerJob, compiled::@NamedTuple{image::Vector{UInt8}, entry::String})
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/src/compiler/compilation.jl:409
[8] actual_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler /storage5/buildkite-agent/.julia-7456/packages/GPUCompiler/Yuvf5/src/execution.jl:270
[9] cached_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
@ GPUCompiler /storage5/buildkite-agent/.julia-7456/packages/GPUCompiler/Yuvf5/src/execution.jl:159
[10] macro expansion
@ /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/src/compiler/execution.jl:373 [inlined]
[11] macro expansion
@ ./lock.jl:376 [inlined]
[12] cufunction(f::typeof(Oceananigans.Models.HydrostaticFreeSurfaceModels.gpu_compute_hydrostatic_free_surface_Gv!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(1749935,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(6836,)}, KernelAbstractions.NDIteration.StaticSize{(256,)}, Oceananigans.Utils.IndexMap, CUDA.CuDeviceVector{Tuple{UInt16, UInt16, UInt16}, 1}}}, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, ImmersedBoundaryGrid{Float32, Periodic, RightCenterFolded, Bounded, OrthogonalSphericalShellGrid{Float32, Periodic, RightCenterFolded, Bounded, Oceananigans.Grids.StaticVerticalDiscretization{OffsetArrays.OffsetVector{Float32, CUDA.CuDeviceVector{Float32, 1}}, OffsetArrays.OffsetVector{Float32, CUDA.CuDeviceVector{Float32, 1}}, OffsetArrays.OffsetVector{Float32, CUDA.CuDeviceVector{Float32, 1}}, OffsetArrays.OffsetVector{Float32, CUDA.CuDeviceVector{Float32, 1}}}, Oceananigans.OrthogonalSphericalShellGrids.Tripolar{Int64, Int64, Int64}, OffsetArrays.OffsetMatrix{Float32, CUDA.CuDeviceMatrix{Float32, 1}}, OffsetArrays.OffsetMatrix{Float32, CUDA.CuDeviceMatrix{Float32, 1}}, OffsetArrays.OffsetMatrix{Float32, CUDA.CuDeviceMatrix{Float32, 1}}, OffsetArrays.OffsetMatrix{Float32, CUDA.CuDeviceMatrix{Float32, 1}}, Nothing, Float32}, PartialCellBottom{Field{Center, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, Float32}, Nothing, Nothing, Nothing}, Tuple{WENOVectorInvariant{5, 5, Float32, false, WENO{5, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{4, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{3, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{2, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, 
UpwindBiased{1, Float32, Nothing, Centered{1, Float32, Nothing}}, Centered{1, Float32, Nothing}}, Centered{2, Float32, Centered{1, Float32, Nothing}}}, Centered{3, Float32, Centered{2, Float32, Centered{1, Float32, Nothing}}}}, Centered{4, Float32, Centered{3, Float32, Centered{2, Float32, Centered{1, Float32, Nothing}}}}}, Oceananigans.Advection.VelocityStencil, WENO{3, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{2, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, UpwindBiased{1, Float32, Nothing, Centered{1, Float32, Nothing}}, Centered{1, Float32, Nothing}}, Centered{2, Float32, Centered{1, Float32, Nothing}}}, WENO{3, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{2, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, UpwindBiased{1, Float32, Nothing, Centered{1, Float32, Nothing}}, Centered{1, Float32, Nothing}}, Centered{2, Float32, Centered{1, Float32, Nothing}}}, WENO{3, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{2, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, UpwindBiased{1, Float32, Nothing, Centered{1, Float32, Nothing}}, Centered{1, Float32, Nothing}}, Centered{2, Float32, Centered{1, Float32, Nothing}}}, Oceananigans.Advection.OnlySelfUpwinding{Centered{2, Float32, Centered{1, Float32, Nothing}}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.divergence_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.divergence_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.u_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.v_smoothness)}}}, HydrostaticSphericalCoriolis{Oceananigans.Advection.EnstrophyConserving{Float32}, Float32, Oceananigans.Coriolis.HydrostaticFormulation}, CATKEVerticalDiffusivity{VerticallyImplicitTimeDiscretization, Oceananigans.TurbulenceClosures.TKEBasedVerticalDiffusivities.CATKEMixingLength{Float64}, 
Float64, Nothing, Oceananigans.TurbulenceClosures.TKEBasedVerticalDiffusivities.CATKEEquation{Float64}}, Oceananigans.BoundaryConditions.NoFluxBoundaryCondition, @NamedTuple{u::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, v::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, w::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}, SplitExplicitFreeSurface{true, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, @NamedTuple{U::Field{Face, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, V::Field{Center, Face, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}}, @NamedTuple{η̅::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, U̅::Field{Face, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, V̅::Field{Center, Face, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, Ũ::Field{Face, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, Ṽ::Field{Center, Face, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}}, Float32, Nothing, Oceananigans.Models.HydrostaticFreeSurfaceModels.SplitExplicitFreeSurfaces.FixedSubstepNumber{Float32, NTuple{21, Float32}}, Oceananigans.Models.HydrostaticFreeSurfaceModels.SplitExplicitFreeSurfaces.ForwardBackwardScheme}, @NamedTuple{T::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, S::OffsetArrays.OffsetArray{Float32, 3, 
CUDA.CuDeviceArray{Float32, 3, 1}}, e::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}, BuoyancyForce{SeawaterBuoyancy{Float32, SeawaterPolynomials.BoussinesqEquationOfState{SeawaterPolynomials.TEOS10.TEOS10SeawaterPolynomial{Float32}, Float32}, Nothing, Nothing}, Oceananigans.Grids.NegativeZDirection, Nothing}, Oceananigans.TurbulenceClosures.TKEBasedVerticalDiffusivities.CATKEClosureFields{OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Field{Center, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, @NamedTuple{u::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, v::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}, @NamedTuple{T::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, S::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, e::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}, @NamedTuple{T::Oceananigans.Fields.ZeroField{Int64, 3}, S::Oceananigans.Fields.ZeroField{Int64, 3}, e::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}}, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, @NamedTuple{}, ZCoordinate, @NamedTuple{time::Float64, last_Δt::Float64, last_stage_Δt::Float64, iteration::Int64, stage::Int64}, Returns{Float32}}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Int64})

@Mikolaj-A-Kowalski
Collaborator Author

Mikolaj-A-Kowalski commented Mar 31, 2026

I think there is an issue with FP32

If I read the benchmark name correctly, it was running in Float64. There is also a failing unit test in test_utils.jl. Given that it was passing before the merges of main, it may be that they brought in extra changes that break things.

Edit:
Confirmed that both the benchmark and the test pass on my side for fed538a

@Mikolaj-A-Kowalski
Collaborator Author

I fear that this one may be tough to solve, since I have run both the unit test and the benchmark locally (on fed538a). They both complete just fine...

@Mikolaj-A-Kowalski
Collaborator Author

Does the benchmark still fail in the same way? I don't have permission to see the job output.

@simone-silvestri
Collaborator

I will make the pipeline public so everyone can see the output

@simone-silvestri
Collaborator

Indeed, it fails with

===============================================================================================
--
 
 
----------------------------------------------------------------------
Running: EarthOcean_tripolar_360x180x50_F32_WENOVectorInvariantDefault_WENO7_CATKE_2tr
----------------------------------------------------------------------
[ Info: Benchmark: EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
[ Info:   Architecture: GPU{CUDA.CUDAKernels.CUDABackend}(CUDA.CUDAKernels.CUDABackend(false, true))
[ Info:   Float type: Float64
[ Info:   Grid size: 360 × 180 × 50 (3240000 points)
[ Info:   Time step: 60.0 s
[ Info:   Warmup steps: 5
[ Info:   Benchmark steps: 100
[ Info:   Running warmup...
ERROR: a error was thrown during kernel execution on thread (97, 1, 1) in block (24, 1, 1).
Stacktrace not available, run Julia on debug level 2 for more details (by passing -g2 to the executable).
 

Is this reproducible on CPU? Or perhaps not, since the division paths are different. I would try passing -g2 to print debug statements.

@simone-silvestri
Collaborator

@Mikolaj-A-Kowalski do you have access to the pipeline now?

@Mikolaj-A-Kowalski
Collaborator Author

@Mikolaj-A-Kowalski do you have access to the pipeline now?

Yes I do. Thanks.

I will have another go at reproducing the failure. Sadly, I have had bad experiences with -g2, as it would put Unicode variable names into comments in the PTX and crash ptxas... but that was a few months ago.

@Mikolaj-A-Kowalski
Collaborator Author

----------------------------------------------------------------------                                                                                 
Running: EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr                                                           
----------------------------------------------------------------------                                                                                 
[ Info: Benchmark: EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
[ Info:   Architecture: GPU{CUDA.CUDAKernels.CUDABackend}(CUDA.CUDAKernels.CUDABackend(false, true))
[ Info:   Float type: Float64
[ Info:   Grid size: 360 × 180 × 50 (3240000 points)
[ Info:   Time step: 60.0 s
[ Info:   Warmup steps: 5
[ Info:   Benchmark steps: 100
[ Info:   Running warmup...
[ Info:   Running benchmark...
[ Info:   Results:
[ Info:     Total time: 19.676 s
[ Info:     Time per step: 0.196757 s
[ Info:     Grid points/s: 1.65e+07
[ Info:     GPU memory usage: 1011.441 MiB

Still executes fine when I try it on an L40 locally.

@simone-silvestri
Collaborator

Is that with FP64?

@Mikolaj-A-Kowalski
Collaborator Author

Is that with FP64?

Yes. Sorry, I based it on the snippet you posted:

----------------------------------------------------------------------
Running: EarthOcean_tripolar_360x180x50_F32_WENOVectorInvariantDefault_WENO7_CATKE_2tr
----------------------------------------------------------------------
[ Info: Benchmark: EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
[ Info:   Architecture: GPU{CUDA.CUDAKernels.CUDABackend}(CUDA.CUDAKernels.CUDABackend(false, true))
[ Info:   Float type: Float64
[ Info:   Grid size: 360 × 180 × 50 (3240000 points)
[ Info:   Time step: 60.0 s
[ Info:   Warmup steps: 5
[ Info:   Benchmark steps: 100
[ Info:   Running warmup...

I think it contained a typo in [ Info: Float type: Float64. I do get the failure with Float32, which is good.

There was a type instability and the test input was getting promoted to Float64. As a result, the Float32 path was never verified and a typo in the intrinsic name was not caught earlier.
@Mikolaj-A-Kowalski
Collaborator Author

It is quite horrifying, but there was a type instability in the test and the Float32 case was still running in Float64 😬

Hence it did not catch the typo in the LLVM intrinsic name.
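The promotion trap is easy to reproduce (a plain-Julia illustration, independent of the actual test code): mixing Float32 data with a Float64 literal silently promotes the whole broadcast.

```julia
a = rand(Float32, 4) .+ 1.0     # 1.0 is a Float64 literal...
eltype(a)                       # Float64: the Float32 path is never exercised

b = rand(Float32, 4) .+ 1.0f0   # a Float32 literal keeps the element type
eltype(b)                       # Float32
```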

@github-actions
Contributor

github-actions Bot commented Apr 1, 2026

Benchmark Comparison

Benchmark Comparison: PR vs Main

Benchmark PR (pts/s) Main (pts/s) Change
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 37131669.116 35415290.875 +4.8%
EarthOcean_tripolar_180x90x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 25230684.632 26195977.811 -3.7%
EarthOcean_tripolar_720x360x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 38247442.700 36840867.877 +3.8%
EarthOcean_tripolar_360x180x50_F32_WENOVectorInvariantDefault_WENO7_CATKE_2tr 49131110.860 50110646.519 -2%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_nothing_2tr 68745039.739 65661292.356 +4.7%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE+Biharmonic_2tr 28365027.916 27332212.838 +3.8%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE+GM+Biharmonic_2tr 11015141.236 10815017.294 +1.9%
EarthOcean_tripolar_360x180x50_F64_nothing_nothing_CATKE_2tr 67179970.255 65455798.900 +2.6%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariant5_WENO5_CATKE_2tr 45469929.230 42639967.205 +6.6%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariant9_WENO9_CATKE_2tr 26699253.153 25036217.146 +6.6%
EarthOcean_lat_lon_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 33533826.486 32189917.038 +4.2%
EarthOcean_immersed_lat_lon_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 34755619.123 33293196.104 +4.4%
EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 34727385.051 33706443.655 +3%
EarthOcean_lat_lon_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 39296884.482 37453902.170 +4.9%
EarthOcean_immersed_lat_lon_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr 37877809.342 36560393.742 +3.6%
EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_3tr 33283009.009 32399029.587 +2.7%

NSYS Kernel Profiling

EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr

Kernel Median (ms) Main (ms) Change Instances Avg (ms) Min (ms) Max (ms)
gpu__rk_substep_turbulent_kinetic_energy_ 4.013 4.012 +0.0% 315 4.022 4.007 4.355
gpu_compute_hydrostatic_free_surface_Gu_ 3.738 3.953 -5.4% 318 3.727 2.590 3.872
gpu_compute_hydrostatic_free_surface_Gv_ 3.680 3.785 -2.8% 318 3.684 2.555 3.980
gpu_compute_hydrostatic_free_surface_Gc_ 2.331 2.523 -7.6% 315 2.338 2.293 2.658
gpu_compute_hydrostatic_free_surface_Gc_ 2.308 2.523 -8.5% 315 2.317 2.272 2.615
gpu_compute_hydrostatic_free_surface_Gc_ 2.297 2.523 -8.9% 315 2.306 2.259 2.633
gpu_compute_CATKE_closure_fields_ 1.615 1.618 -0.2% 318 1.619 1.608 1.745
gpu_compute_TKE_diffusivity_ 1.269 1.270 -0.1% 315 1.272 1.264 1.378
gpu__compute_w_from_continuity_ 0.339 0.340 -0.3% 633 0.339 0.335 0.345
gpu_solve_batched_tridiagonal_system_kernel_ 0.527 0.526 +0.2% 315 0.527 0.519 0.542

@glwagner
Member

glwagner commented Apr 1, 2026

not zero speed up!

@simone-silvestri
Collaborator

@Mikolaj-A-Kowalski want to merge?

@github-actions
Contributor

Benchmark Comparison

Benchmark Comparison: PR vs Main

| Benchmark | PR (pts/s) | Main (pts/s) | Change |
| --- | --- | --- | --- |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 42421181.780 | 40929777.557 | +3.6% |
| EarthOcean_tripolar_180x90x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 25770151.172 | 28354992.513 | -9.1% |
| EarthOcean_tripolar_720x360x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 43383326.198 | 41627380.748 | +4.2% |
| EarthOcean_tripolar_360x180x50_F32_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 57596238.489 | 55857098.415 | +3.1% |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_nothing_2tr | 68459963.001 | 65172936.802 | +5% |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE+Biharmonic_2tr | 31305831.680 | 30169289.515 | +3.8% |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE+GM+Biharmonic_2tr | 11330095.631 | 11199917.487 | +1.2% |
| EarthOcean_tripolar_360x180x50_F64_nothing_nothing_CATKE_2tr | 87163771.337 | 87049954.749 | +0.1% |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariant5_WENO5_CATKE_2tr | 53937198.007 | 49816168.222 | +8.3% |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariant9_WENO9_CATKE_2tr | 29401174.053 | 27490822.223 | +6.9% |
| EarthOcean_lat_lon_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 33289952.083 | 31976314.693 | +4.1% |
| EarthOcean_immersed_lat_lon_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 39029556.572 | 37071701.680 | +5.3% |
| EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 39555094.696 | 38337157.558 | +3.2% |
| EarthOcean_lat_lon_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 38997420.947 | 37259153.682 | +4.7% |
| EarthOcean_immersed_lat_lon_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 42833123.576 | 41159666.620 | +4.1% |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_3tr | 38494237.090 | 36844339.754 | +4.5% |

NSYS Kernel Profiling

EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr

| Kernel | Median (ms) | Main (ms) | Change | Instances | Avg (ms) | Min (ms) | Max (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpu_compute_hydrostatic_free_surface_Gu_ | 3.731 | 3.955 | -5.7% | 318 | 3.719 | 2.593 | 3.874 |
| gpu_compute_hydrostatic_free_surface_Gv_ | 3.688 | 3.786 | -2.6% | 318 | 3.687 | 2.552 | 3.972 |
| gpu_compute_hydrostatic_free_surface_Gc_ | 2.333 | 2.522 | -7.5% | 315 | 2.339 | 2.295 | 2.680 |
| gpu_compute_hydrostatic_free_surface_Gc_ | 2.308 | 2.522 | -8.5% | 315 | 2.316 | 2.265 | 2.638 |
| gpu_compute_hydrostatic_free_surface_Gc_ | 2.301 | 2.522 | -8.8% | 315 | 2.307 | 2.263 | 2.613 |
| gpu__rk_substep_turbulent_kinetic_energy_ | 1.989 | 1.991 | -0.1% | 315 | 1.993 | 1.985 | 2.173 |
| gpu_compute_CATKE_closure_fields_ | 1.565 | 1.567 | -0.1% | 318 | 1.569 | 1.558 | 1.695 |
| gpu__compute_w_from_continuity_ | 0.342 | 0.342 | +0.1% | 633 | 0.342 | 0.337 | 0.349 |
| gpu_compute_TKE_diffusivity_ | 0.642 | 0.643 | -0.2% | 315 | 0.643 | 0.637 | 0.692 |
| gpu__compute_split_explicit_transport_velocities_ | 0.482 | 0.492 | -1.9% | 315 | 0.482 | 0.479 | 0.487 |
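The Change column in these tables is just the relative difference between the PR and main medians. A minimal sketch of that computation (the helper name `pct_change` is illustrative, not part of the benchmark tooling):

```python
def pct_change(pr, main):
    """Percent change of `pr` relative to `main` (negative = faster kernel)."""
    return (pr / main - 1.0) * 100.0

# First row of the kernel table: Gu median 3.731 ms vs 3.955 ms on main.
print(round(pct_change(3.731, 3.955), 1))  # -5.7
```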

@glwagner
Member

we should merge this before too long

@Mikolaj-A-Kowalski
Collaborator Author

> @Mikolaj-A-Kowalski want to merge?

Sorry for the slight delay. I was thinking it might be better for you to do it. I think the last thing the PR is missing is the minor version bump? Presumably, once that is in and the PR is merged, the CI will take care of making the release.

Also, to note: our 'fast-path only' version of division has made it into CUDA.jl as the backend for Base.FastMath.div_fast (JuliaGPU/CUDA.jl#3077), so we might be able to rely on that from CUDA version 6.1 onwards.
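For readers unfamiliar with the fast path being discussed: the idea is to take a cheap approximate reciprocal (on the GPU, `rcp.approx.ftz`, accurate to about 2^-20) and sharpen it with one cubic (Halley) step before multiplying by the numerator. A minimal Python sketch of that refinement step; the function name and the perturbed initial guess standing in for `rcp.approx.ftz` are illustrative only, not the actual CUDA/Julia code:

```python
def halley_recip(b, y0):
    """One cubic (Halley) refinement of an approximate reciprocal y0 ~ 1/b.

    With residual e = 1 - b*y0, the update y0*(1 + e + e^2) roughly
    cubes the relative error of the initial guess."""
    e = 1.0 - b * y0
    return y0 * (1.0 + e + e * e)

# Stand-in initial guess: perturb the true reciprocal by a relative
# error comparable to rcp.approx.ftz's ~2^-20 accuracy.
b = 3.0
y0 = (1.0 / b) * (1.0 + 2.0**-20)
y1 = halley_recip(b, y0)
# One step drives the residual |b*y1 - 1| from ~1e-6 down to rounding level.
assert abs(y1 * b - 1.0) < 1e-12
```

A single such step is why one iteration suffices after the hardware approximation: the 2^-20 initial error cubed lands below double-precision epsilon.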

@simone-silvestri
Collaborator

Amazing! Then I'll bump minor version and merge. When CUDA 6.1 is out we can revisit the implementation and rely on the CUDA implementation.

@simone-silvestri simone-silvestri merged commit c109ce4 into main Apr 15, 2026
15 of 19 checks passed
@simone-silvestri simone-silvestri deleted the mak/fast-cuda-div branch April 15, 2026 09:08
@github-actions
Contributor

Benchmark Comparison

Benchmark Comparison: PR vs Main

| Benchmark | PR (pts/s) | Main (pts/s) | Change |
| --- | --- | --- | --- |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 42441399.560 | 40929777.557 | +3.7% |
| EarthOcean_tripolar_180x90x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 24633970.979 | 28354992.513 | -13.1% |
| EarthOcean_tripolar_720x360x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 43409194.304 | 41627380.748 | +4.3% |
| EarthOcean_tripolar_360x180x50_F32_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 59485337.224 | 55857098.415 | +6.5% |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_nothing_2tr | 68429661.111 | 65172936.802 | +5% |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE+Biharmonic_2tr | 31292349.127 | 30169289.515 | +3.7% |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE+GM+Biharmonic_2tr | 11332231.661 | 11199917.487 | +1.2% |
| EarthOcean_tripolar_360x180x50_F64_nothing_nothing_CATKE_2tr | 86612303.662 | 87049954.749 | -0.5% |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariant5_WENO5_CATKE_2tr | 53906900.321 | 49816168.222 | +8.2% |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariant9_WENO9_CATKE_2tr | 29404981.649 | 27490822.223 | +7% |
| EarthOcean_lat_lon_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 33341379.496 | 31976314.693 | +4.3% |
| EarthOcean_immersed_lat_lon_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 38972744.632 | 37071701.680 | +5.1% |
| EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 39670798.728 | 38337157.558 | +3.5% |
| EarthOcean_lat_lon_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 39010991.545 | 37259153.682 | +4.7% |
| EarthOcean_immersed_lat_lon_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr | 42823618.500 | 41159666.620 | +4% |
| EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_3tr | 38482502.919 | 36844339.754 | +4.4% |

NSYS Kernel Profiling

EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr

| Kernel | Median (ms) | Main (ms) | Change | Instances | Avg (ms) | Min (ms) | Max (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpu_compute_hydrostatic_free_surface_Gu_ | 3.740 | 3.955 | -5.4% | 318 | 3.726 | 2.593 | 3.880 |
| gpu_compute_hydrostatic_free_surface_Gv_ | 3.699 | 3.786 | -2.3% | 318 | 3.696 | 2.569 | 3.973 |
| gpu_compute_hydrostatic_free_surface_Gc_ | 2.331 | 2.522 | -7.6% | 315 | 2.338 | 2.295 | 2.668 |
| gpu_compute_hydrostatic_free_surface_Gc_ | 2.308 | 2.522 | -8.5% | 315 | 2.316 | 2.274 | 2.651 |
| gpu_compute_hydrostatic_free_surface_Gc_ | 2.300 | 2.522 | -8.8% | 315 | 2.308 | 2.264 | 2.618 |
| gpu__rk_substep_turbulent_kinetic_energy_ | 1.991 | 1.991 | +0.0% | 315 | 1.995 | 1.988 | 2.176 |
| gpu_compute_CATKE_closure_fields_ | 1.567 | 1.567 | -0.0% | 318 | 1.570 | 1.558 | 1.703 |
| gpu__compute_w_from_continuity_ | 0.341 | 0.342 | -0.3% | 633 | 0.341 | 0.336 | 0.347 |
| gpu_compute_TKE_diffusivity_ | 0.643 | 0.643 | +0.1% | 315 | 0.644 | 0.639 | 0.694 |
| gpu__compute_split_explicit_transport_velocities_ | 0.491 | 0.492 | -0.1% | 315 | 0.491 | 0.488 | 0.496 |


Labels

- benchmark performance: runs preconfigured benchmarks and spits out timing
- breaking change 💔: Concerning a change which breaks the API
- extensions 🧬
- GPU 👾: Where Oceananigans gets its powers from
- numerics 🧮: So things don't blow up and boil the lobsters alive
