Use specialised implementation of newton_div on CUDA #5140

simone-silvestri merged 29 commits into main

Conversation
glwagner left a comment:

Yes, clean and simple, so if this gives performance improvements and tests pass, I approve.
We may need to pause merging for a bit since I ran into a small problem when adding a test. An expression of the type `output_via_f32 .= Oceananigans.Utils.newton_div.(Float32, π, test_input)` does seem to ignore the override from … Will need to have a look into why that is.
it might be because |
Thank you! It was exactly the problem! I have pushed the test and removed the type annotation for the numerator in the CUDA override. |
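A minimal model of why the numerator annotation mattered (method and symbol names below are hypothetical, not the actual Oceananigans definitions): `π` is an `Irrational`, so an override that annotates the numerator as `Float64` never matches a `π` argument, and the call silently falls through to the generic fallback.

```julia
# Hypothetical sketch: a generic fallback plus an over-annotated override.
my_div(::Type{FT}, a, b) where FT = :generic          # generic fallback
my_div(::Type{Float32}, a::Float64, b) = :override    # numerator annotated as Float64

my_div(Float32, 1.0, 2.0)   # => :override
my_div(Float32, π, 2.0)     # => :generic, since π is an Irrational, not a Float64
```

Dropping the `::Float64` annotation from the numerator makes the override apply to `π` as well.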
```julia
@testset "CUDA newton_div" begin
    # Test that error is small for random denominators from a single binade
    Random.seed!(44)
    test_input = CuArray(rand(1024)) .+ 1.0
```
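One way such an error check could be phrased (a sketch, not the PR's actual test body, and `ulp_error` is an illustrative helper): compare the multiply-by-reciprocal path against the correctly rounded `/` and bound the discrepancy in ULPs.

```julia
# Discrepancy between correctly rounded division and the a * (1/b)
# substitution, measured in ULPs of the exact result.
ulp_error(approx, exact) = abs(approx - exact) / eps(exact)

exact  = Float64(π) / 1.7
approx = Float64(π) * (1 / 1.7)   # stand-in for the reciprocal-based path
ulp_error(approx, exact)          # expected to be only a few ULPs
```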
it could make sense also to add a test that Float32 input behaves as expected
I have added an extra @testset for Float32 just checking that it predicts a reasonable approximation (i.e. within isapprox).
Although I keep feeling that it is an odd place for it, since we are not providing any special implementation for Float32 on CUDA (at the moment; to be fair we could do that, as currently for Float32 there is still a slow path and extra associated complexity to handle subnormal numbers).
Perhaps I should also add a test_newton_div.jl file and make similar tests for the CPU? With a similarly permissive tolerance of …
```julia
# while treating all subnormal numbers as 0.0
# If reciprocal would be subnormal, underflows to 0.0
# 32 least significant bits of the result are filled with 0s
inv_a = ccall("llvm.nvvm.rcp.approx.ftz.d", llvmcall, Float64, (Float64,), a)
```
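The flush-to-zero behaviour the comments above describe can be modelled in plain Julia (illustrative only; the function name is made up, and the real intrinsic additionally zeroes the 32 least significant bits of the result, which is not modelled here):

```julia
# Model of rcp.approx.ftz's underflow: a reciprocal that would be
# subnormal is flushed to 0.0 instead of being represented gradually.
rcp_ftz_model(a) = (r = 1 / a; abs(r) < floatmin(Float64) ? 0.0 : r)

rcp_ftz_model(2.0)     # 0.5, normal path
rcp_ftz_model(1e308)   # 0.0, since 1/1e308 is subnormal and flushes
```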
@wsmoses @avik-pal Reactant tests on CPU are understandably failing with https://buildkite.com/clima/oceananigans/builds/28492#019bc195-ae24-48aa-82c6-1e534be018b3/L709

```
LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.rcp.approx.ftz.d
```

Is there a way to tell Reactant to ignore this?
so reactant currently faithfully represents the cuda kernel. There is no mechanism for determining whether you're in reactant or not within a kernel; we get the code out of cuda.jl. We could add a mechanism for doing so here: https://github.com/EnzymeAD/Reactant.jl/blob/730804beb548017e819b3589b9f6f793a4f19792/ext/ReactantCUDAExt.jl#L1332. We'd need to make a new gpucompiler job/config/state that looks identical to the cuda one, except with a second method overlay table that itself overlays the cuda overlay table.
Or, alternatively, in the Reactant extension can we use Reactant.@reactant_overlay for these methods?
yes, but you have to overlay the whole kernel. reactant_overlay does not apply within a cuda kernel [my comment above was describing what would be required for something within a kernel to be overlaid]
Wouldn't it be sufficient to define overlaid methods for UT.newton_div (what's done here for CUDA) for Reactant? Wouldn't that override the overlay here?
cuda kernels are still compiled with normal cuda.jl. the reactant overlay method table only applies for generic code outside of kernels.
Can't Reactant/Polygeist/whatever raises this to MLIR just handle that intrinsic as a fast 1/x, so that backends can emit fast reciprocals if they want to? (I blame LLVM for having 50 rcp intrinsics that are all different.)
|
@Mikolaj-A-Kowalski looks like there are some conflicts you may have to resolve 😭 Also a doctest is failing -- I'm happy to take care of that (let me know and I will push fixes to this branch) |
|
No need! I will resolve the conflicts and update the failing doctests tomorrow morning. Sadly I didn't have time this afternoon to look at the CI 😅 I was suspecting I might have left something out. |
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main    #5140      +/-   ##
==========================================
- Coverage   74.16%   74.14%   -0.02%
==========================================
  Files         402      403       +1
  Lines       23140    23170      +30
==========================================
+ Hits        17161    17180      +19
- Misses       5979     5990      +11
```
|
Hopefully after this recent push the CUDA tests should actually run, which should remove the test-coverage problem. Another issue is that currently no run with the … |
Force-pushed from 478a761 to 18e7134 (compare)
|
I have cleaned the history and rebased on current main. @glwagner, I think the only potential remaining issues are:

For the failing check, my feeling is that it is mostly because there is no coverage data from the GPU, hence the new CUDA-specific division functions are reported as untested despite being covered by unit tests. However, I am not 100% sure I interpret it correctly. Perhaps I messed up something in the setup and the GPU …

For the change of the default behaviour: is it something we want to do? I imagine it will take defining an extra function that will dispatch on the architecture and return a … |
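The architecture-dispatch idea floated above could look something like this (entirely hypothetical type and function names, not an existing Oceananigans API):

```julia
# Hypothetical: select a division flavour per architecture via dispatch.
abstract type AbstractArch end
struct ArchCPU <: AbstractArch end
struct ArchGPU <: AbstractArch end

default_division(::ArchCPU) = :exact_division
default_division(::ArchGPU) = :backend_optimized_division

default_division(ArchGPU())   # => :backend_optimized_division
```

A user could then override the default by passing an explicit division type instead of relying on the architecture-derived one.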
This reverts commit bd99d5c.
|
I think there is an issue with FP32:

```
----------------------------------------------------------------------
Running: EarthOcean_tripolar_360x180x50_F32_WENOVectorInvariantDefault_WENO7_CATKE_2tr
----------------------------------------------------------------------
[ Info: Benchmark: EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
[ Info: Architecture: GPU{CUDA.CUDAKernels.CUDABackend}(CUDA.CUDAKernels.CUDABackend(false, true))
[ Info: Float type: Float64
[ Info: Grid size: 360 × 180 × 50 (3240000 points)
[ Info: Time step: 60.0 s
[ Info: Warmup steps: 5
[ Info: Benchmark steps: 100
[ Info: Running warmup...
ERROR: a error was thrown during kernel execution on thread (225, 1, 1) in block (64, 1, 1).
Stacktrace not available, run Julia on debug level 2 for more details (by passing -g2 to the executable).
[ Info: Running benchmark...
ERROR: LoadError: KernelException: exception thrown during kernel execution on device NVIDIA TITAN V
Stacktrace:
[1] check_exceptions()
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/src/compiler/exceptions.jl:39
[2] device_synchronize(; blocking::Bool, spin::Bool)
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/synchronization.jl:191
[3] device_synchronize
@ /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/synchronization.jl:178 [inlined]
[4] checked_cuModuleLoadDataEx(_module::Base.RefValue{Ptr{CUDA.CUmod_st}}, image::Ptr{UInt8}, numOptions::Int64, options::Vector{CUDA.CUjit_option_enum}, optionValues::Vector{Ptr{Nothing}})
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/module.jl:18
[5] CUDA.CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/module.jl:60
[6] CuModule
@ /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/module.jl:49 [inlined]
[7] link(job::GPUCompiler.CompilerJob, compiled::@NamedTuple{image::Vector{UInt8}, entry::String})
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/src/compiler/compilation.jl:409
[8] actual_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler /storage5/buildkite-agent/.julia-7456/packages/GPUCompiler/Yuvf5/src/execution.jl:270
[9] cached_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
@ GPUCompiler /storage5/buildkite-agent/.julia-7456/packages/GPUCompiler/Yuvf5/src/execution.jl:159
[10] macro expansion
@ /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/src/compiler/execution.jl:373 [inlined]
[11] macro expansion
@ ./lock.jl:376 [inlined]
[12] cufunction(f::typeof(Oceananigans.Models.HydrostaticFreeSurfaceModels.gpu_compute_hydrostatic_free_surface_Gv!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(1749935,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(6836,)}, KernelAbstractions.NDIteration.StaticSize{(256,)}, Oceananigans.Utils.IndexMap, CUDA.CuDeviceVector{Tuple{UInt16, UInt16, UInt16}, 1}}}, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, ImmersedBoundaryGrid{Float32, Periodic, RightCenterFolded, Bounded, OrthogonalSphericalShellGrid{Float32, Periodic, RightCenterFolded, Bounded, Oceananigans.Grids.StaticVerticalDiscretization{OffsetArrays.OffsetVector{Float32, CUDA.CuDeviceVector{Float32, 1}}, OffsetArrays.OffsetVector{Float32, CUDA.CuDeviceVector{Float32, 1}}, OffsetArrays.OffsetVector{Float32, CUDA.CuDeviceVector{Float32, 1}}, OffsetArrays.OffsetVector{Float32, CUDA.CuDeviceVector{Float32, 1}}}, Oceananigans.OrthogonalSphericalShellGrids.Tripolar{Int64, Int64, Int64}, OffsetArrays.OffsetMatrix{Float32, CUDA.CuDeviceMatrix{Float32, 1}}, OffsetArrays.OffsetMatrix{Float32, CUDA.CuDeviceMatrix{Float32, 1}}, OffsetArrays.OffsetMatrix{Float32, CUDA.CuDeviceMatrix{Float32, 1}}, OffsetArrays.OffsetMatrix{Float32, CUDA.CuDeviceMatrix{Float32, 1}}, Nothing, Float32}, PartialCellBottom{Field{Center, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, Float32}, Nothing, Nothing, Nothing}, Tuple{WENOVectorInvariant{5, 5, Float32, false, WENO{5, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{4, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{3, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{2, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, 
UpwindBiased{1, Float32, Nothing, Centered{1, Float32, Nothing}}, Centered{1, Float32, Nothing}}, Centered{2, Float32, Centered{1, Float32, Nothing}}}, Centered{3, Float32, Centered{2, Float32, Centered{1, Float32, Nothing}}}}, Centered{4, Float32, Centered{3, Float32, Centered{2, Float32, Centered{1, Float32, Nothing}}}}}, Oceananigans.Advection.VelocityStencil, WENO{3, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{2, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, UpwindBiased{1, Float32, Nothing, Centered{1, Float32, Nothing}}, Centered{1, Float32, Nothing}}, Centered{2, Float32, Centered{1, Float32, Nothing}}}, WENO{3, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{2, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, UpwindBiased{1, Float32, Nothing, Centered{1, Float32, Nothing}}, Centered{1, Float32, Nothing}}, Centered{2, Float32, Centered{1, Float32, Nothing}}}, WENO{3, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{2, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, UpwindBiased{1, Float32, Nothing, Centered{1, Float32, Nothing}}, Centered{1, Float32, Nothing}}, Centered{2, Float32, Centered{1, Float32, Nothing}}}, Oceananigans.Advection.OnlySelfUpwinding{Centered{2, Float32, Centered{1, Float32, Nothing}}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.divergence_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.divergence_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.u_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.v_smoothness)}}}, HydrostaticSphericalCoriolis{Oceananigans.Advection.EnstrophyConserving{Float32}, Float32, Oceananigans.Coriolis.HydrostaticFormulation}, CATKEVerticalDiffusivity{VerticallyImplicitTimeDiscretization, Oceananigans.TurbulenceClosures.TKEBasedVerticalDiffusivities.CATKEMixingLength{Float64}, 
Float64, Nothing, Oceananigans.TurbulenceClosures.TKEBasedVerticalDiffusivities.CATKEEquation{Float64}}, Oceananigans.BoundaryConditions.NoFluxBoundaryCondition, @NamedTuple{u::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, v::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, w::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}, SplitExplicitFreeSurface{true, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, @NamedTuple{U::Field{Face, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, V::Field{Center, Face, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}}, @NamedTuple{η̅::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, U̅::Field{Face, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, V̅::Field{Center, Face, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, Ũ::Field{Face, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, Ṽ::Field{Center, Face, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}}, Float32, Nothing, Oceananigans.Models.HydrostaticFreeSurfaceModels.SplitExplicitFreeSurfaces.FixedSubstepNumber{Float32, NTuple{21, Float32}}, Oceananigans.Models.HydrostaticFreeSurfaceModels.SplitExplicitFreeSurfaces.ForwardBackwardScheme}, @NamedTuple{T::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, S::OffsetArrays.OffsetArray{Float32, 3, 
CUDA.CuDeviceArray{Float32, 3, 1}}, e::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}, BuoyancyForce{SeawaterBuoyancy{Float32, SeawaterPolynomials.BoussinesqEquationOfState{SeawaterPolynomials.TEOS10.TEOS10SeawaterPolynomial{Float32}, Float32}, Nothing, Nothing}, Oceananigans.Grids.NegativeZDirection, Nothing}, Oceananigans.TurbulenceClosures.TKEBasedVerticalDiffusivities.CATKEClosureFields{OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Field{Center, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, @NamedTuple{u::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, v::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}, @NamedTuple{T::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, S::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, e::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}, @NamedTuple{T::Oceananigans.Fields.ZeroField{Int64, 3}, S::Oceananigans.Fields.ZeroField{Int64, 3}, e::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}}, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, @NamedTuple{}, ZCoordinate, @NamedTuple{time::Float64, last_Δt::Float64, last_stage_Δt::Float64, iteration::Int64, stage::Int64}, Returns{Float32}}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Int64})
```
If I read the benchmark name correctly, it was running in Float64. There is also a failing unit test for …

Edit: … |
|
I fear that this one may be tough to solve, since I have run both the unit test and the benchmark locally (on fed538a). They both complete just fine... |
|
Does the benchmark still fail in the same way? I don't have permission to see the job output. |
|
I will make the pipeline public so everyone can see the output |
|
Indeed, it fails with … Is this reproducible on the CPU? Or probably the division paths are different. I would try passing a … |
|
@Mikolaj-A-Kowalski do you have access to the pipeline now? |
Yes I do, thanks. I will have another go at reproducing the failure. Sadly I have had a bad experience with … |
Still executes fine when I try it on L40 locally. |
|
Is that with FP64? |
Yes. Sorry, I based it on the snippet you posted: I think it contained a typo in … |
There was a type instability and the test input was getting promoted to Float64. As a result the Float32 path was never verified, and a typo in the intrinsic name was not caught earlier.
|
It is quite horrifying, but there was a type instability in the test and the … Hence it did not catch the typo in the LLVM intrinsic. |
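The promotion pitfall described here is easy to reproduce in miniature (illustrative, not the actual test code): a `Float64` literal in a broadcast promotes a `Float32` array, so any Float32-specific method downstream is silently never exercised.

```julia
# A Float64 literal in a broadcast promotes the Float32 input, so the
# result is Float64 and Float32 code paths go untested.
x = rand(Float32, 4)
eltype(x .+ 1.0)     # Float64: the 1.0 literal promoted the result
eltype(x .+ 1.0f0)   # Float32: stays in single precision
```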
Benchmark Comparison: PR vs Main

NSYS Kernel Profiling: EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
|
not zero speed up! |
|
@Mikolaj-A-Kowalski want to merge? |
Benchmark Comparison: PR vs Main

NSYS Kernel Profiling: EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
|
we should merge this before too long |
Sorry for the slight delay. I was thinking it might be better for you to do it. I think the last thing the PR is missing is the minor version bump? Presumably, if it is there and the PR is merged, the CI will take care of making the release. Also, to note: our 'fast-path only' version of division has made it to … |
|
Amazing! Then I'll bump minor version and merge. When CUDA 6.1 is out we can revisit the implementation and rely on the CUDA implementation. |
Benchmark Comparison: PR vs Main

NSYS Kernel Profiling: EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
|
@jackdfranklin @TomMelt

Replaces `newton_div` on Nvidia GPUs with a custom implementation based on the fast path of the implementation of division in SASS. It uses `rcp.approx.ftz` to get an initial guess for the reciprocal of the divisor (error $\epsilon = \mathcal{O}(2^{-20})$) and enhances it with a single cubic (Halley?) iteration.

In the range [2.2250738585072014e-308; 4.494237123190219e307) (i.e. from `floatmin(Float64)` to the first number where `rcp.approx.ftz` underflows) it should give quite accurate results, the main sources of error being the substitution `a/b => a * 1/b` and a few ULPs in the evaluation of the reciprocal.

I have not checked the performance benefit on the current `main`... but when run on the branch with the benchmark case from #4882 we get a ~7% runtime improvement in the `Gu` advection kernel. I attach the reports: reports.tar.gz

The biggest danger with the implementation is that it does not have great underflow behaviour. For values `x < floatmin(Float64)` it evaluates to NaN (as opposed to Inf). This should not be an issue, though, in the context where `newton_div` is used, as the denominator in these cases seems to include a regularisation constant $\epsilon$.

One open question and todo is adding a test for the implementation. Will `test_cuda.jl` be an appropriate place for that?
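The refinement step described above (a low-precision reciprocal seed followed by one cubic iteration) can be sketched in plain Julia. The hardware `rcp.approx.ftz` seed is faked here by rounding through `Float32`, and the function name is illustrative, not the PR's actual implementation:

```julia
# One cubic (Halley) refinement of an approximate reciprocal y0 of 1/b.
# If y0 = (1/b)(1 - e), then b*y1 = (1 - e)(1 + e + e^2) = 1 - e^3,
# so the relative error drops from e to O(e^3) in a single step.
function halley_recip(b, y0)
    e = 1 - b * y0               # residual of the seed
    return y0 * (1 + e + e * e)  # cubic correction
end

b  = 1.7
y0 = Float64(Float32(1 / b))   # crude seed with roughly 2^-24 relative error
y1 = halley_recip(b, y0)       # error now far below the seed's, near 1 ULP
```

With a ~2^-20-accurate seed, one such iteration brings the error down to the order of 2^-60, below Float64 resolution, which is why a single iteration suffices.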