Use specialised implementation of newton_div on CUDA #5140

simone-silvestri merged 29 commits into main

Conversation
glwagner left a comment:

Yes, clean and simple, so if this gives performance improvements and tests pass, I approve.
We may need to pause merging for a bit since I ran into a small problem when adding a test. An expression of the type `output_via_f32 .= Oceananigans.Utils.newton_div.(Float32, π, test_input)` does seem to ignore the override from … Will need to have a look into why that is.
it might be because |
Thank you! It was exactly the problem! I have pushed the test and removed the type annotation for the numerator in the CUDA override. |
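A minimal model of why the numerator annotation mattered (method and symbol names below are hypothetical, not the actual Oceananigans definitions): `π` is an `Irrational`, so an override that annotates the numerator as `Float64` never matches a `π` argument, and the call silently falls through to the generic fallback.

```julia
# Hypothetical sketch: a generic fallback plus an over-annotated override.
my_div(::Type{FT}, a, b) where FT = :generic          # generic fallback
my_div(::Type{Float32}, a::Float64, b) = :override    # numerator annotated as Float64

my_div(Float32, 1.0, 2.0)   # => :override
my_div(Float32, π, 2.0)     # => :generic, since π is an Irrational, not a Float64
```

Dropping the `::Float64` annotation from the numerator makes the override apply to `π` as well.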
```julia
@testset "CUDA newton_div" begin
    # Test that error is small for random denominators from a single binade
    Random.seed!(44)
    test_input = CuArray(rand(1024)) .+ 1.0
```
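One way such an error check could be phrased (a sketch, not the PR's actual test body, and `ulp_error` is an illustrative helper): compare the multiply-by-reciprocal path against the correctly rounded `/` and bound the discrepancy in ULPs.

```julia
# Discrepancy between correctly rounded division and the a * (1/b)
# substitution, measured in ULPs of the exact result.
ulp_error(approx, exact) = abs(approx - exact) / eps(exact)

exact  = Float64(π) / 1.7
approx = Float64(π) * (1 / 1.7)   # stand-in for the reciprocal-based path
ulp_error(approx, exact)          # expected to be only a few ULPs
```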
it could make sense also to add a test that Float32 input behaves as expected
I have added an extra @testset for Float32 just checking that it predicts a reasonable approximation (i.e. within isapprox).
Although I keep feeling that it is an odd place for it, since we are not providing any special implementation for Float32 on CUDA (at the moment; to be fair we could do that, as currently for Float32 there is still a slow path and extra associated complexity to handle subnormal numbers).
Perhaps I should also add a test_newton_div.jl file and make similar tests for the CPU? With a similarly permissive tolerance of …
```julia
# while treating all subnormal numbers as 0.0
# If reciprocal would be subnormal, underflows to 0.0
# 32 least significant bits of the result are filled with 0s
inv_a = ccall("llvm.nvvm.rcp.approx.ftz.d", llvmcall, Float64, (Float64,), a)
```
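The flush-to-zero behaviour the comments above describe can be modelled in plain Julia (illustrative only; the function name is made up, and the real intrinsic additionally zeroes the 32 least significant bits of the result, which is not modelled here):

```julia
# Model of rcp.approx.ftz's underflow: a reciprocal that would be
# subnormal is flushed to 0.0 instead of being represented gradually.
rcp_ftz_model(a) = (r = 1 / a; abs(r) < floatmin(Float64) ? 0.0 : r)

rcp_ftz_model(2.0)     # 0.5, normal path
rcp_ftz_model(1e308)   # 0.0, since 1/1e308 is subnormal and flushes
```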
@wsmoses @avik-pal Reactant tests on CPU are understandably failing with https://buildkite.com/clima/oceananigans/builds/28492#019bc195-ae24-48aa-82c6-1e534be018b3/L709

```
LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.rcp.approx.ftz.d
```

Is there a way to tell Reactant to ignore this?
so reactant currently faithfully represents the cuda kernel. There is no mechanism for determining whether you're in reactant or not within a kernel; we get the code out of cuda.jl. We could add a mechanism for doing so here: https://github.com/EnzymeAD/Reactant.jl/blob/730804beb548017e819b3589b9f6f793a4f19792/ext/ReactantCUDAExt.jl#L1332. We'd need to make a new gpucompiler job/config/state that looks identical to the cuda one, except with a second method overlay table that itself overlays the cuda overlay table.
Or, alternatively, in the Reactant extension can we use Reactant.@reactant_overlay for these methods?
yes, but you have to overlay the whole kernel. reactant_overlay does not apply within a cuda kernel [my comment above was describing what would be required for something within a kernel to be overlaid]
Wouldn't it be sufficient to define overlaid methods for UT.newton_div (what's done here for CUDA) for Reactant? Wouldn't that override the overlay here?
cuda kernels are still compiled with normal cuda.jl. the reactant overlay method table only applies for generic code outside of kernels.
Can't Reactant/Polygeist/whatever raises this to MLIR just handle that intrinsic as a fast 1/x, so that backends can emit fast reciprocals if they want to? (I blame LLVM for having 50 rcp intrinsics that are all different.)
|
@Mikolaj-A-Kowalski looks like there are some conflicts you may have to resolve 😭 Also a doctest is failing -- I'm happy to take care of that (let me know and I will push fixes to this branch) |
|
No need! I will resolve the conflicts and update the failing doctests tomorrow morning. Sadly I didn't have time this afternoon to look at the CI 😅 I was suspecting I might have left something out. |
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main    #5140      +/-   ##
==========================================
- Coverage   74.16%   74.14%   -0.02%
==========================================
  Files         402      403       +1
  Lines       23140    23170      +30
==========================================
+ Hits        17161    17180      +19
- Misses       5979     5990      +11
```
|
Hopefully after this recent push the CUDA tests should actually run, which should remove the test-coverage problem. Another issue is that currently no run with the … |
Force-pushed from 478a761 to 18e7134 (compare)
|
I have cleaned the history and rebased on current main. @glwagner, I think the only potential remaining issues are:

For the failing check, my feeling is that it is mostly because there is no coverage data from the GPU, hence the new CUDA-specific division functions are reported as untested despite being covered by unit tests. However, I am not 100% sure I interpret it correctly. Perhaps I messed up something in the setup and the GPU …

For the change of the default behaviour: is it something we want to do? I imagine it will take defining an extra function that will dispatch on the architecture and return a … |
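The architecture-dispatch idea floated above could look something like this (entirely hypothetical type and function names, not an existing Oceananigans API):

```julia
# Hypothetical: select a division flavour per architecture via dispatch.
abstract type AbstractArch end
struct ArchCPU <: AbstractArch end
struct ArchGPU <: AbstractArch end

default_division(::ArchCPU) = :exact_division
default_division(::ArchGPU) = :backend_optimized_division

default_division(ArchGPU())   # => :backend_optimized_division
```

A user could then override the default by passing an explicit division type instead of relying on the architecture-derived one.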
This reverts commit bd99d5c.
|
I think there is an issue with FP32:

```
----------------------------------------------------------------------
Running: EarthOcean_tripolar_360x180x50_F32_WENOVectorInvariantDefault_WENO7_CATKE_2tr
----------------------------------------------------------------------
[ Info: Benchmark: EarthOcean_tripolar_zstar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
[ Info: Architecture: GPU{CUDA.CUDAKernels.CUDABackend}(CUDA.CUDAKernels.CUDABackend(false, true))
[ Info: Float type: Float64
[ Info: Grid size: 360 × 180 × 50 (3240000 points)
[ Info: Time step: 60.0 s
[ Info: Warmup steps: 5
[ Info: Benchmark steps: 100
[ Info: Running warmup...
ERROR: a error was thrown during kernel execution on thread (225, 1, 1) in block (64, 1, 1).
Stacktrace not available, run Julia on debug level 2 for more details (by passing -g2 to the executable).
[ Info: Running benchmark...
ERROR: LoadError: KernelException: exception thrown during kernel execution on device NVIDIA TITAN V
Stacktrace:
[1] check_exceptions()
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/src/compiler/exceptions.jl:39
[2] device_synchronize(; blocking::Bool, spin::Bool)
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/synchronization.jl:191
[3] device_synchronize
@ /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/synchronization.jl:178 [inlined]
[4] checked_cuModuleLoadDataEx(_module::Base.RefValue{Ptr{CUDA.CUmod_st}}, image::Ptr{UInt8}, numOptions::Int64, options::Vector{CUDA.CUjit_option_enum}, optionValues::Vector{Ptr{Nothing}})
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/module.jl:18
[5] CUDA.CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/module.jl:60
[6] CuModule
@ /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/lib/cudadrv/module.jl:49 [inlined]
[7] link(job::GPUCompiler.CompilerJob, compiled::@NamedTuple{image::Vector{UInt8}, entry::String})
@ CUDA /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/src/compiler/compilation.jl:409
[8] actual_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
@ GPUCompiler /storage5/buildkite-agent/.julia-7456/packages/GPUCompiler/Yuvf5/src/execution.jl:270
[9] cached_compilation(cache::Dict{Any, CUDA.CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
@ GPUCompiler /storage5/buildkite-agent/.julia-7456/packages/GPUCompiler/Yuvf5/src/execution.jl:159
[10] macro expansion
@ /storage5/buildkite-agent/.julia-7456/packages/CUDA/Il00B/src/compiler/execution.jl:373 [inlined]
[11] macro expansion
@ ./lock.jl:376 [inlined]
[12] cufunction(f::typeof(Oceananigans.Models.HydrostaticFreeSurfaceModels.gpu_compute_hydrostatic_free_surface_Gv!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(1749935,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(6836,)}, KernelAbstractions.NDIteration.StaticSize{(256,)}, Oceananigans.Utils.IndexMap, CUDA.CuDeviceVector{Tuple{UInt16, UInt16, UInt16}, 1}}}, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, ImmersedBoundaryGrid{Float32, Periodic, RightCenterFolded, Bounded, OrthogonalSphericalShellGrid{Float32, Periodic, RightCenterFolded, Bounded, Oceananigans.Grids.StaticVerticalDiscretization{OffsetArrays.OffsetVector{Float32, CUDA.CuDeviceVector{Float32, 1}}, OffsetArrays.OffsetVector{Float32, CUDA.CuDeviceVector{Float32, 1}}, OffsetArrays.OffsetVector{Float32, CUDA.CuDeviceVector{Float32, 1}}, OffsetArrays.OffsetVector{Float32, CUDA.CuDeviceVector{Float32, 1}}}, Oceananigans.OrthogonalSphericalShellGrids.Tripolar{Int64, Int64, Int64}, OffsetArrays.OffsetMatrix{Float32, CUDA.CuDeviceMatrix{Float32, 1}}, OffsetArrays.OffsetMatrix{Float32, CUDA.CuDeviceMatrix{Float32, 1}}, OffsetArrays.OffsetMatrix{Float32, CUDA.CuDeviceMatrix{Float32, 1}}, OffsetArrays.OffsetMatrix{Float32, CUDA.CuDeviceMatrix{Float32, 1}}, Nothing, Float32}, PartialCellBottom{Field{Center, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, Float32}, Nothing, Nothing, Nothing}, Tuple{WENOVectorInvariant{5, 5, Float32, false, WENO{5, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{4, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{3, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{2, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, 
UpwindBiased{1, Float32, Nothing, Centered{1, Float32, Nothing}}, Centered{1, Float32, Nothing}}, Centered{2, Float32, Centered{1, Float32, Nothing}}}, Centered{3, Float32, Centered{2, Float32, Centered{1, Float32, Nothing}}}}, Centered{4, Float32, Centered{3, Float32, Centered{2, Float32, Centered{1, Float32, Nothing}}}}}, Oceananigans.Advection.VelocityStencil, WENO{3, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{2, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, UpwindBiased{1, Float32, Nothing, Centered{1, Float32, Nothing}}, Centered{1, Float32, Nothing}}, Centered{2, Float32, Centered{1, Float32, Nothing}}}, WENO{3, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{2, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, UpwindBiased{1, Float32, Nothing, Centered{1, Float32, Nothing}}, Centered{1, Float32, Nothing}}, Centered{2, Float32, Centered{1, Float32, Nothing}}}, WENO{3, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, WENO{2, Float32, Oceananigans.Utils.BackendOptimizedDivision, Nothing, UpwindBiased{1, Float32, Nothing, Centered{1, Float32, Nothing}}, Centered{1, Float32, Nothing}}, Centered{2, Float32, Centered{1, Float32, Nothing}}}, Oceananigans.Advection.OnlySelfUpwinding{Centered{2, Float32, Centered{1, Float32, Nothing}}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.divergence_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.divergence_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.u_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.v_smoothness)}}}, HydrostaticSphericalCoriolis{Oceananigans.Advection.EnstrophyConserving{Float32}, Float32, Oceananigans.Coriolis.HydrostaticFormulation}, CATKEVerticalDiffusivity{VerticallyImplicitTimeDiscretization, Oceananigans.TurbulenceClosures.TKEBasedVerticalDiffusivities.CATKEMixingLength{Float64}, 
Float64, Nothing, Oceananigans.TurbulenceClosures.TKEBasedVerticalDiffusivities.CATKEEquation{Float64}}, Oceananigans.BoundaryConditions.NoFluxBoundaryCondition, @NamedTuple{u::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, v::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, w::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}, SplitExplicitFreeSurface{true, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, @NamedTuple{U::Field{Face, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, V::Field{Center, Face, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}}, @NamedTuple{η̅::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, U̅::Field{Face, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, V̅::Field{Center, Face, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, Ũ::Field{Face, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, Ṽ::Field{Center, Face, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}}, Float32, Nothing, Oceananigans.Models.HydrostaticFreeSurfaceModels.SplitExplicitFreeSurfaces.FixedSubstepNumber{Float32, NTuple{21, Float32}}, Oceananigans.Models.HydrostaticFreeSurfaceModels.SplitExplicitFreeSurfaces.ForwardBackwardScheme}, @NamedTuple{T::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, S::OffsetArrays.OffsetArray{Float32, 3, 
CUDA.CuDeviceArray{Float32, 3, 1}}, e::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}, BuoyancyForce{SeawaterBuoyancy{Float32, SeawaterPolynomials.BoussinesqEquationOfState{SeawaterPolynomials.TEOS10.TEOS10SeawaterPolynomial{Float32}, Float32}, Nothing, Nothing}, Oceananigans.Grids.NegativeZDirection, Nothing}, Oceananigans.TurbulenceClosures.TKEBasedVerticalDiffusivities.CATKEClosureFields{OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Field{Center, Center, Nothing, Nothing, Nothing, Nothing, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, Float32, Nothing, Nothing, Nothing}, @NamedTuple{u::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, v::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}, @NamedTuple{T::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, S::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, e::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}, @NamedTuple{T::Oceananigans.Fields.ZeroField{Int64, 3}, S::Oceananigans.Fields.ZeroField{Int64, 3}, e::OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}}}, OffsetArrays.OffsetArray{Float32, 3, CUDA.CuDeviceArray{Float32, 3, 1}}, @NamedTuple{}, ZCoordinate, @NamedTuple{time::Float64, last_Δt::Float64, last_stage_Δt::Float64, iteration::Int64, stage::Int64}, Returns{Float32}}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Int64})
```
If I read the benchmark name correctly, it was running in Float64. There is also a failing unit test for …

Edit: … |
|
I fear that this one may be tough to solve, since I have run both the unit test and the benchmark locally (on fed538a). They both complete just fine... |
|
Does the benchmark still fail in the same way? I don't have permission to see the job output. |
|
I will make the pipeline public so everyone can see the output |
|
Indeed, it fails with … Is this reproducible on the CPU? Or probably the division paths are different. I would try passing a … |
|
@Mikolaj-A-Kowalski do you have access to the pipeline now? |
Yes I do, thanks. I will have another go at reproducing the failure. Sadly I have had a bad experience with … |
Still executes fine when I try it on L40 locally. |
|
Is that with FP64? |
Yes. Sorry, I based it on the snippet you posted: I think it contained a typo in … |
There was a type instability and the test input was getting promoted to Float64. As a result the Float32 path was never verified, and a typo in the intrinsic name was not caught earlier.
|
It is quite horrifying, but there was a type instability in the test and the … Hence it did not catch the typo in the LLVM intrinsic. |
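The promotion pitfall described here is easy to reproduce in miniature (illustrative, not the actual test code): a `Float64` literal in a broadcast promotes a `Float32` array, so any Float32-specific method downstream is silently never exercised.

```julia
# A Float64 literal in a broadcast promotes the Float32 input, so the
# result is Float64 and Float32 code paths go untested.
x = rand(Float32, 4)
eltype(x .+ 1.0)     # Float64: the 1.0 literal promoted the result
eltype(x .+ 1.0f0)   # Float32: stays in single precision
```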
Benchmark Comparison: PR vs Main

NSYS Kernel Profiling: EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
|
not zero speed up! |
|
@Mikolaj-A-Kowalski want to merge? |
Benchmark Comparison: PR vs Main

NSYS Kernel Profiling: EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
|
we should merge this before too long |
Sorry for the slight delay. I was thinking it might be better for you to do it. I think the last thing the PR is missing is the minor version bump? Presumably, if it is there and the PR is merged, the CI will take care of making the release. Also, to note: our 'fast-path only' version of division has made it to … |
|
Amazing! Then I'll bump minor version and merge. When CUDA 6.1 is out we can revisit the implementation and rely on the CUDA implementation. |
Benchmark Comparison: PR vs Main

NSYS Kernel Profiling: EarthOcean_tripolar_360x180x50_F64_WENOVectorInvariantDefault_WENO7_CATKE_2tr
|
@jackdfranklin @TomMelt

Replaces `newton_div` on Nvidia GPUs with a custom implementation based on the fast path of the implementation of division in SASS. It uses `rcp.approx.ftz` to get an initial guess for the reciprocal of the divisor (error $\epsilon = \mathcal{O}(2^{-20})$) and enhances it with a single cubic (Halley?) iteration.

In the range [2.2250738585072014e-308; 4.494237123190219e307) (i.e. from `floatmin(Float64)` to the first number where `rcp.approx.ftz` underflows) it should give quite accurate results, the main sources of error being the substitution `a/b => a * 1/b` and a few ULPs in the evaluation of the reciprocal.

I have not checked the performance benefit on the current `main`... but when run on the branch with the benchmark case from #4882 we get a ~7% runtime improvement in the `Gu` advection kernel. I attach the reports: reports.tar.gz

The biggest danger with the implementation is that it does not have great underflow behaviour. For values `x < floatmin(Float64)` it evaluates to NaN (as opposed to Inf). This should not be an issue, though, in the context where `newton_div` is used, as the denominator in these cases seems to include a regularisation constant $\epsilon$.

One open question and todo is adding a test for the implementation. Will `test_cuda.jl` be an appropriate place for that?
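The refinement step described above (a low-precision reciprocal seed followed by one cubic iteration) can be sketched in plain Julia. The hardware `rcp.approx.ftz` seed is faked here by rounding through `Float32`, and the function name is illustrative, not the PR's actual implementation:

```julia
# One cubic (Halley) refinement of an approximate reciprocal y0 of 1/b.
# If y0 = (1/b)(1 - e), then b*y1 = (1 - e)(1 + e + e^2) = 1 - e^3,
# so the relative error drops from e to O(e^3) in a single step.
function halley_recip(b, y0)
    e = 1 - b * y0               # residual of the seed
    return y0 * (1 + e + e * e)  # cubic correction
end

b  = 1.7
y0 = Float64(Float32(1 / b))   # crude seed with roughly 2^-24 relative error
y1 = halley_recip(b, y0)       # error now far below the seed's, near 1 ULP
```

With a ~2^-20-accurate seed, one such iteration brings the error down to the order of 2^-60, below Float64 resolution, which is why a single iteration suffices.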