Use specialised implementation of newton_div on CUDA #5140
In the CUDA extension module (`OceananigansCUDAExt`):

```diff
@@ -132,4 +132,31 @@
 @inline UT.sync_device!(::CUDAGPU) = CUDA.synchronize()
 @inline UT.sync_device!(::CUDABackend) = CUDA.synchronize()

+# Use faster versions of `newton_div` on Nvidia GPUs
+CUDA.@device_override UT.newton_div(::Type{UT.BackendOptimizedDivision}, a, b) = a * fast_inv_cuda(b)
+
+function fast_inv_cuda(a::Float64)
+    # Get the approximate reciprocal:
+    # https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-rcp-approx-ftz-f64
+    # This instruction chops off the last 32 bits of the mantissa and computes the inverse,
+    # treating all subnormal numbers as 0.0.
+    # If the reciprocal would be subnormal, it underflows to 0.0.
+    # The 32 least significant bits of the result are filled with zeros.
+    inv_a = ccall("llvm.nvvm.rcp.approx.ftz.d", llvmcall, Float64, (Float64,), a)
+
+    # Recover the missing 32 bits of mantissa with a single cubic iteration
+    e = fma(inv_a, -a, 1.0)
+    e = fma(e, e, e)
+    inv_a = fma(e, inv_a, inv_a)
+    return inv_a
+end
+
+function fast_inv_cuda(a::Float32)
+    # This instruction computes the reciprocal, flushing subnormals to 0.0;
+    # hence for subnormal inputs it returns Inf.
+    # For large numbers whose reciprocal is subnormal, it underflows to 0.0.
+    inv_a = ccall("llvm.nvvm.rcp.approx.ftz.f32", llvmcall, Float32, (Float32,), a)
```
**Collaborator:** Is this different from the standard CUDA single-precision fast reciprocal (i.e. the one called when using [comment truncated]

**Author:** Sorry for the delay, I missed this comment. They are a bit different. The version implemented here, which calls the intrinsic directly, uses fewer instructions (2 [comment truncated]
```diff
+    return inv_a
+end
+
 end # module OceananigansCUDAExt
```
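The cubic refinement used above can be illustrated outside CUDA. The following Python sketch (hypothetical, not part of the PR) starts from a deliberately crude reciprocal approximation and applies one cubic Newton step, which roughly triples the number of correct digits, just as the `Float64` method recovers the 32 mantissa bits the `rcp.approx` instruction discards:

```python
def refine_reciprocal(a, x0):
    # One cubic Newton step for 1/a: with e = 1 - a*x0,
    # x1 = x0 * (1 + e + e^2), so the residual becomes e^3.
    # (The CUDA code uses fma for these operations to avoid rounding
    # between multiply and add; plain arithmetic suffices for the idea.)
    e = 1.0 - a * x0
    e = e + e * e
    return x0 + e * x0

# Start from a crude approximation of 1/3 with ~2 correct digits
a = 3.0
x0 = 0.33
x1 = refine_reciprocal(a, x0)
# The error drops from roughly 3e-3 to roughly 3e-7 in a single step
```

Since the residual after one step is `e^3`, an initial approximation good to about 10 bits (as `rcp.approx` provides for the upper mantissa half) yields roughly 30 correct bits after refinement, which is why a single iteration is enough here.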
A new file adds `materialize_advection` (`@@ -0,0 +1,64 @@`):

```julia
import Oceananigans: architecture

"""
    materialize_advection(advection, grid)

Return a fully materialized advection scheme appropriate for `grid`.
This function allows advection schemes to defer specializing settings until
additional information about the backend is available from the grid.

For example, it allows setting per-backend defaults for the WENO weight
computation setting.
"""
materialize_advection(advection, grid) = advection
materialize_advection(::Nothing, grid) = nothing
materialize_advection(advection::FluxFormAdvection, grid) = FluxFormAdvection(
    materialize_advection(advection.x, grid),
    materialize_advection(advection.y, grid),
    materialize_advection(advection.z, grid),
)

# Upwinding treatments hold a cross_scheme that may contain deferred WENO weight computation
materialize_advection(u::OnlySelfUpwinding, grid) =
    OnlySelfUpwinding(materialize_advection(u.cross_scheme, grid),
                      u.δU_stencil, u.δV_stencil, u.δu²_stencil, u.δv²_stencil)

materialize_advection(u::CrossAndSelfUpwinding, grid) =
    CrossAndSelfUpwinding(materialize_advection(u.cross_scheme, grid),
                          u.divergence_stencil, u.δu²_stencil, u.δv²_stencil)

materialize_advection(u::VelocityUpwinding, grid) =
    VelocityUpwinding(materialize_advection(u.cross_scheme, grid))

# VectorInvariant wraps multiple sub-schemes; recurse into each
materialize_advection(vi::VectorInvariant{N,FT,M}, grid) where {N,FT,M} =
    VectorInvariant{N,FT,M}(
        materialize_advection(vi.vorticity_scheme, grid),
        vi.vorticity_stencil,
        materialize_advection(vi.vertical_advection_scheme, grid),
        materialize_advection(vi.kinetic_energy_gradient_scheme, grid),
        materialize_advection(vi.divergence_scheme, grid),
        materialize_advection(vi.upwinding, grid),
    )

materialize_advection(weno::WENO{N,FT,WCT}, grid) where {N,FT,WCT} = WENO{N,FT,WCT}(
    weno.bounds,
    materialize_advection(weno.buffer_scheme, grid),
    materialize_advection(weno.advecting_velocity_scheme, grid),
)

materialize_advection(weno::WENO{N,FT,Nothing}, grid) where {N,FT} =
    WENO{N,FT,default_weno_weight_computation(architecture(grid))}(
        weno.bounds,
        materialize_advection(weno.buffer_scheme, grid),
        materialize_advection(weno.advecting_velocity_scheme, grid),
    )

materialize_advection(scheme::UpwindBiased{N,FT}, grid) where {N,FT} = UpwindBiased{N,FT}(
    materialize_advection(scheme.buffer_scheme, grid),
    materialize_advection(scheme.advecting_velocity_scheme, grid),
)

materialize_advection(scheme::Centered{N,FT}, grid) where {N,FT} =
    Centered{N,FT}(materialize_advection(scheme.buffer_scheme, grid))
```
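The deferral pattern in `materialize_advection` can be sketched in a few lines. This Python analogue is purely illustrative (the class and function names are hypothetical, not the Oceananigans API): a scheme built with a `None` weight computation gets an architecture-dependent default when materialized against a grid, while an explicit user choice is preserved:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class WENOSketch:
    order: int = 5
    weight_computation: object = None  # None means "decide at materialization"

def default_weight_computation(arch):
    # Per-backend default; specific backends could override this choice
    return "BackendOptimizedDivision" if arch == "GPU" else "ExactDivision"

def materialize_advection(scheme, grid):
    # Fill in the deferred setting using information only the grid knows
    if scheme.weight_computation is None:
        return replace(scheme,
                       weight_computation=default_weight_computation(grid["arch"]))
    return scheme  # an explicit user choice is left untouched

gpu_grid = {"arch": "GPU"}
materialized = materialize_advection(WENOSketch(), gpu_grid)
explicit = materialize_advection(WENOSketch(weight_computation="ExactDivision"), gpu_grid)
```

In the Julia code the same split is done by dispatch: the `WENO{N,FT,Nothing}` method assigns the backend default, while the generic `WENO{N,FT,WCT}` method recurses without changing the already-chosen weight computation.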
The WENO scheme definition (diff):

```diff
@@ -4,20 +4,20 @@ import Oceananigans
 #####
 ##### Weighted Essentially Non-Oscillatory (WENO) advection scheme
 #####

-struct WENO{N, FT, FT2, PP, CA, SI} <: AbstractUpwindBiasedAdvectionScheme{N, FT}
+struct WENO{N, FT, WCT, PP, CA, SI} <: AbstractUpwindBiasedAdvectionScheme{N, FT}
     bounds :: PP
     buffer_scheme :: CA
     advecting_velocity_scheme :: SI

-    function WENO{N, FT, FT2}(bounds::PP, buffer_scheme::CA,
-                              advecting_velocity_scheme :: SI) where {N, FT, FT2, PP, CA, SI}
+    function WENO{N, FT, WCT}(bounds::PP, buffer_scheme::CA,
+                              advecting_velocity_scheme :: SI) where {N, FT, WCT, PP, CA, SI}

-        return new{N, FT, FT2, PP, CA, SI}(bounds, buffer_scheme, advecting_velocity_scheme)
+        return new{N, FT, WCT, PP, CA, SI}(bounds, buffer_scheme, advecting_velocity_scheme)
     end
 end

 """
-    WENO([FT=Float64, FT2=Float32;]
+    WENO([FT=Float64;]
          order = 5,
          bounds = nothing,
          minimum_buffer_upwind_order = 3)

@@ -28,11 +28,12 @@ Arguments
 =========

 - `FT`: The floating point type used in the scheme. Default: `Oceananigans.defaults.FloatType`
-- `FT2`: The floating point type used in some performance-critical parts of the scheme. Default: `Float32`

 Keyword arguments
 =================

+- `weight_computation`: The type of approximate division to use when computing WENO weights.
+                        Default: `Nothing` (deferred; an architecture-dependent default is
+                        assigned in `materialize_advection`)
 - `order`: The order of the WENO advection scheme. Default: 5
 - `bounds` (experimental): Whether to use bounds-preserving WENO, which produces a reconstruction
                            that attempts to restrict a quantity to lie between a `bounds` tuple.

@@ -51,8 +52,8 @@ To build the default 5th-order scheme:
 julia> using Oceananigans

 julia> WENO()
-WENO{3, Float64, Float32}(order=5)
-├── buffer_scheme: WENO{2, Float64, Float32}(order=3)
+WENO{3, Float64, Nothing}(order=5)
+├── buffer_scheme: WENO{2, Float64, Nothing}(order=3)
 │   └── buffer_scheme: Centered(order=2)
 └── advecting_velocity_scheme: Centered(order=4)
 ```

@@ -62,10 +63,10 @@ yet minimally-dissipative advection scheme):

 ```jldoctest weno
 julia> WENO(order=9)
-WENO{5, Float64, Float32}(order=9)
-├── buffer_scheme: WENO{4, Float64, Float32}(order=7)
-│   └── buffer_scheme: WENO{3, Float64, Float32}(order=5)
-│       └── buffer_scheme: WENO{2, Float64, Float32}(order=3)
+WENO{5, Float64, Nothing}(order=9)
+├── buffer_scheme: WENO{4, Float64, Nothing}(order=7)
+│   └── buffer_scheme: WENO{3, Float64, Nothing}(order=5)
+│       └── buffer_scheme: WENO{2, Float64, Nothing}(order=3)
 │           └── buffer_scheme: Centered(order=2)
 └── advecting_velocity_scheme: Centered(order=8)
 ```

@@ -75,24 +76,34 @@ which uses `Centered(order=2)` as the innermost buffer scheme:

 ```jldoctest weno
 julia> WENO(order=9, minimum_buffer_upwind_order=5)
-WENO{5, Float64, Float32}(order=9)
-├── buffer_scheme: WENO{4, Float64, Float32}(order=7)
-│   └── buffer_scheme: WENO{3, Float64, Float32}(order=5)
+WENO{5, Float64, Nothing}(order=9)
+├── buffer_scheme: WENO{4, Float64, Nothing}(order=7)
+│   └── buffer_scheme: WENO{3, Float64, Nothing}(order=5)
 │       └── buffer_scheme: Centered(order=2)
 └── advecting_velocity_scheme: Centered(order=8)
 ```

 ```jldoctest weno
 julia> WENO(order=9, bounds=(0, 1))
-WENO{5, Float64, Float32}(order=9, bounds=(0.0, 1.0))
-├── buffer_scheme: WENO{4, Float64, Float32}(order=7, bounds=(0.0, 1.0))
-│   └── buffer_scheme: WENO{3, Float64, Float32}(order=5, bounds=(0.0, 1.0))
-│       └── buffer_scheme: WENO{2, Float64, Float32}(order=3, bounds=(0.0, 1.0))
+WENO{5, Float64, Nothing}(order=9, bounds=(0.0, 1.0))
+├── buffer_scheme: WENO{4, Float64, Nothing}(order=7, bounds=(0.0, 1.0))
+│   └── buffer_scheme: WENO{3, Float64, Nothing}(order=5, bounds=(0.0, 1.0))
+│       └── buffer_scheme: WENO{2, Float64, Nothing}(order=3, bounds=(0.0, 1.0))
 │           └── buffer_scheme: Centered(order=2)
 └── advecting_velocity_scheme: Centered(order=8)
 ```

+To build a WENO scheme that uses approximate division on a GPU to execute faster:
+
+```jldoctest weno
+julia> WENO(; weight_computation=Oceananigans.Utils.BackendOptimizedDivision)
```
Continuing the WENO scheme diff:

```diff
+WENO{3, Float64, Oceananigans.Utils.BackendOptimizedDivision}(order=5)
+├── buffer_scheme: WENO{2, Float64, Oceananigans.Utils.BackendOptimizedDivision}(order=3)
+│   └── buffer_scheme: Centered(order=2)
+└── advecting_velocity_scheme: Centered(order=4)
+```
 """
-function WENO(FT::DataType=Oceananigans.defaults.FloatType, FT2::DataType=Float32;
+function WENO(FT::DataType=Oceananigans.defaults.FloatType;
+              weight_computation::DataType=Nothing,
               order = 5,
               buffer_scheme = DecreasingOrderAdvectionScheme(),
               bounds = nothing,

@@ -116,20 +127,19 @@
             # At minimum order, switch to Centered scheme
             buffer_scheme = Centered(FT; order=2)
         else
-            buffer_scheme = WENO(FT, FT2; order=order-2, bounds, minimum_buffer_upwind_order)
+            buffer_scheme = WENO(FT; order=order-2, bounds, minimum_buffer_upwind_order, weight_computation)
         end
     end

     N = Int((order + 1) ÷ 2)
-    return WENO{N, FT, FT2}(bounds, buffer_scheme, advecting_velocity_scheme)
+    return WENO{N, FT, weight_computation}(bounds, buffer_scheme, advecting_velocity_scheme)
 end

 weno_order(::WENO{N}) where N = 2N-1
 Base.eltype(::WENO{N, FT}) where {N, FT} = FT
-eltype2(::WENO{N, FT, FT2}) where {N, FT, FT2} = FT2
-Base.summary(a::WENO{N, FT, FT2, Nothing}) where {N, FT, FT2} = string("WENO{$N, $FT, $FT2}(order=", 2N-1, ")")
-Base.summary(a::WENO{N, FT, FT2, PP}) where {N, FT, FT2, PP} = string("WENO{$N, $FT, $FT2}(order=", 2N-1, ", bounds=", string(a.bounds), ")")
+Base.summary(a::WENO{N, FT, WCT, Nothing}) where {N, FT, WCT} = string("WENO{$N, $FT, $WCT}(order=", 2N-1, ")")
+Base.summary(a::WENO{N, FT, WCT, PP}) where {N, FT, WCT, PP} = string("WENO{$N, $FT, $WCT}(order=", 2N-1, ", bounds=", string(a.bounds), ")")

 function Base.show(io::IO, a::WENO)
     print(io, summary(a), '\n')

@@ -145,12 +155,16 @@
     print(io, "└── advecting_velocity_scheme: ", summary(a.advecting_velocity_scheme))
 end

-Adapt.adapt_structure(to, scheme::WENO{N, FT, FT2}) where {N, FT, FT2} =
-    WENO{N, FT, FT2}(Adapt.adapt(to, scheme.bounds),
+Adapt.adapt_structure(to, scheme::WENO{N, FT, WCT}) where {N, FT, WCT} =
+    WENO{N, FT, WCT}(Adapt.adapt(to, scheme.bounds),
                      Adapt.adapt(to, scheme.buffer_scheme),
                      Adapt.adapt(to, scheme.advecting_velocity_scheme))

-on_architecture(to, scheme::WENO{N, FT, FT2}) where {N, FT, FT2} =
-    WENO{N, FT, FT2}(on_architecture(to, scheme.bounds),
+on_architecture(to, scheme::WENO{N, FT, WCT}) where {N, FT, WCT} =
+    WENO{N, FT, WCT}(on_architecture(to, scheme.bounds),
                      on_architecture(to, scheme.buffer_scheme),
                      on_architecture(to, scheme.advecting_velocity_scheme))

+# Select the default WENO weight computation
+# Specific backends may override
+default_weno_weight_computation(arch) = Oceananigans.Utils.BackendOptimizedDivision
```
**Comment:** @wsmoses @avik-pal Reactant tests on CPU are understandably failing with https://buildkite.com/clima/oceananigans/builds/28492#019bc195-ae24-48aa-82c6-1e534be018b3/L709. Is there a way to tell Reactant to ignore this?

**Reply:** Reactant currently faithfully represents the full CUDA kernel. There is no mechanism for determining whether you're in Reactant or not within a kernel; we get the code out of CUDA.jl. We could add a mechanism for doing so here: https://github.com/EnzymeAD/Reactant.jl/blob/730804beb548017e819b3589b9f6f793a4f19792/ext/ReactantCUDAExt.jl#L1332. We'd need to make a new GPUCompiler job/config/state that looks identical to the CUDA one, except with a second method overlay table that itself overlays the CUDA overlay table.

**Reply:** Or, alternatively, in the Reactant extension can we use `Reactant.@reactant_overlay` for these methods?

**Reply:** Yes, but you have to overlay the whole kernel. `@reactant_overlay` does not apply within a CUDA kernel (my comment above was describing what would be required for something within a kernel to be overlaid).

**Reply:** Wouldn't it be sufficient to define overlaid methods for `UT.newton_div` (what's done here for CUDA) for Reactant? Wouldn't that override the overlay here?

**Reply:** CUDA kernels are still compiled with normal CUDA.jl. The Reactant overlay method table only applies to generic code outside of kernels.

**Reply:** Ok ok, got it.

**Reply:** Can't Reactant/Polygeist/whatever raises this to MLIR just handle that intrinsic as fast `1/x`, so that backends can emit fast reciprocals if they want to? (I blame LLVM for having 50 rcp intrinsics that are all different.)