-
Notifications
You must be signed in to change notification settings - Fork 275
Use specialised implementation of newton_div on CUDA
#5140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
29 commits
Select commit
Hold shift + click to select a range
bfd435d
feat: use fast division on Nvidia GPUs in newton_div context
Mikolaj-A-Kowalski a2ee0a5
fix: do not specify the type of numerator
Mikolaj-A-Kowalski 0930500
refactor: select type of newton_div in a WENO scheme by type parameter
Mikolaj-A-Kowalski 29bfd5d
fix: newton_div type propagation into buffer schemes
Mikolaj-A-Kowalski d5fe7b8
test: update doctests
Mikolaj-A-Kowalski 2ab9cf5
feat: add CUDA fast division for f32
Mikolaj-A-Kowalski bbc4642
refactor: remove lower-precision WENO type parameter
Mikolaj-A-Kowalski ecf4e8e
refactor: use `weight_computation` to refer to division type in WENO
Mikolaj-A-Kowalski 3cd0d41
refactor: make `newton_div` type names less verbose
Mikolaj-A-Kowalski e351c16
test: add unit tests for newton_div
Mikolaj-A-Kowalski 6a65274
refactor: BackendOptimizedDivision in ConvertingDivision on CPU
Mikolaj-A-Kowalski c0c0ed2
refactor: use BackendOptimizedDivision by default
Mikolaj-A-Kowalski 506fadf
refactor: use normal division on the CPU
Mikolaj-A-Kowalski e025c51
refactor: do not use CUDA intrinsics under Reactant
Mikolaj-A-Kowalski d6588f8
feat: add materialize_advection to defer configuration options
Mikolaj-A-Kowalski c4adca2
feat: use advection materialisation in models
Mikolaj-A-Kowalski 3c8d143
refactor: get rid of the global weight_computation setting
Mikolaj-A-Kowalski f3a7456
fix: failing reactant tests
Mikolaj-A-Kowalski 9fe1731
fix: add missing materialize_advection overloads
Mikolaj-A-Kowalski 3ab7ec9
test: fix tests broken by changes to the API
Mikolaj-A-Kowalski 4e76efa
fix: add missing overload for Distributed grid
Mikolaj-A-Kowalski db30a4c
test: add missing materialization step to test_forcing
Mikolaj-A-Kowalski fed538a
fix: extra end-of-file newlines
Mikolaj-A-Kowalski 329632d
Merge branch 'main' into mak/fast-cuda-div
simone-silvestri d535820
Merge branch 'main' into mak/fast-cuda-div
simone-silvestri 5368248
Merge branch 'main' into mak/fast-cuda-div
glwagner 4f295ea
test: fix newton_div test
Mikolaj-A-Kowalski 3c90552
Merge branch 'main' into mak/fast-cuda-div
simone-silvestri c656070
update minor version (numerical difference)
simone-silvestri File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,64 @@ | ||
| import Oceananigans: architecture | ||
|
|
||
| """ | ||
| materialize_advection(advection, grid) | ||
|
|
||
| Return a fully materialized advection scheme appropriate for `grid`. | ||
| It exists to allow advection schemes to defer specialising settings until | ||
| additional information about the backend from grid is available. | ||
|
|
||
| For example it allows to set per-backend defaults for WENO weight computation | ||
| setting. | ||
| """ | ||
| materialize_advection(advection, grid) = advection | ||
| materialize_advection(::Nothing, grid) = nothing | ||
| materialize_advection(advection::FluxFormAdvection, grid) = FluxFormAdvection( | ||
| materialize_advection(advection.x, grid), | ||
| materialize_advection(advection.y, grid), | ||
| materialize_advection(advection.z, grid), | ||
| ) | ||
|
|
||
| # Upwinding treatments hold a cross_scheme that may contain deferred WENO weight computation | ||
| materialize_advection(u::OnlySelfUpwinding, grid) = | ||
| OnlySelfUpwinding(materialize_advection(u.cross_scheme, grid), | ||
| u.δU_stencil, u.δV_stencil, u.δu²_stencil, u.δv²_stencil) | ||
|
|
||
| materialize_advection(u::CrossAndSelfUpwinding, grid) = | ||
| CrossAndSelfUpwinding(materialize_advection(u.cross_scheme, grid), | ||
| u.divergence_stencil, u.δu²_stencil, u.δv²_stencil) | ||
|
|
||
| materialize_advection(u::VelocityUpwinding, grid) = | ||
| VelocityUpwinding(materialize_advection(u.cross_scheme, grid)) | ||
|
|
||
| # VectorInvariant wraps multiple sub-schemes; recurse into each | ||
| materialize_advection(vi::VectorInvariant{N,FT,M}, grid) where {N,FT,M} = | ||
| VectorInvariant{N,FT,M}( | ||
| materialize_advection(vi.vorticity_scheme, grid), | ||
| vi.vorticity_stencil, | ||
| materialize_advection(vi.vertical_advection_scheme, grid), | ||
| materialize_advection(vi.kinetic_energy_gradient_scheme, grid), | ||
| materialize_advection(vi.divergence_scheme, grid), | ||
| materialize_advection(vi.upwinding, grid), | ||
| ) | ||
|
|
||
|
|
||
| materialize_advection(weno::WENO{N,FT,WCT}, grid) where {N,FT,WCT} = WENO{N,FT,WCT}( | ||
| weno.bounds, | ||
| materialize_advection(weno.buffer_scheme, grid), | ||
| materialize_advection(weno.advecting_velocity_scheme, grid), | ||
| ) | ||
|
|
||
| materialize_advection(weno::WENO{N,FT,Nothing}, grid) where {N,FT} = | ||
| WENO{N,FT,default_weno_weight_computation(architecture(grid))}( | ||
| weno.bounds, | ||
| materialize_advection(weno.buffer_scheme, grid), | ||
| materialize_advection(weno.advecting_velocity_scheme, grid), | ||
| ) | ||
|
|
||
| materialize_advection(scheme::UpwindBiased{N,FT}, grid) where {N,FT} = UpwindBiased{N,FT}( | ||
| materialize_advection(scheme.buffer_scheme, grid), | ||
| materialize_advection(scheme.advecting_velocity_scheme, grid), | ||
| ) | ||
|
|
||
| materialize_advection(scheme::Centered{N,FT}, grid) where {N,FT} = | ||
| Centered{N,FT}(materialize_advection(scheme.buffer_scheme, grid)) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -4,20 +4,20 @@ import Oceananigans | |||||
| ##### Weighted Essentially Non-Oscillatory (WENO) advection scheme | ||||||
| ##### | ||||||
|
|
||||||
| struct WENO{N, FT, FT2, PP, CA, SI} <: AbstractUpwindBiasedAdvectionScheme{N, FT} | ||||||
| struct WENO{N, FT, WCT, PP, CA, SI} <: AbstractUpwindBiasedAdvectionScheme{N, FT} | ||||||
| bounds :: PP | ||||||
| buffer_scheme :: CA | ||||||
| advecting_velocity_scheme :: SI | ||||||
|
|
||||||
| function WENO{N, FT, FT2}(bounds::PP, buffer_scheme::CA, | ||||||
| advecting_velocity_scheme :: SI) where {N, FT, FT2, PP, CA, SI} | ||||||
| function WENO{N, FT, WCT}(bounds::PP, buffer_scheme::CA, | ||||||
| advecting_velocity_scheme :: SI) where {N, FT, WCT, PP, CA, SI} | ||||||
|
|
||||||
| return new{N, FT, FT2, PP, CA, SI}(bounds, buffer_scheme, advecting_velocity_scheme) | ||||||
| return new{N, FT, WCT, PP, CA, SI}(bounds, buffer_scheme, advecting_velocity_scheme) | ||||||
| end | ||||||
| end | ||||||
|
|
||||||
| """ | ||||||
| WENO([FT=Float64, FT2=Float32;] | ||||||
| WENO([FT=Float64;] | ||||||
| order = 5, | ||||||
| bounds = nothing, | ||||||
| minimum_buffer_upwind_order = 3) | ||||||
|
|
@@ -28,11 +28,12 @@ Arguments | |||||
| ========= | ||||||
|
|
||||||
| - `FT`: The floating point type used in the scheme. Default: `Oceananigans.defaults.FloatType` | ||||||
| - `FT2`: The floating point type used in some performance-critical parts of the scheme. Default: `Float32` | ||||||
|
|
||||||
| Keyword arguments | ||||||
| ================= | ||||||
|
|
||||||
| - `weight_computation`: The type of approximate division to used when computing WENO weights. | ||||||
| Default: `Nothing` (deferred; a architecture-dependent default is assigned in | ||||||
| `materialize_advection`) | ||||||
| - `order`: The order of the WENO advection scheme. Default: 5 | ||||||
| - `bounds` (experimental): Whether to use bounds-preserving WENO, which produces a reconstruction | ||||||
| that attempts to restrict a quantity to lie between a `bounds` tuple. | ||||||
|
|
@@ -51,8 +52,8 @@ To build the default 5th-order scheme: | |||||
| julia> using Oceananigans | ||||||
|
|
||||||
| julia> WENO() | ||||||
| WENO{3, Float64, Float32}(order=5) | ||||||
| ├── buffer_scheme: WENO{2, Float64, Float32}(order=3) | ||||||
| WENO{3, Float64, Nothing}(order=5) | ||||||
| ├── buffer_scheme: WENO{2, Float64, Nothing}(order=3) | ||||||
| │ └── buffer_scheme: Centered(order=2) | ||||||
| └── advecting_velocity_scheme: Centered(order=4) | ||||||
| ``` | ||||||
|
|
@@ -62,10 +63,10 @@ yet minimally-dissipative advection scheme): | |||||
|
|
||||||
| ```jldoctest weno | ||||||
| julia> WENO(order=9) | ||||||
| WENO{5, Float64, Float32}(order=9) | ||||||
| ├── buffer_scheme: WENO{4, Float64, Float32}(order=7) | ||||||
| │ └── buffer_scheme: WENO{3, Float64, Float32}(order=5) | ||||||
| │ └── buffer_scheme: WENO{2, Float64, Float32}(order=3) | ||||||
| WENO{5, Float64, Nothing}(order=9) | ||||||
| ├── buffer_scheme: WENO{4, Float64, Nothing}(order=7) | ||||||
| │ └── buffer_scheme: WENO{3, Float64, Nothing}(order=5) | ||||||
| │ └── buffer_scheme: WENO{2, Float64, Nothing}(order=3) | ||||||
| │ └── buffer_scheme: Centered(order=2) | ||||||
| └── advecting_velocity_scheme: Centered(order=8) | ||||||
| ``` | ||||||
|
|
@@ -75,24 +76,34 @@ which uses `Centered(order=2)` as the innermost buffer scheme: | |||||
|
|
||||||
| ```jldoctest weno | ||||||
| julia> WENO(order=9, minimum_buffer_upwind_order=5) | ||||||
| WENO{5, Float64, Float32}(order=9) | ||||||
| ├── buffer_scheme: WENO{4, Float64, Float32}(order=7) | ||||||
| │ └── buffer_scheme: WENO{3, Float64, Float32}(order=5) | ||||||
| WENO{5, Float64, Nothing}(order=9) | ||||||
| ├── buffer_scheme: WENO{4, Float64, Nothing}(order=7) | ||||||
| │ └── buffer_scheme: WENO{3, Float64, Nothing}(order=5) | ||||||
| │ └── buffer_scheme: Centered(order=2) | ||||||
| └── advecting_velocity_scheme: Centered(order=8) | ||||||
| ``` | ||||||
|
|
||||||
| ```jldoctest weno | ||||||
| julia> WENO(order=9, bounds=(0, 1)) | ||||||
| WENO{5, Float64, Float32}(order=9, bounds=(0.0, 1.0)) | ||||||
| ├── buffer_scheme: WENO{4, Float64, Float32}(order=7, bounds=(0.0, 1.0)) | ||||||
| │ └── buffer_scheme: WENO{3, Float64, Float32}(order=5, bounds=(0.0, 1.0)) | ||||||
| │ └── buffer_scheme: WENO{2, Float64, Float32}(order=3, bounds=(0.0, 1.0)) | ||||||
| WENO{5, Float64, Nothing}(order=9, bounds=(0.0, 1.0)) | ||||||
| ├── buffer_scheme: WENO{4, Float64, Nothing}(order=7, bounds=(0.0, 1.0)) | ||||||
| │ └── buffer_scheme: WENO{3, Float64, Nothing}(order=5, bounds=(0.0, 1.0)) | ||||||
| │ └── buffer_scheme: WENO{2, Float64, Nothing}(order=3, bounds=(0.0, 1.0)) | ||||||
| │ └── buffer_scheme: Centered(order=2) | ||||||
| └── advecting_velocity_scheme: Centered(order=8) | ||||||
| ``` | ||||||
|
|
||||||
| To build a WENO scheme that uses approximate division on a GPU to execute faster: | ||||||
| ```jldoctest weno | ||||||
| julia> WENO(;weight_computation=Oceananigans.Utils.BackendOptimizedDivision) | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
| WENO{3, Float64, Oceananigans.Utils.BackendOptimizedDivision}(order=5) | ||||||
| ├── buffer_scheme: WENO{2, Float64, Oceananigans.Utils.BackendOptimizedDivision}(order=3) | ||||||
| │ └── buffer_scheme: Centered(order=2) | ||||||
| └── advecting_velocity_scheme: Centered(order=4) | ||||||
| ``` | ||||||
| """ | ||||||
| function WENO(FT::DataType=Oceananigans.defaults.FloatType, FT2::DataType=Float32; | ||||||
| function WENO(FT::DataType=Oceananigans.defaults.FloatType; | ||||||
| weight_computation::DataType=Nothing, | ||||||
| order = 5, | ||||||
| buffer_scheme = DecreasingOrderAdvectionScheme(), | ||||||
| bounds = nothing, | ||||||
|
|
@@ -116,20 +127,19 @@ function WENO(FT::DataType=Oceananigans.defaults.FloatType, FT2::DataType=Float3 | |||||
| # At minimum order, switch to Centered scheme | ||||||
| buffer_scheme = Centered(FT; order=2) | ||||||
| else | ||||||
| buffer_scheme = WENO(FT, FT2; order=order-2, bounds, minimum_buffer_upwind_order) | ||||||
| buffer_scheme = WENO(FT; order=order-2, bounds, minimum_buffer_upwind_order, weight_computation) | ||||||
| end | ||||||
| end | ||||||
|
|
||||||
| N = Int((order + 1) ÷ 2) | ||||||
| return WENO{N, FT, FT2}(bounds, buffer_scheme, advecting_velocity_scheme) | ||||||
| return WENO{N, FT, weight_computation}(bounds, buffer_scheme, advecting_velocity_scheme) | ||||||
| end | ||||||
| end | ||||||
|
|
||||||
| weno_order(::WENO{N}) where N = 2N-1 | ||||||
| Base.eltype(::WENO{N, FT}) where {N, FT} = FT | ||||||
| eltype2(::WENO{N, FT, FT2}) where {N, FT, FT2} = FT2 | ||||||
| Base.summary(a::WENO{N, FT, FT2, Nothing}) where {N, FT, FT2} = string("WENO{$N, $FT, $FT2}(order=", 2N-1, ")") | ||||||
| Base.summary(a::WENO{N, FT, FT2, PP}) where {N, FT, FT2, PP} = string("WENO{$N, $FT, $FT2}(order=", 2N-1, ", bounds=", string(a.bounds), ")") | ||||||
| Base.summary(a::WENO{N, FT, WCT, Nothing}) where {N, FT, WCT} = string("WENO{$N, $FT, $WCT}(order=", 2N-1, ")") | ||||||
| Base.summary(a::WENO{N, FT, WCT, PP}) where {N, FT, WCT, PP} = string("WENO{$N, $FT, $WCT}(order=", 2N-1, ", bounds=", string(a.bounds), ")") | ||||||
|
|
||||||
| function Base.show(io::IO, a::WENO) | ||||||
| print(io, summary(a), '\n') | ||||||
|
|
@@ -145,12 +155,16 @@ function Base.show(io::IO, a::WENO) | |||||
| print(io, "└── advecting_velocity_scheme: ", summary(a.advecting_velocity_scheme)) | ||||||
| end | ||||||
|
|
||||||
| Adapt.adapt_structure(to, scheme::WENO{N, FT, FT2}) where {N, FT, FT2} = | ||||||
| WENO{N, FT, FT2}(Adapt.adapt(to, scheme.bounds), | ||||||
| Adapt.adapt_structure(to, scheme::WENO{N, FT, WCT}) where {N, FT, WCT} = | ||||||
| WENO{N, FT, WCT}(Adapt.adapt(to, scheme.bounds), | ||||||
| Adapt.adapt(to, scheme.buffer_scheme), | ||||||
| Adapt.adapt(to, scheme.advecting_velocity_scheme)) | ||||||
|
|
||||||
| on_architecture(to, scheme::WENO{N, FT, FT2}) where {N, FT, FT2} = | ||||||
| WENO{N, FT, FT2}(on_architecture(to, scheme.bounds), | ||||||
| on_architecture(to, scheme::WENO{N, FT, WCT}) where {N, FT, WCT} = | ||||||
| WENO{N, FT, WCT}(on_architecture(to, scheme.bounds), | ||||||
| on_architecture(to, scheme.buffer_scheme), | ||||||
| on_architecture(to, scheme.advecting_velocity_scheme)) | ||||||
|
|
||||||
| # Select the default WENO weight computation | ||||||
| # Specific backends may override | ||||||
| default_weno_weight_computation(arch) = Oceananigans.Utils.BackendOptimizedDivision | ||||||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wsmoses @avik-pal Reactant tests on CPU are understandably failing with https://buildkite.com/clima/oceananigans/builds/28492#019bc195-ae24-48aa-82c6-1e534be018b3/L709
Is there a way to tell Reactant to ignore this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so reactant currently faithfully fully represents the cuda kernel. There is no mechanism for determining if you're in reactant or not within a kernel -- we get the code out of cuda.jl. We could add a mechanism for doing so here: https://github.com/EnzymeAD/Reactant.jl/blob/730804beb548017e819b3589b9f6f793a4f19792/ext/ReactantCUDAExt.jl#L1332. We'd need to make a new gpucompiler job/config/state that looks identical to the cuda one, except with a second method overlay table that itself overlays the cuda overlay table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, alternatively, in the Reactant extension can we use
Reactant.@reactant_overlayfor these methods?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes but you have to overlay the whole kernel. reactant_overlay does not apply within a cuda kernel [my comment above was describing what would be required for something within kerenl to be overlayed]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be sufficient to define overlaid methods for UT.newton_div (what's done here for CUDA) for Reactant? Wouldn't that override the overlay here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cuda kernels are still compiled with normal cuda.jl. the reactant overlay method table only applies for generic code outside of kernels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok ok, got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't Reactant/polygeist/whatever raises this to MLIR just handle that intrinsic as
fast 1/xand then backends can emit fast reciprocals if they want to. (I blame LLVM for having 50 rcp intrinsics that are all different)