Use specialised implementation of newton_div on CUDA (#5140)
* feat: use fast division on Nvidia GPUs in newton_div context
Adds a specialised implementation of the approximate `newton_div` for
the CUDA backend. It avoids the slow path of `rcp` and `div`
and provides a few per-cent speed-up in advection kernels.
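The idea behind an approximate `newton_div` can be sketched in plain Julia. This is an illustrative, hypothetical sketch of the underlying Newton-Raphson reciprocal refinement, not the actual CUDA implementation, which would start from a hardware reciprocal intrinsic rather than an exact division:

```julia
# Hypothetical sketch of Newton-Raphson reciprocal refinement.
# The CUDA version would seed `r` with an approximate-reciprocal
# intrinsic; here an exact reciprocal stands in for that seed.
function newton_div_sketch(x, y; iterations=2)
    r = one(y) / y            # initial reciprocal estimate
    for _ in 1:iterations
        r = r * (2 - y * r)   # Newton step: drives r toward 1/y
    end
    return x * r              # multiply by the numerator instead of dividing
end
```

Each Newton step roughly doubles the number of correct bits in the reciprocal, which is why one or two iterations on top of a fast hardware estimate are enough.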
* fix: do not specify the type of numerator
Since we just multiply the numerator by the reciprocal of the
denominator, we don't need to know the exact representation of the
numerator. Specifying it can lead to unexpected dispatch to the
fallback method if the numerator is, e.g., the literal π (of type
Irrational).
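The dispatch issue can be illustrated with a hypothetical sketch (the names below are made up for illustration): constraining only the denominator type keeps the fast path reachable when the numerator is an `Irrational` such as π.

```julia
# Hypothetical sketch: specialise on the denominator type only,
# so an Irrational numerator like π still hits the fast path.
fast_div(x, y::Float32) = x * (one(Float32) / y)  # reciprocal-multiply path
fast_div(x, y) = x / y                            # generic fallback

# π promotes to Float32 inside the multiply; no fallback is taken.
fast_div(π, 2.0f0)
```

Had the fast method been written as `fast_div(x::Float32, y::Float32)`, the call `fast_div(π, 2.0f0)` would silently dispatch to the fallback.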
* refactor: select type of newton_div in a WENO scheme by type parameter
* fix: newton_div type propagation into buffer schemes
* test: update doctests
* feat: add CUDA fast division for f32
* refactor: remove lower-precision WENO type parameter
This is a breaking change and requires a minor version bump.
It was made because the second floating-point precision type parameter
is no longer required: it has been replaced with a type-based specifier
for the division type in the WENO reconstruction scheme.
* refactor: use `weight_computation` to refer to division type in WENO
* refactor: make `newton_div` type names less verbose
* test: add unit tests for newton_div
Co-authored-by: Gregory L. Wagner <wagner.greg@gmail.com>
* refactor: fall back from BackendOptimizedDivision to ConvertingDivision on the CPU
Adds a fallback so that `BackendOptimizedDivision` can run on the CPU.
The fallback should be overridden for each device backend in an
appropriate extension module.
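The fallback pattern might look like the following hypothetical sketch (struct and function names here are illustrative, not the package's actual API): the CPU method simply delegates to the converting path, and a backend extension module can later add a specialised method for its own device.

```julia
# Hypothetical sketch of the CPU fallback pattern.
struct BackendOptimizedDivisionSketch end
struct ConvertingDivisionSketch end

# CPU fallback: the "backend optimized" tag delegates to the plain path.
divide(::BackendOptimizedDivisionSketch, x, y) =
    divide(ConvertingDivisionSketch(), x, y)
divide(::ConvertingDivisionSketch, x, y) = x / y

# A CUDA extension module could then override the fast path, e.g.:
# divide(::BackendOptimizedDivisionSketch, x, y::Float64) = x * fast_rcp(y)
```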
* refactor: use BackendOptimizedDivision by default
* refactor: use normal division on the CPU
The latency difference between f32 and f64 division is small, and
probably not enough to justify two conversions and FMAs. (Not
benchmarked for performance; this is conjecture.)
* refactor: do not use CUDA intrinsics under Reactant
The `BackendOptimizedDivision` optimization for `weno_weights_computation`
uses LLVM NVPTX backend intrinsics, which Reactant does not understand,
so the default must be changed when running under Reactant.
* feat: add materialize_advection to defer configuration options
To resolve problems with Reactant not knowing about the NVPTX
intrinsics used in the backend-optimised implementation, we defer the
choice of the default weno_weight_computation option until the backend
the problem will run on is known.
To do that we rely on the `materialize` pattern already used in
similar circumstances.
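A minimal sketch of the deferral, with hypothetical names standing in for the package's types: a backend-dependent option is left as `nothing` at construction time and resolved only when the scheme is materialized on a known architecture.

```julia
# Hypothetical sketch of the materialize pattern for deferred defaults.
struct SchemeSketch{W}
    weight_computation::W
end
SchemeSketch(; weight_computation=nothing) = SchemeSketch(weight_computation)

# The default depends on the architecture the problem runs on.
default_weight_computation(arch) = :converting  # safe generic default

function materialize_sketch(scheme::SchemeSketch, arch)
    wc = something(scheme.weight_computation,
                   default_weight_computation(arch))
    return SchemeSketch(wc)
end
```

A user-supplied value survives materialization unchanged; only an unset (`nothing`) option picks up the architecture-dependent default.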
* feat: use advection materialisation in models
* refactor: get rid of the global weight_computation setting
It is no longer necessary, since the default can be assigned during
materialization of the advection schemes and can depend on the
specific architecture the problem runs on.
To change the default setting, a user just overrides the function
`default_weno_weight_computation(arch)`.
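Per-architecture defaults via method overriding might look like this hypothetical sketch (the architecture types below are illustrative stand-ins, not the package's own):

```julia
# Hypothetical sketch: per-architecture defaults replace a global setting.
abstract type AbstractArchSketch end
struct CPUArchSketch <: AbstractArchSketch end
struct GPUArchSketch <: AbstractArchSketch end

default_weno_weight_computation(::AbstractArchSketch) = :division
default_weno_weight_computation(::GPUArchSketch) = :backend_optimized
```

Adding a method for a concrete architecture type is all that is needed to change the default there, with no mutable global state involved.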
* fix: failing reactant tests
* fix: add missing materialize_advection overloads
* test: fix tests broken by changes to the API
* fix: add missing overload for Distributed grid
* test: add missing materialization step to test_forcing
Move the MockGrid to the include file: multiple test files need it, and
it must be defined exactly once to avoid struct redefinition.
* fix: extra end-of-file newlines
Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>
* test: fix newton_div test
There was a type instability and the test input was being promoted to
Float64. As a result the Float32 path was never verified, and a typo in
the intrinsic name was not caught earlier.
* update minor version (numerical difference)
---------
Co-authored-by: Gregory L. Wagner <wagner.greg@gmail.com>
Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>