Minimal GPU example which preserves the element routine #1291

termi-official wants to merge 45 commits into master from
Conversation
Codecov Report ❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##           master    #1291      +/-   ##
==========================================
- Coverage   94.45%   89.08%   -5.38%
==========================================
  Files          40       47       +7
  Lines        6905     7410     +505
==========================================
+ Hits         6522     6601      +79
- Misses        383      809     +426
In `FunctionValues`:

```diff
- struct FunctionValues{DiffOrder, IP, N_t, dNdx_t, dNdξ_t, d2Ndx2_t, d2Ndξ2_t} <: AbstractValues
+ struct FunctionValues{DiffOrder, IP, Nx_t, Nξ_t, dNdx_t, dNdξ_t, d2Ndx2_t, d2Ndξ2_t} <: AbstractValues
```
This is the only intrusive part of this PR. I have decided to allow the arrays to be stored here directly, to not add any helper structs just mirroring the fields as arrays. Although one should note that passing an array here breaks almost all calls on the struct.
We can also explore the path of having helper structs and introduce some syntactic sugar to query the parts via e.g. fv[i] instead of get_substruct.
… integrated as a full test. I will split it up later.
KnutAM
left a comment
This is really cool and I'm really excited for this feature!
I've now focused on the how-to to get a bit of understanding of how this would look from a user perspective. So on the user-facing side we are introducing these APIs?
1. `adapt` dispatches for `DofHandler` and `ConstraintHandler`
2. Support for allocation and assembly of `CuSparseMatrixCSC`
3. `CellCacheContainer`, `CellValuesContainer`, and `FacetValuesContainer` (assembler?)

(Where my suggestion is to replace nr 3 with `distribute_to_workers` or similar)

Did I miss any new API?
```julia
cv_gpu = CellValuesContainer(backend, n_workers, cv)
cc_gpu = CellCacheContainer(backend, n_workers, dh_gpu)
```
As discussed on Slack, I don't think we should add all these different versions/types as user-facing, but provide an API that can be overloaded with the specific type, which returns an AbstractArray where the element type is similar to the type of the argument. Putting it here for others to comment on.
```diff
- cv_gpu = CellValuesContainer(backend, n_workers, cv)
- cc_gpu = CellCacheContainer(backend, n_workers, dh_gpu)
+ cv_gpu = distribute_to_workers(backend, n_workers, cv)
+ cc_gpu = distribute_to_workers(backend, n_workers, CellCache(dh_gpu))
```
"""
distribute_to_workers([backend], n_workers, x::Tx)
Create an `AbstractArray{T}` with length `n_workers` where each item can be mutated without race conditions between different workers / tasks. While `T == Tx` is not guaranteed, instances of `T` will typically support the same functionality as the original `x`.
Ferrite provides support for the following `x`s:
* [`CellCache`](@ref)
* [`CellValues`](@ref)
* [`FacetValues`](@ref)
* `AbstractAssembler` (I assume it will be needed due to the `#FIXME buffers` in the code; also no ref, since there are no docs for this one I think?)
If no `backend` is provided, `T == Tx`, which is suitable for standard multithreading. Distributing for GPUs requires loading the relevant GPU package that provides a `backend`. Currently, the following backends are implemented:
* `CUDABackend()` (Via the `CUDA.jl` extension)
"""
function distribute_to_workers end

xref: #1070 (CC @fredrikekre) for the non-GPU case, which will be solved by this. Of course there is the same issue as there with the name; some alternatives to the above could be `distribute` or `distribute_to_tasks`.
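As a sanity check of the proposed semantics, here is a minimal CPU-only sketch of the no-backend case (an assumption based on the docstring above, not the actual implementation): each worker simply receives its own deep copy, so `T == Tx` holds and mutation is race-free.

```julia
# Hypothetical default for the no-backend (standard multithreading) case:
# every worker gets an independent deep copy of x, so workers can mutate
# their item without racing each other.
distribute_to_workers(n_workers::Int, x) = [deepcopy(x) for _ in 1:n_workers]

buffers = distribute_to_workers(4, zeros(3))
buffers[1][1] = 1.0  # worker 1 mutates only its own copy
```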
I am still not sure about this design, as I also started with it, but it just didn't turn out nicely so far.
For more context:
1. `cc_gpu = distribute_to_workers(backend, n_workers, CellCache(dh_gpu))` does not work so easily. First, the `Vector` types are hardcoded, so the `CellCache(gpu_sdh)` constructor would already have to return something that is not a `CellCache`, or we transfer the data back and forth, or the `CellCache` is just some other placeholder type.
2. If we widen the type here, it is still mutable, and we cannot make the struct immutable without breaking public API (`reinit!` in `CellCache`).
3. I cannot guarantee this signature will work for all that we need. How would you, for example, handle the assembler, as the existing ones are also not GPU compatible? And if we make it GPU compatible, why would I not directly construct them with the correct buffer sizes?
In general, the way I read the tutorial, the approach for many objects is (1) set up as usual (e.g. Grid, DofHandler, ConstraintHandler, etc.) and (2) adapt to the GPU device. So having the distribute-to-workers interface is doing exactly that. Abstracting this away seems more attractive to me than the alternative of having a new public type for each item (CellValues, FacetValues, MultiFieldCellValues, CellCache, FacetCache, CSCAssembler, CSRAssembler), potentially different between backends, with respect to what we define and the number of objects the user has to deal with.
- ... or we transfer the data back and forth, or the CellCache is just some other placeholder type.
I think this transfer is OK, since it is only during setup. And this follows the same logic of adapt in my view.
- I cannot guarantee this signature will work for all that we need. How would you, for example, handle the assembler, as the existing ones are also not GPU compatible?
The new assembler should be documented as thread-safe (as long as coloring is used), so no need to distribute it to workers, can be used by all workers.
- ... And if we make it GPU compatible, why would I not directly construct them with the correct buffer sizes?
I would argue for the simplicity of the user interface, same pattern everywhere: Create as standard, adapt to GPU backend. Items with buffers/cache that are modified must be distributed to workers, just as with multithreading, and we provide a unified infrastructure to do this via distribute_to_workers.
Thanks for the review Knut! Let me answer your questions in order first before I get to the comments.
From the user-facing API that's it. Then there is right now the internal API with
Yes, I think that makes perfect sense! (Just API in the sense that the Ferrite dispatches is part of the documented API)
Aha, so in that case shouldn't we have the same logic here as for the cell values and cache? (in my notation
Yes, but with the proposed

```julia
struct DistributedVals{CV, CV_internal} <: AbstractVector{CV}
    num::Int
    cv::CV_internal
    function DistributedVals{CV}(cv_internal::CV_internal, num::Int) where {CV, CV_internal}
        return new{CV, CV_internal}(cv_internal, num)
    end
end

Base.size(d::DistributedVals) = (d.num,) # required for the AbstractVector interface

function distribute_to_workers(backend, n_workers, cv::CellValues)
    cva = as_structure_of_arrays(backend, n_workers, cv)
    CV = typeof(get_substruct(cva, 1))
    return DistributedVals{CV}(cva, n_workers)
end

function Base.getindex(d::DistributedVals{CV}, i) where {CV}
    checkbounds(1:d.num, i)
    return get_substruct(d.cv, i)::CV
end
```

The reason I'm against exposing
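To make the `DistributedVals` idea above concrete, here is a toy, self-contained version with a plain matrix as the structure-of-arrays backing store. The `ToySoA` type and both helper methods are stand-ins for the real `as_structure_of_arrays` / `get_substruct`, not Ferrite code.

```julia
# Toy backing store: one matrix, row i belongs to worker i.
struct ToySoA
    data::Matrix{Float64}
end
as_structure_of_arrays(n_workers::Int, len::Int) = ToySoA(zeros(n_workers, len))
get_substruct(soa::ToySoA, i::Int) = view(soa.data, i, :)

# The wrapper sketched in the comment above: the element type CV is fixed
# at construction so that indexing is type-stable.
struct DistributedVals{CV, CVI} <: AbstractVector{CV}
    num::Int
    cv::CVI
end
Base.size(d::DistributedVals) = (d.num,)
function Base.getindex(d::DistributedVals{CV}, i::Int) where {CV}
    checkbounds(1:d.num, i)
    return get_substruct(d.cv, i)::CV
end

soa = as_structure_of_arrays(4, 3)
d = DistributedVals{typeof(get_substruct(soa, 1)), ToySoA}(4, soa)
d[2] .= 1.0  # worker 2 mutates only its own row
```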
```julia
# NOTE CellCache is mutable and hence inherently incompatible with GPU. So here is the
# immutable variant. Making the CellCache immutable is considered breaking due to the reinit! API integration.
struct ImmutableCellCache{G <: AbstractGrid, SDH, IVT, VX}
    flags::UpdateFlags
    grid::G
    cellid::Int
    nodes::IVT
    coords::VX
    sdh::SDH
    dofs::IVT
end
```
NOTE CellCache is mutable and hence inherently incompatible with GPU. So here is the immutable variant. Making the CellCache immutable is considered breaking due to the reinit! API integration.
I think we can, by changing (lines 35 to 46 in 34c457e):
```julia
mutable struct Mutable{T}
    v::T
end

struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}, CID, NodeV <: AbstractVector{<:Integer}, CoordV <: AbstractVector{X}, DofV <: AbstractVector{<:Integer}}
    flags::UpdateFlags
    grid::G
    # Pretty useless to store this since you have it already for the reinit! call, but
    # needed for the CellIterator(...) workflow since the user doesn't necessarily control
    # the loop order in the cell subset.
    cellid::CID
    nodes::NodeV
    coords::CoordV
    dh::DH
    dofs::DofV
end
```

where `CID` normally is a `Mutable{Int}`, but for the GPU this can be `adapt`:ed to a view into a `CuVector`?
This basically re-introduces the old design, which I think makes sense here instead of duplicating?
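A minimal sketch of the `Mutable{T}` idea (the `Cache` / `set_id!` / `get_id` names are illustrative, not from the PR): the outer struct stays immutable, and only the type of the id field decides between CPU-style and device-style storage.

```julia
# One-field mutable wrapper: on the CPU the cell id lives here; on the GPU
# the same slot can instead hold a length-1 view of a device array.
mutable struct Mutable{T}
    v::T
end

struct Cache{CID}
    cellid::CID
end

set_id!(c::Cache{<:Mutable}, i::Int) = (c.cellid.v = i; c)          # CPU path
set_id!(c::Cache{<:AbstractVector}, i::Int) = (c.cellid[1] = i; c)  # view path
get_id(c::Cache{<:Mutable}) = c.cellid.v
get_id(c::Cache{<:AbstractVector}) = c.cellid[1]

cpu = set_id!(Cache(Mutable(-1)), 5)
store = zeros(Int, 1)                     # stand-in for a device array
dev = set_id!(Cache(view(store, 1:1)), 7) # writes through to the backing store
```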
I would go for the duplication here. Having everything funneled into a single data structure seems to be quite painful to maintain. My working patch here is
diff --git a/ext/FerriteKAExt/adapt_core.jl b/ext/FerriteKAExt/adapt_core.jl
index cbccc539a..969a9ed75 100644
--- a/ext/FerriteKAExt/adapt_core.jl
+++ b/ext/FerriteKAExt/adapt_core.jl
@@ -1,6 +1,7 @@
# This file contains adapt rules for all relevant data structures in Ferrite.jl which do
# not need customized GPU data structures.
+Adapt.@adapt_structure CellCache
Adapt.@adapt_structure CellValues
Adapt.@adapt_structure Ferrite.GeometryMapping
Adapt.@adapt_structure Ferrite.FunctionValues
diff --git a/ext/FerriteKAExt/iterator.jl b/ext/FerriteKAExt/iterator.jl
index 3f89ecc4a..d20f7f00f 100644
--- a/ext/FerriteKAExt/iterator.jl
+++ b/ext/FerriteKAExt/iterator.jl
@@ -34,9 +34,23 @@ function as_structure_of_arrays(backend, outer_dim, ::Type{CellCache}, sdh::Devi
N = Ferrite.nnodes_per_cell(sdh)
nodes = KA.zeros(backend, Int, outer_dim, N)
coords = KA.zeros(backend, Vec{dim, get_coordinate_eltype(grid)}, outer_dim, N)
+ cellids = KA.zeros(backend, Int, outer_dim, 1)
dofs = KA.zeros(backend, Int, outer_dim, n)
end
- return ImmutableCellCache(flags, grid, -1, nodes, coords, sdh, dofs)
+ return CellCache(flags, grid, cellids, nodes, coords, sdh, dofs)
+end
+function get_substruct(i, cc::CellCache, cellid)
+ return CellCache(
+ cc.flags, cc.grid, view(cc.cellid, i, :),
+ view(cc.nodes, i, :), view(cc.coords, i, :), cc.dh, view(cc.dofs, i, :)
+ )
+end
+function Ferrite.reinit!(cc::CellCache{<:Any,<:DeviceGrid}, cellid::Int)
+ cc.cellid[1] = cellid # TODO remove this in future versions
+ cc.flags.nodes && Ferrite.cellnodes!(cc.nodes, cc.grid, cellid)
+ cc.flags.coords && Ferrite.getcoordinates!(cc.coords, cc.grid, cellid)
+ cc.dh !== nothing && cc.flags.dofs && Ferrite.celldofs!(cc.dofs, cc.dh, cellid)
+ return cc
end
function Ferrite.CellCache(backend, dh::HostDofHandler{dim}, flags::UpdateFlags = UpdateFlags()) where {dim}
diff --git a/src/iterators.jl b/src/iterators.jl
index f432455fa..7eb3f244a 100644
--- a/src/iterators.jl
+++ b/src/iterators.jl
@@ -32,24 +32,24 @@ cell. The cache is updated for a new cell by calling `reinit!(cache, cellid)` wh
See also [`CellIterator`](@ref).
"""
-mutable struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}}
- const flags::UpdateFlags
- const grid::G
+struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}, CIDType, IVType, CVType <: AbstractArray{X}}
+ flags::UpdateFlags
+ grid::G
# Pretty useless to store this since you have it already for the reinit! call, but
# needed for the CellIterator(...) workflow since the user doesn't necessarily control
# the loop order in the cell subset.
- cellid::Int
- const nodes::Vector{Int}
- const coords::Vector{X}
- const dh::DH
- const dofs::Vector{Int}
+ cellid::CIDType
+ nodes::IVType
+ coords::CVType
+ dh::DH
+ dofs::IVType
end
function CellCache(grid::Grid{dim, C, T}, flags::UpdateFlags = UpdateFlags()) where {dim, C, T}
N = nnodes_per_cell(grid, 1) # nodes and coords will be resized in `reinit!`
nodes = zeros(Int, N)
coords = zeros(Vec{dim, T}, N)
- return CellCache(flags, grid, -1, nodes, coords, nothing, Int[])
+ return CellCache(flags, grid, [-1], nodes, coords, nothing, Int[])
end
function CellCache(dh::DofHandler{dim}, flags::UpdateFlags = UpdateFlags()) where {dim}
@@ -58,7 +58,7 @@ function CellCache(dh::DofHandler{dim}, flags::UpdateFlags = UpdateFlags()) wher
nodes = zeros(Int, N)
coords = zeros(Vec{dim, get_coordinate_eltype(get_grid(dh))}, N)
celldofs = zeros(Int, n)
- return CellCache(flags, get_grid(dh), -1, nodes, coords, dh, celldofs)
+ return CellCache(flags, get_grid(dh), [-1], nodes, coords, dh, celldofs)
end
function CellCache(sdh::SubDofHandler, flags::UpdateFlags = UpdateFlags())
@@ -67,7 +67,7 @@ function CellCache(sdh::SubDofHandler, flags::UpdateFlags = UpdateFlags())
end
function reinit!(cc::CellCache, i::Int)
- cc.cellid = i
+ cc.cellid[1] = i # TODO remove this in future versions
if cc.flags.nodes
resize!(cc.nodes, nnodes_per_cell(cc.grid, i))
cellnodes!(cc.nodes, cc.grid, i)
@@ -97,10 +97,10 @@ end
getnodes(cc::CellCache) = cc.nodes
getcoordinates(cc::CellCache) = cc.coords
celldofs(cc::CellCache) = cc.dofs
-cellid(cc::CellCache) = cc.cellid
+cellid(cc::CellCache) = cc.cellid[1]
# TODO: These should really be replaced with something better...
-nfacets(cc::CellCache) = nfacets(getcells(cc.grid, cc.cellid))
+nfacets(cc::CellCache) = nfacets(getcells(cc.grid, cellid(cc)))
"""
@@ -121,20 +121,20 @@ calling `reinit!(cache, fi::FacetIndex)`.
See also [`FacetIterator`](@ref).
"""
-mutable struct FacetCache{CC <: CellCache}
- const cc::CC # const for julia > 1.8
- const dofs::Vector{Int} # aliasing cc.dofs
- current_facet_id::Int
+struct FacetCache{CC <: CellCache, DVType, CFType}
+ cc::CC
+ dofs::DVType # aliasing cc.dofs
+ current_facet_id::CFType
end
function FacetCache(args...)
cc = CellCache(args...)
- return FacetCache(cc, cc.dofs, 0)
+ return FacetCache(cc, cc.dofs, [0])
end
function reinit!(fc::FacetCache, facet::BoundaryIndex)
cellid, facetid = facet
reinit!(fc.cc, cellid)
- fc.current_facet_id = facetid
+ fc.current_facet_id[1] = facetid
return nothing
end
@@ -148,7 +148,7 @@ for op in (:getnodes, :getcoordinates, :cellid, :celldofs)
end
@inline function reinit!(fv::FacetValues, fc::FacetCache)
- return reinit!(fv, fc.cc, fc.current_facet_id)
+ return reinit!(fv, fc.cc, fc.current_facet_id[1])
end
"""
diff --git a/docs/src/howto/gpu_heat_howto_literate.jl b/docs/src/howto/gpu_heat_howto_literate.jl
index 6efd99cdb..6931d0b25 100644
--- a/docs/src/howto/gpu_heat_howto_literate.jl
+++ b/docs/src/howto/gpu_heat_howto_literate.jl
@@ -68,7 +68,8 @@ end
cv_i = cv[task_index]
## Query work item cell cache. The call on the item initializes replaces the reinit! call.
- cc_i = cc[task_index](cellid)
+ cc_i = cc[task_index]
+ Ferrite.reinit!(cc_i, cellid)
## Query assembly buffer.
    Ke = view(Kes, i, :, :)

If this is more about keeping the `reinit!` workflow on the GPU I can add it, but it will increase memory pressure, because we store the cellid in global memory, just to eventually load it later from global memory instead of keeping it locally.
Note that we cannot do any shenanigans like resizing on the GPU.
KnutAM
left a comment
This is really cool, but took some time to get into :)
After going through this PR and making changes in #1319, I have the following overall thoughts.
1. I still think one container type should be enough, e.g. `SoAContainer`. Whether we should have `distribute_to_tasks` I'm not sure about, but it is nice to not be bound by a given type but rather have a function that creates the suitable container.
2. In general I think we need to provide new types for those that we cannot `reinit!` as usual (if the additional memory pressure degrades performance for having a global ref-value as a shared array), which means we need the `ImmutableCellCache` and equivalently `ImmutableFacetValues`. For the naming I'm not a real fan, since their contents is mutated, so it does give a false sense of safety that they are not mutated, i.e. that given `cc::ImmutableCellCache` then `cc(1)` and `cc(2)` are indeed different types. But the `SoAContainer` should then still return the same container type as what is provided, i.e. we shouldn't give it a `CellCache` and get an `ImmutableCellCache`, since these have different usages.
3. While the `cc(cellid)` is neat, I think it would be better to have `cc_i = reinit(cc, cellid)` instead. I also don't think we should define `reinit!(cc, cellid)`, since this only does a partial reinitialization, which can cause hard-to-catch bugs for users.
4. Seems like `dof_range` cannot be used at the moment, so an element routine for coupled problems would need to change. However, that can be fixed by a named tuple, see #1319.
5. It is not clear to me how this would work for a grid with mixed cell types, since we need type stability on the GPU. I think we should consider this from the beginning, as this might imply that we need a `SubGrid` structure as well?
6. Seems like the assembler code and the definition of `GPUConstraintHandler` should be moved to the KA extension? But I haven't had time to go through those parts in detail yet.
7. For the howto, I think this should be a tutorial, and then limited to a single method. Alternatives, such as a CUDA-specific kernel, can be put in a howto, or alternatively in a collapsible block. We should really provide an MWE for the users as the first tutorial. The same applies to the matrix-free approach; since we don't use it, we shouldn't introduce it here IMO. But we could add a how-to for solving with iterative solvers instead.
Thanks for taking time for a detailed review!
But this is not how Ferrite is designed, so I stuck to it to minimize surprise and stay consistent in the design. Note that we are always explicit about the types. We have
While I agree that the name is indeed not optimal, the last statement is not true.
Agreed on the latter. However, note that your proposal for
If I did not mess something up, it works. But I get your point. I have adopted a variant where we preserve both dispatches, by index and by symbol.
I have no good idea yet how this can be resolved without diverging from the original Ferrite design too much. This part has always been a PITA to me. I always partition the grid upfront (see https://github.com/JuliaHealth/Thunderbolt.jl/blob/001c324fb492422253deef9978a59255c1c94b8a/src/mesh/simple_meshes.jl#L58-L60) to sidestep the issues that the
Yes, but I still need to port these to KernelAbstractions (note that they use CUDA kernels right now).
I do not agree on this point. https://diataxis.fr/how-to-guides/ states
and below
The heat equation tutorial gives the basic competence, while the newly added piece teaches them how to do GPU assembly correctly. Does that make sense?
I played a bit around with subdomains, but these are really tricky. If we say each subdofhandler has one subgrid, then we either lose the global indexing, or we need to use a sparse vector, or we have dense vectors with dead elements (e.g. all nodes have index zero). The sparse vector seems to introduce some unnecessary overhead for the potential bounds checks. If we say we have some local indexing, then this could also be slightly confusing for users, as the
This is how I see the diff, so the `SoAContainer` doesn't affect the

```diff
- caches = CellCacheContainer(backend, dh, nworkers)
+ caches = SoAContainer(backend, ImmutableCellCache(dh), nworkers)
  cc_task = caches[taskid]::ImmutableCellCache
- cc_i = cc_task(cellid)
+ cc_i = reinit(cc_task, cellid)
```

I don't know a better name though.
This PR has a fundamentally different design from the previous PRs. Here I try to preserve the assembly routine as-is and modify the assembly loop around them. This PR is really just an MWE to show the direction for discussion purposes.
The main vehicle for this PR is a SOA transformation of the CellCache and CellValues objects. I actually like it, but it might not be the best one performance-wise.
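The SoA transformation mentioned here can be illustrated with a stripped-down example (both cache types and `to_soa` are illustrative, not the PR's actual types): per-worker structs become one struct whose fields carry a leading worker dimension, and each worker then operates on views of its own row.

```julia
# Array-of-structs view: one cache per worker.
struct CacheAoS
    nodes::Vector{Int}
    coords::Vector{Float64}
end

# Struct-of-arrays view: one shared cache, row i belongs to worker i.
struct CacheSoA
    nodes::Matrix{Int}      # n_workers × N
    coords::Matrix{Float64} # n_workers × N
end

# Replicate a template cache into the SoA layout for n_workers workers.
function to_soa(n_workers::Int, c::CacheAoS)
    return CacheSoA(repeat(c.nodes', n_workers, 1), repeat(c.coords', n_workers, 1))
end

# Worker i then uses e.g. view(soa.nodes, i, :) as its private scratch space.
soa = to_soa(4, CacheAoS([1, 2, 3], [0.5, 1.5, 2.5]))
```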
Closes #628.

TODOs

- `Ferrite.reinit!(cc_i::ImmutableCellCache, cellid::Integer)`