
Minimal GPU example which preserves the element routine #1291

Open
termi-official wants to merge 45 commits into master from do/gpu-mwe

Conversation

@termi-official
Member

@termi-official termi-official commented Feb 27, 2026

This PR has a fundamentally different design from the previous PRs. Here I try to preserve the assembly routine as-is and modify the assembly loop around it. This PR is really just an MWE to show the direction, for discussion purposes.

The main vehicle for this PR is an SoA (struct-of-arrays) transformation of the CellCache and CellValues objects. I actually like it, but it might not be the best one performance-wise.
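For intuition, here is a minimal sketch of what such an SoA transformation means (WorkBuffer is a made-up stand-in for the per-worker buffers inside CellValues/CellCache, not Ferrite API):

struct WorkBuffer{V <: AbstractVector{Float64}}
    dNdx::V
end

# AoS: one independent buffer per worker, the usual multithreading pattern.
aos = [WorkBuffer(zeros(4)) for _ in 1:8]

# SoA: all workers' buffers stacked along a leading dimension; worker i
# reads and writes a view into row i of the shared array.
dNdx_all = zeros(8, 4)
soa_view(i) = WorkBuffer(view(dNdx_all, i, :))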

Closes #628 .

TODOs

  • Subdomain support
  • Move device constraint handler to KA extension
  • Remove Ferrite.reinit!(cc_i::ImmutableCellCache, cellid::Integer)

@codecov

codecov Bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 58.08581% with 127 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.08%. Comparing base (ed2c8a8) to head (f3459a2).
⚠️ Report is 1 commit behind head on master.

Files with missing lines | Patch % | Lines
ext/FerriteCudaExt.jl | 33.33% | 76 Missing ⚠️
ext/FerriteKAExt/iterator.jl | 54.54% | 20 Missing ⚠️
src/soa_utils.jl | 50.00% | 16 Missing ⚠️
ext/FerriteKAExt/dof_handler.jl | 88.46% | 3 Missing ⚠️
ext/FerriteKAExt/soa_core.jl | 70.00% | 3 Missing ⚠️
src/Grid/grid.jl | 83.33% | 3 Missing ⚠️
ext/FerriteKAExt/device_grid.jl | 66.66% | 2 Missing ⚠️
ext/FerriteKAExt/adapt_core.jl | 88.88% | 1 Missing ⚠️
src/Dofs/ConstraintHandler.jl | 95.00% | 1 Missing ⚠️
src/FEValues/FacetValues.jl | 88.88% | 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1291      +/-   ##
==========================================
- Coverage   94.45%   89.08%   -5.38%     
==========================================
  Files          40       47       +7     
  Lines        6905     7410     +505     
==========================================
+ Hits         6522     6601      +79     
- Misses        383      809     +426     


Comment thread src/FEValues/common_values.jl Outdated
FunctionValues

- struct FunctionValues{DiffOrder, IP, N_t, dNdx_t, dNdξ_t, d2Ndx2_t, d2Ndξ2_t} <: AbstractValues
+ struct FunctionValues{DiffOrder, IP, Nx_t, Nξ_t, dNdx_t, dNdξ_t, d2Ndx2_t, d2Ndξ2_t} <: AbstractValues
Member Author


This is the only intrusive part of this PR. I have decided to allow the arrays to be stored here directly, so as not to add helper structs that just mirror the fields as arrays. Although one should note that passing an array here breaks almost all calls on the struct.

We can also explore the path of having helper structs and introducing some syntactic sugar to query the parts via e.g. fv[i] instead of get_substruct.

Comment thread src/Dofs/ConstraintHandler.jl
Comment thread src/soa_utils.jl
Comment thread ext/FerriteCudaExt.jl
@termi-official termi-official marked this pull request as ready for review February 28, 2026 01:24
Member

@KnutAM KnutAM left a comment


Cool 🚀

Initial round of feedback (excluding the parts regarding the generalized vector types in #1252); I haven't looked at the example very carefully yet.

Comment thread src/FEValues/common_values.jl Outdated
Comment thread src/Dofs/ConstraintHandler.jl Outdated
Comment thread src/Dofs/ConstraintHandler.jl Outdated
Comment thread src/Dofs/ConstraintHandler.jl Outdated
Comment thread src/Dofs/ConstraintHandler.jl Outdated
Comment thread src/Grid/grid.jl Outdated
Comment thread src/Grid/grid.jl Outdated
Comment thread src/soa_utils.jl
Member

@KnutAM KnutAM left a comment


This is really cool and I'm really excited for this feature!

I've now focused on the how-to to get a bit of an understanding of how this would look from a user perspective. So on the user-facing side we are introducing these APIs?

  1. adapt dispatches for DofHandler and ConstraintHandler
  2. Support for allocation and assembly of CuSparseMatrixCSC
  3. CellCacheContainer, CellValuesContainer, and FacetValuesContainer (assembler?)

(Where my suggestion is to replace no. 3 with distribute_to_workers or similar)

Did I miss any new API?

Comment thread docs/src/howto/gpu_heat_howto_literate.jl
Comment thread docs/src/howto/gpu_heat_howto_literate.jl
Comment on lines +182 to +183
cv_gpu = CellValuesContainer(backend, n_workers, cv)
cc_gpu = CellCacheContainer(backend, n_workers, dh_gpu)
Member


As discussed on Slack, I don't think we should add all these different versions/types as user-facing, but provide an API that can be overloaded with the specific type, which returns an AbstractArray whose element type is similar to the type of the argument. Putting it here for others to comment on.

Suggested change
- cv_gpu = CellValuesContainer(backend, n_workers, cv)
- cc_gpu = CellCacheContainer(backend, n_workers, dh_gpu)
+ cv_gpu = distribute_to_workers(backend, n_workers, cv)
+ cc_gpu = distribute_to_workers(backend, n_workers, CellCache(dh_gpu))
"""
    distribute_to_workers([backend], n_workers, x::Tx)

Create an `AbstractArray{T}` with length `n_workers` where each item can be mutated without race conditions between different workers/tasks. While `T == Tx` is not guaranteed, instances of `T` will typically support the same functionality as the original `x`.

Ferrite provides support for the following `x`s:

* [`CellCache`](@ref)
* [`CellValues`](@ref)
* [`FacetValues`](@ref)
* `AbstractAssembler` (I assume it will be needed due to the `# FIXME buffers` in the code; also no ref, as there are no docs for this one I think)

If no `backend` is provided, `T == Tx`, which is suitable for standard multithreading. Distributing for GPUs requires loading the relevant GPU package that provides a `backend`. Currently, the following backends are implemented:

* `CUDABackend()` (via the `CUDA.jl` extension)

"""
function distribute_to_workers end

xref: #1070 (CC @fredrikekre) for the non-GPU case, which would be solved by this. Of course the same naming issue as there applies; some alternatives to the above could be

  • distribute
  • distribute_to_tasks
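
For concreteness, a runnable CPU-only sketch of the no-backend fallback described in the suggested docstring (distribute_to_workers and the deepcopy strategy are part of the proposal here, not existing Ferrite API):

# No backend given: T == Tx, each worker gets an independent copy
# (the standard multithreading pattern).
distribute_to_workers(n_workers::Int, x) = [deepcopy(x) for _ in 1:n_workers]

# Usage: every task mutates only its own entry, so there are no race conditions.
buffers = distribute_to_workers(4, zeros(3))
buffers[2][1] = 1.0   # does not affect buffers[1]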

Member Author

@termi-official termi-official Mar 6, 2026


I am still not sure about this design; I also started with it, but so far it just hasn't turned out nicely.

For more context:

  1. cc_gpu = distribute_to_workers(backend, n_workers, CellCache(dh_gpu)) does not work so easily. First, the Vector types are hardcoded, so the CellCache(gpu_sdh) constructor would already return something that is not a CellCache, or we transfer the data back and forth, or the CellCache is just some other placeholder type.
  2. if we would widen the type here, it is still mutable and we cannot make the struct immutable without breaking public API (reinit! in CellCache).
  3. I cannot guarantee this signature will work for all that we need. How would you, for example, handle the assembler, as the existing ones are also not GPU compatible? And if we make it GPU compatible, why would I not directly construct them with the correct buffer sizes?

Member


  1. Minimal GPU example which preserves the element routine #1291 (comment)

In general, the way I read the tutorial, the approach for many objects is to (1) set up as usual (e.g. Grid, DofHandler, ConstraintHandler, etc.) and (2) adapt to the GPU device. So the distribute-to-workers interface is doing exactly that. Compared to the alternative, having a new public type for each item (CellValues, FacetValues, MultiFieldCellValues, CellCache, FacetCache, CSCAssembler, CSRAssembler), potentially different between backends, abstracting this away seems more attractive to me wrt. what we define and the number of objects the user has to deal with.

  1. ... or we transfer the data back and forth, or the CellCache is just some other placeholder type.

I think this transfer is OK, since it happens only during setup. And this follows the same logic as adapt in my view.

  3. I cannot guarantee this signature will work for all that we need. How would you, for example, handle the assembler, as the existing ones are also not GPU compatible?

The new assembler should be documented as thread-safe (as long as coloring is used), so there is no need to distribute it to workers; it can be used by all workers.

  3. ... And if we make it GPU compatible, why would I not directly construct them with the correct buffer sizes?

I would argue for the simplicity of the user interface, with the same pattern everywhere: create as standard, adapt to the GPU backend. Items with buffers/caches that are modified must be distributed to workers, just as with multithreading, and we provide a unified infrastructure to do this via distribute_to_workers.

Comment thread docs/src/howto/gpu_assembly.md
@termi-official
Member Author

Thanks for the review, Knut! Let me answer your questions in order first, before I get to the comments.

  1. adapt comes from Adapt.jl and is the recommended way to transfer data structures into GPU kernels. Essentially it does some switching like CuVector -> CuDeviceVector under the hood. You add this to the structs which will be used inside kernels, so this type of transformation is applied to all fields of your structs recursively (such that we also reach inner structs). So this is not really a new API which we introduce here, but more a way to add compatibility with the standard Julia GPU workflow (see the sketch after this list). A goal here was specifically to enable users to build their own kernels with KernelAbstractions.
  2. Technically speaking, the assembly into a CSC matrix would work out of the box with what we have. The blocker here is that the assembler structs are not thread-safe, so I added a hotfix for now: a function which directly assembles without the additional sort step. Note that I can also add the CSR format in this PR, as it is significantly faster for parallel applications (i.e. in solvers).
  3. Yes, these three structs are essentially the workhorses here. Given the refactoring that we (and especially you, Knut, for the CellValues) did over the last year, porting most parts became significantly easier than in previous attempts, and we can reuse many existing element routines out of the box.
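
As a minimal sketch of the Adapt.jl pattern from point 1 (ElementData here is an illustrative struct, not Ferrite API):

using Adapt

struct ElementData{V <: AbstractVector}
    weights::V
end

# Generates adapt_structure(to, x) = ElementData(adapt(to, x.weights)),
# i.e. the adaption is applied field by field.
Adapt.@adapt_structure ElementData

# CPU round trip: the range is converted to a Vector.
adapt(Array, ElementData(1:3))   # ElementData{Vector{Int64}}([1, 2, 3])
# With CUDA loaded, adapt(CuArray, ...) would wrap a CuArray instead, and
# inside a kernel the field would become a CuDeviceArray.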

That's it for the user-facing API. Then there is, right now, the internal API with get_substruct and as_structure_of_arrays powering the new user-facing structs. This API is subject to change, but the pattern should be clear. For the GPU it is really beneficial to have struct-of-arrays data types, while on the CPU we usually prefer arrays of structs, as we already propose in the threaded assembly tutorial. The difference in "optimal" data structures stems from the significant differences in hardware architecture. Does that make sense?

@KnutAM
Member

KnutAM commented Mar 6, 2026

(1) So this is not really a new API which we introduce here, but more a way to add compatibility with the standard Julia GPU workflow.

Yes, I think that makes perfect sense! (Just API in the sense that the Ferrite dispatches are part of the documented API.)

(2) Technically speaking the assembly into a CSC matrix would work out of the box with what we have.

Aha, so in that case shouldn't we have the same logic here as for the cell values and cache (in my notation, distribute_to_workers(backend, n_workers, assembler), though no need to add it here)? This PR then adds an assembler that doesn't need to be distributed to workers, since it doesn't have caches. So not necessarily a dispatch on start_assemble for the matrix type, but rather it could be explicitly constructed, something like start_assemble(CacheFreeAssembler, K, [f])?

(3) The difference here in "optimal" data structures stems from the significant differences in hardware architecture. Does that make sense?

Yes, but with the proposed distribute function that is what we would get. I think we didn't mention this in our Slack conversation, so just to have it here:

struct DistributedVals{CV, CV_internal} <: AbstractVector{CV}
    num::Int
    cv::CV_internal
    function DistributedVals{CV}(cv_internal::CV_internal, num::Int) where {CV, CV_internal}
        return new{CV, CV_internal}(num, cv_internal)
    end
end

function distribute_to_workers(backend, n_workers, cv::CellValues)
    cva = as_structure_of_arrays(backend, n_workers, cv)
    CV = typeof(get_substruct(cva, 1))
    return DistributedVals{CV}(cva, n_workers)
end

function Base.getindex(d::DistributedVals{CV}, i) where {CV}
    checkbounds(1:d.num, i)
    return get_substruct(d.cv, i)::CV
end

The reason I'm against exposing cvs = as_structure_of_arrays(backend, n_workers, cv) to users is that we then have e.g. a CellValues object, cvs, that doesn't support the methods defined for CellValues (or, even worse, in some cases supports them but silently gives the wrong result). As a pragmatic approach to avoid code duplication, I like the idea to (mis)use the structures to create structs of arrays, but users should never work with such adapted structures IMO.

Comment on lines +1 to +11
# NOTE CellCache is mutable and hence inherently incompatible with GPU. So here is the
# immutable variant. Making the CellCache immutable is considered breaking due to the reinit! API integration.
struct ImmutableCellCache{G <: AbstractGrid, SDH, IVT, VX}
    flags::UpdateFlags
    grid::G
    cellid::Int
    nodes::IVT
    coords::VX
    sdh::SDH
    dofs::IVT
end
Member


NOTE CellCache is mutable and hence inherently incompatible with GPU. So here is the immutable variant. Making the CellCache immutable is considered breaking due to the reinit! API integration.

I think we can, by changing

mutable struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}}
    const flags::UpdateFlags
    const grid::G
    # Pretty useless to store this since you have it already for the reinit! call, but
    # needed for the CellIterator(...) workflow since the user doesn't necessarily control
    # the loop order in the cell subset.
    cellid::Int
    const nodes::Vector{Int}
    const coords::Vector{X}
    const dh::DH
    const dofs::Vector{Int}
end
to

mutable struct Mutable{T}
    v::T
end
struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}, CID, NodeV <: AbstractVector{<:Integer}, CoordV <: AbstractVector{X}, DofV <: AbstractVector{<:Integer}}
    flags::UpdateFlags
    grid::G
    # Pretty useless to store this since you have it already for the reinit! call, but
    # needed for the CellIterator(...) workflow since the user doesn't necessarily control
    # the loop order in the cell subset.
    cellid::CID
    nodes::NodeV
    coords::CoordV
    dh::DH
    dofs::DofV
end

where CID is normally a Mutable{Int}, but for GPU this can be adapted to a view into a CuVector?

This basically re-introduces the old design, which I think makes sense here instead of duplicating?
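
A minimal sketch of how that Mutable{Int} field would behave (Mutable as defined in the snippet above; the GPU adapt rule is only described, not implemented):

mutable struct Mutable{T}
    v::T
end

cid = Mutable(-1)   # CPU default: a tiny mutable box
cid.v = 42          # reinit! mutates the box; the outer struct stays immutable

# On the GPU, an adapt rule would swap the box for a one-element view into a
# CuVector, so kernels write cellid_view[1] = 42 instead; both variants store
# one Int per cache without making the whole CellCache mutable.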

Member Author


I would go for the duplication here. Having everything funneled into a single data structure seems quite painful to maintain. My working patch here is:

diff --git a/ext/FerriteKAExt/adapt_core.jl b/ext/FerriteKAExt/adapt_core.jl
index cbccc539a..969a9ed75 100644
--- a/ext/FerriteKAExt/adapt_core.jl
+++ b/ext/FerriteKAExt/adapt_core.jl
@@ -1,6 +1,7 @@
 # This file contains adapt rules for all relevant data structures in Ferrite.jl which do
 # not need customized GPU data structures.
 
+Adapt.@adapt_structure CellCache
 Adapt.@adapt_structure CellValues
 Adapt.@adapt_structure Ferrite.GeometryMapping
 Adapt.@adapt_structure Ferrite.FunctionValues
diff --git a/ext/FerriteKAExt/iterator.jl b/ext/FerriteKAExt/iterator.jl
index 3f89ecc4a..d20f7f00f 100644
--- a/ext/FerriteKAExt/iterator.jl
+++ b/ext/FerriteKAExt/iterator.jl
@@ -34,9 +34,23 @@ function as_structure_of_arrays(backend, outer_dim, ::Type{CellCache}, sdh::Devi
         N = Ferrite.nnodes_per_cell(sdh)
         nodes = KA.zeros(backend, Int, outer_dim, N)
         coords = KA.zeros(backend, Vec{dim, get_coordinate_eltype(grid)}, outer_dim, N)
+        cellids = KA.zeros(backend, Int, outer_dim, 1)
         dofs = KA.zeros(backend, Int, outer_dim, n)
     end
-    return ImmutableCellCache(flags, grid, -1, nodes, coords, sdh, dofs)
+    return CellCache(flags, grid, cellids, nodes, coords, sdh, dofs)
+end
+function get_substruct(i, cc::CellCache, cellid)
+    return CellCache(
+        cc.flags, cc.grid, view(cc.cellid, i, :),
+        view(cc.nodes, i, :), view(cc.coords, i, :), cc.dh, view(cc.dofs, i, :)
+    )
+end
+function Ferrite.reinit!(cc::CellCache{<:Any,<:DeviceGrid}, cellid::Int)
+    cc.cellid[1] = cellid # TODO remove this in future versions
+    cc.flags.nodes  && Ferrite.cellnodes!(cc.nodes, cc.grid, cellid)
+    cc.flags.coords && Ferrite.getcoordinates!(cc.coords, cc.grid, cellid)
+    cc.dh !== nothing && cc.flags.dofs && Ferrite.celldofs!(cc.dofs, cc.dh, cellid)
+    return cc
 end
 
 function Ferrite.CellCache(backend, dh::HostDofHandler{dim}, flags::UpdateFlags = UpdateFlags()) where {dim}

diff --git a/src/iterators.jl b/src/iterators.jl
index f432455fa..7eb3f244a 100644
--- a/src/iterators.jl
+++ b/src/iterators.jl
@@ -32,24 +32,24 @@ cell. The cache is updated for a new cell by calling `reinit!(cache, cellid)` wh
 
 See also [`CellIterator`](@ref).
 """
-mutable struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}}
-    const flags::UpdateFlags
-    const grid::G
+struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}, CIDType, IVType, CVType <: AbstractArray{X}}
+    flags::UpdateFlags
+    grid::G
     # Pretty useless to store this since you have it already for the reinit! call, but
     # needed for the CellIterator(...) workflow since the user doesn't necessarily control
     # the loop order in the cell subset.
-    cellid::Int
-    const nodes::Vector{Int}
-    const coords::Vector{X}
-    const dh::DH
-    const dofs::Vector{Int}
+    cellid::CIDType
+    nodes::IVType
+    coords::CVType
+    dh::DH
+    dofs::IVType
 end
 
 function CellCache(grid::Grid{dim, C, T}, flags::UpdateFlags = UpdateFlags()) where {dim, C, T}
     N = nnodes_per_cell(grid, 1) # nodes and coords will be resized in `reinit!`
     nodes = zeros(Int, N)
     coords = zeros(Vec{dim, T}, N)
-    return CellCache(flags, grid, -1, nodes, coords, nothing, Int[])
+    return CellCache(flags, grid, [-1], nodes, coords, nothing, Int[])
 end
 
 function CellCache(dh::DofHandler{dim}, flags::UpdateFlags = UpdateFlags()) where {dim}
@@ -58,7 +58,7 @@ function CellCache(dh::DofHandler{dim}, flags::UpdateFlags = UpdateFlags()) wher
     nodes = zeros(Int, N)
     coords = zeros(Vec{dim, get_coordinate_eltype(get_grid(dh))}, N)
     celldofs = zeros(Int, n)
-    return CellCache(flags, get_grid(dh), -1, nodes, coords, dh, celldofs)
+    return CellCache(flags, get_grid(dh), [-1], nodes, coords, dh, celldofs)
 end
 
 function CellCache(sdh::SubDofHandler, flags::UpdateFlags = UpdateFlags())
@@ -67,7 +67,7 @@ function CellCache(sdh::SubDofHandler, flags::UpdateFlags = UpdateFlags())
 end
 
 function reinit!(cc::CellCache, i::Int)
-    cc.cellid = i
+    cc.cellid[1] = i # TODO remove this in future versions
     if cc.flags.nodes
         resize!(cc.nodes, nnodes_per_cell(cc.grid, i))
         cellnodes!(cc.nodes, cc.grid, i)
@@ -97,10 +97,10 @@ end
 getnodes(cc::CellCache) = cc.nodes
 getcoordinates(cc::CellCache) = cc.coords
 celldofs(cc::CellCache) = cc.dofs
-cellid(cc::CellCache) = cc.cellid
+cellid(cc::CellCache) = cc.cellid[1]
 
 # TODO: These should really be replaced with something better...
-nfacets(cc::CellCache) = nfacets(getcells(cc.grid, cc.cellid))
+nfacets(cc::CellCache) = nfacets(getcells(cc.grid, cellid(cc)))
 
 
 """
@@ -121,20 +121,20 @@ calling `reinit!(cache, fi::FacetIndex)`.
 
 See also [`FacetIterator`](@ref).
 """
-mutable struct FacetCache{CC <: CellCache}
-    const cc::CC  # const for julia > 1.8
-    const dofs::Vector{Int} # aliasing cc.dofs
-    current_facet_id::Int
+struct FacetCache{CC <: CellCache, DVType, CFType}
+    cc::CC
+    dofs::DVType # aliasing cc.dofs
+    current_facet_id::CFType
 end
 function FacetCache(args...)
     cc = CellCache(args...)
-    return FacetCache(cc, cc.dofs, 0)
+    return FacetCache(cc, cc.dofs, [0])
 end
 
 function reinit!(fc::FacetCache, facet::BoundaryIndex)
     cellid, facetid = facet
     reinit!(fc.cc, cellid)
-    fc.current_facet_id = facetid
+    fc.current_facet_id[1] = facetid
     return nothing
 end
 
@@ -148,7 +148,7 @@ for op in (:getnodes, :getcoordinates, :cellid, :celldofs)
 end
 
 @inline function reinit!(fv::FacetValues, fc::FacetCache)
-    return reinit!(fv, fc.cc, fc.current_facet_id)
+    return reinit!(fv, fc.cc, fc.current_facet_id[1])
 end
 
 """
 
diff --git a/docs/src/howto/gpu_heat_howto_literate.jl b/docs/src/howto/gpu_heat_howto_literate.jl
index 6efd99cdb..6931d0b25 100644
--- a/docs/src/howto/gpu_heat_howto_literate.jl
+++ b/docs/src/howto/gpu_heat_howto_literate.jl
@@ -68,7 +68,8 @@ end
         cv_i = cv[task_index]
 
         ## Query work item cell cache. The call on the item initializes replaces the reinit! call.
-        cc_i = cc[task_index](cellid)
+        cc_i = cc[task_index]
+        Ferrite.reinit!(cc_i, cellid)
 
         ## Query assembly buffer.
         Ke = view(Kes, i, :, :)

If this is more about keeping the reinit! workflow on the GPU, I can add it, but it will increase memory pressure, because we store the cellid in global memory just to eventually load it from global memory later instead of keeping it locally.

Member Author


Note that we cannot do any shenanigans like resizing on the GPU.

Member

@KnutAM KnutAM left a comment


This is really cool, but it took some time to get into :)

After going through this PR and making changes in #1319, I have the following overall thoughts.

  1. I still think one container type should be enough, e.g. SoAContainer. Whether we should have distribute_to_tasks I'm not sure about, but it is nice not to be bound to a given type but rather have a function that creates the suitable container.

  2. In general I think we need to provide new types for those that we cannot reinit! as usual then (if the additional memory pressure of having a global ref-value in a shared array degrades performance), which means we need the ImmutableCellCache and equivalently ImmutableFacetValues. For the naming I'm not a real fan, since their contents are mutated, so it gives a false sense of safety that they are not mutated, i.e. that given cc::ImmutableCellCache then cc(1) and cc(2) are indeed different types. But the SoAContainer should then still return the same container-type as what is provided, i.e. we shouldn't give it a CellCache and get an ImmutableCellCache since these have different usages.

  3. While the cc(cellid) is neat, I think it would be better to have cc_i = reinit(cc, cellid) instead (see the sketch after this list). I also don't think we should define reinit!(cc, cellid), since this only does a partial reinitialization, which can cause hard-to-catch bugs for users.

  4. Seems like dof_range cannot be used at the moment, so an element routine for coupled problems would need to change. However, that can be fixed by a named tuple, see #1319.

  5. It is not clear to me how this would work for a grid with mixed cell types, since we need type-stability on the GPU. I think we should consider this from the beginning, as this might imply that we need a SubGrid structure as well?

  6. Seems like the assembler code and the definition of GPUConstraintHandler should be moved to the KA extension? But I haven't had time to go through those parts in detail yet.

  7. For the howto, I think this should be a tutorial, and then limited to a single method. Alternatives, such as a CUDA-specific kernel, can be put in a howto, or alternatively in a collapsible block. We should really provide an MWE for the users as the first tutorial. The same applies to the matrix-free approach: since we don't use it, we shouldn't introduce it here IMO. But we could add a how-to for solving with iterative solvers instead.
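
Regarding point 3, a hedged sketch of what a non-mutating reinit could look like (hypothetical; it assumes the view-backed ImmutableCellCache from this PR and Ferrite's cellnodes!/getcoordinates!/celldofs! helpers, and omits the UpdateFlags checks for brevity):

function reinit(cc::ImmutableCellCache, cellid::Integer)
    # Fill the view-backed per-worker buffers for this cell ...
    Ferrite.cellnodes!(cc.nodes, cc.grid, cellid)
    Ferrite.getcoordinates!(cc.coords, cc.grid, cellid)
    Ferrite.celldofs!(cc.dofs, cc.sdh, cellid)
    # ... and return a new immutable instance carrying the cellid, avoiding a
    # global-memory round trip to recover it later.
    return ImmutableCellCache(cc.flags, cc.grid, cellid, cc.nodes, cc.coords, cc.sdh, cc.dofs)
end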

Comment thread src/soa_utils.jl
Comment thread docs/src/howto/gpu_heat_howto_literate.jl
Comment thread ext/FerriteCudaExt.jl
Comment thread ext/FerriteKAExt/soa_core.jl Outdated
Comment thread ext/FerriteKAExt/adapt_core.jl Outdated
@termi-official
Member Author

termi-official commented Apr 16, 2026

Thanks for taking the time for a detailed review!

1. I still think one container type should be enough, e.g. `SoAContainer`. Whether we should have `distribute_to_tasks` I'm not sure about, but it is nice not to be bound to a given type but rather have a function that creates the suitable container.

But this is not how Ferrite is designed, so I stuck with the existing style to minimize surprise and stay consistent in the design. Note that we are always explicit about the types: we have QuadratureRule, not create_quadrature_rule; CellValues and FacetValues, not create_fevalues; etc.

2. In general I think we need to provide new types for those that we cannot `reinit!` as usual then (if the additional memory pressure of having a global ref-value in a shared array degrades performance), which means we need the `ImmutableCellCache` and equivalently `ImmutableFacetValues`. For the naming I'm not a real fan, since their contents are mutated, so it gives a false sense of safety that they are not mutated, i.e. that given `cc::ImmutableCellCache` then `cc(1)` and `cc(2)` are indeed different types. [...]

While I agree that the name is indeed not optimal, the last statement is not true. cc(1) returns an ImmutableCellCache. as_structure_of_arrays returns an ImmutableCellCache, and in the SoAContainer design it would return an SoAContainer. This is indeed different from CellValues, for the technical reasons discussed before. However, I do not understand how using the SoAContainer fixes the reinit! issue.

3. While the `cc(cellid)` is neat, I think it would be better to have `cc_i = reinit(cc, cellid)` instead. I also don't think we should define `reinit!(cc, cellid)`, since this only does a partial reinitialization which can cause hard-to-catch bugs for users.

Agreed on the latter. However, note that your proposal for reinit is a variant of the current design, where I decided to overload the function call instead. We can go for the reinit function call to be more consistent, though.

4. Seems like `dof_range` cannot be used at the moment, so an element routine for coupled problems would need to change. However, that can be fixed by a named tuple, see #1319.

If I didn't mess up, it works. But I get your point. I have adopted a variant where we preserve both dispatches, by index and by symbol.

5. It is not clear to me how this would work for a grid with mixed cell types, since we need type-stability on the GPU. I think we should consider this from the beginning, as this might imply that we need a `SubGrid` structure as well?

I have no good idea yet how this can be resolved without diverging too much from the original Ferrite design. This part has always been a PITA for me. I always partition the grid upfront (see https://github.com/JuliaHealth/Thunderbolt.jl/blob/001c324fb492422253deef9978a59255c1c94b8a/src/mesh/simple_meshes.jl#L58-L60) to sidestep the issue that the cells array is a potential junk box.

6. Seems like the assembler code and definition of `GPUConstraintHandler` should be moved to the KA-extension? But I haven't had time to go through those parts in details yet.

Yes, but I still need to port these to KernelAbstractions (note that they use CUDA kernels right now).

7. For the howto, I think this should be a tutorial, and then limited to a single method. Alternatives, such as a CUDA-specific kernel, can be put in a howto, or alternatively in a collapsible block. We should really provide an MWE for the users as the first tutorial. The same applies to the matrix-free approach: since we don't use it, we shouldn't introduce it here IMO. But we could add a how-to for solving with iterative solvers instead.

I do not agree on this point. https://diataxis.fr/how-to-guides/ states

Every how-to guide should answer to a human project, in other words. It should show what the human needs to do, with the tools at hand, to obtain the result they need.

and below

A tutorial’s purpose is to help the pupil acquire basic competence. A how-to guide’s purpose is to help the already-competent user perform a particular task correctly.

The heat equation tutorial gives the basic competence, while the newly added piece teaches them how to do GPU assembly correctly. Does that make sense?

@termi-official
Member Author

I played around a bit with subdomains, but these are really tricky.

If we say each subdofhandler has one subgrid, then we either lose the global indexing, or we need to use a sparse vector, or we have dense vectors with dead elements (e.g. all nodes have index zero). The sparse vector seems to introduce some unnecessary overhead for the potential bounds checks. If we go with some local indexing instead, this could also be slightly confusing for users, as the reinit! call will then run on the local index, but the cellid call should still return the global index for correctness reasons (as users might use the cell index for per-element data).

@KnutAM
Member

KnutAM commented Apr 21, 2026

  • SoAContainer: The idea of having a SoAContainer is to have a single type that acts as the AbstractVector, with the element type being the structure with views into the arrays, instead of introducing a new type for every item type. So I think of this as having Vector{T} instead of VectorOfCellValues, VectorOfFacetValues, etc.

  • distribute_to_tasks(backend, item, n_workers) is about taking an initial item and distributing it to each task in a suitable way. This is separate from the SoAContainer, but it would return an SoAContainer if we use that. As noted above, I think this is optional (but it provides a nice infrastructure for distributing the items in the "right" way).

  • cc(cellid) - sorry, there was an important typo here :D

    given cc::ImmutableCellCache then cc(1) and cc(2) are indeed different types.

    I meant to write that cc(1) and cc(2) return different objects, but that these have references to the same underlying arrays, which are modified by the indexing. Which to me seems very surprising for a user. I think it is natural to assume that cc1 = cc(1) and cc2 = cc(2) are independent. Sorry about that confusion!

as_structure_of_arrays returns an ImmutableCellCache, and in the SoAContainer design it would return an SoAContainer.

This is how I see the diff, so the SoAContainer doesn't affect the reinit; at that point you've already indexed into the container to get the view-representation for the ImmutableCellCache.

- caches = CellCacheContainer(backend, dh, nworkers)
+ caches = SoAContainer(backend, ImmutableCellCache(dh), nworkers)

cc_task = caches[taskid]::ImmutableCellCache

- cc_i = cc_task(cellid)
+ cc_i = reinit(cc_task, cellid)

I don't know a better name though.


Development

Successfully merging this pull request may close these issues.

GPU assembly example
