
Minimal GPU example which preserves the element routine #1291

Open
termi-official wants to merge 45 commits into master from do/gpu-mwe

Conversation

@termi-official
Member

@termi-official termi-official commented Feb 27, 2026

This PR has a fundamentally different design from the previous PRs. Here I try to preserve the assembly routine as-is and modify the assembly loop around it. This PR is really just an MWE to show the direction, for discussion purposes.

The main vehicle for this PR is an SoA (struct-of-arrays) transformation of the CellCache and CellValues objects. I actually like it, but it might not be the best one performance-wise.
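For intuition, here is a minimal sketch of what such an SoA transformation means (WorkBuffer is a made-up stand-in for the per-worker buffers inside CellValues/CellCache, not Ferrite API):

struct WorkBuffer{V <: AbstractVector{Float64}}
    dNdx::V
end

# AoS: one independent buffer per worker, the usual multithreading pattern.
aos = [WorkBuffer(zeros(4)) for _ in 1:8]

# SoA: all workers' buffers stacked along a leading dimension; worker i
# reads and writes a view into row i of the shared array.
dNdx_all = zeros(8, 4)
soa_view(i) = WorkBuffer(view(dNdx_all, i, :))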

Closes #628 .

TODOs

  • Subdomain support
  • Move device constraint handler to KA extension
  • Remove Ferrite.reinit!(cc_i::ImmutableCellCache, cellid::Integer)

@codecov

codecov Bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 58.08581% with 127 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.08%. Comparing base (ed2c8a8) to head (f3459a2).
⚠️ Report is 1 commit behind head on master.

Files with missing lines | Patch % | Lines
ext/FerriteCudaExt.jl | 33.33% | 76 Missing ⚠️
ext/FerriteKAExt/iterator.jl | 54.54% | 20 Missing ⚠️
src/soa_utils.jl | 50.00% | 16 Missing ⚠️
ext/FerriteKAExt/dof_handler.jl | 88.46% | 3 Missing ⚠️
ext/FerriteKAExt/soa_core.jl | 70.00% | 3 Missing ⚠️
src/Grid/grid.jl | 83.33% | 3 Missing ⚠️
ext/FerriteKAExt/device_grid.jl | 66.66% | 2 Missing ⚠️
ext/FerriteKAExt/adapt_core.jl | 88.88% | 1 Missing ⚠️
src/Dofs/ConstraintHandler.jl | 95.00% | 1 Missing ⚠️
src/FEValues/FacetValues.jl | 88.88% | 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1291      +/-   ##
==========================================
- Coverage   94.45%   89.08%   -5.38%     
==========================================
  Files          40       47       +7     
  Lines        6905     7410     +505     
==========================================
+ Hits         6522     6601      +79     
- Misses        383      809     +426     


Comment thread src/FEValues/common_values.jl Outdated
FunctionValues

- struct FunctionValues{DiffOrder, IP, N_t, dNdx_t, dNdξ_t, d2Ndx2_t, d2Ndξ2_t} <: AbstractValues
+ struct FunctionValues{DiffOrder, IP, Nx_t, Nξ_t, dNdx_t, dNdξ_t, d2Ndx2_t, d2Ndξ2_t} <: AbstractValues
Member Author


This is the only intrusive part of this PR. I have decided to allow the arrays to be stored here directly, so as not to add helper structs that just mirror the fields as arrays. Although one should note that passing an array here breaks almost all calls on the struct.

We can also explore the path of having helper structs and introducing some syntactic sugar to query the parts via e.g. fv[i] instead of get_substruct.

Comment thread src/Dofs/ConstraintHandler.jl
Comment thread src/soa_utils.jl
Comment thread ext/FerriteCudaExt.jl
@termi-official termi-official marked this pull request as ready for review February 28, 2026 01:24
Member

@KnutAM KnutAM left a comment


Cool 🚀

Initial round of feedback (excluding the parts regarding the generalized vector types in #1252); I haven't looked at the example very carefully yet.

Comment thread src/FEValues/common_values.jl Outdated
Comment thread src/Dofs/ConstraintHandler.jl Outdated
Comment thread src/Dofs/ConstraintHandler.jl Outdated
Comment thread src/Dofs/ConstraintHandler.jl Outdated
Comment thread src/Dofs/ConstraintHandler.jl Outdated
Comment thread src/Grid/grid.jl Outdated
Comment thread src/Grid/grid.jl Outdated
Comment thread src/soa_utils.jl
Member

@KnutAM KnutAM left a comment


This is really cool and I'm really excited for this feature!

I've now focused on the how-to to get a bit of an understanding of how this would look from a user perspective. So on the user-facing side we are introducing these APIs?

  1. adapt dispatches for DofHandler and ConstraintHandler
  2. Support for allocation and assembly of CuSparseMatrixCSC
  3. CellCacheContainer, CellValuesContainer, and FacetValuesContainer (assembler?)

(Where my suggestion is to replace no. 3 with distribute_to_workers or similar)

Did I miss any new API?

Comment thread docs/src/howto/gpu_heat_howto_literate.jl
Comment thread docs/src/howto/gpu_heat_howto_literate.jl
Comment on lines +182 to +183
cv_gpu = CellValuesContainer(backend, n_workers, cv)
cc_gpu = CellCacheContainer(backend, n_workers, dh_gpu)
Member


As discussed on Slack, I don't think we should add all these different versions/types as user-facing, but provide an API that can be overloaded with the specific type, which returns an AbstractArray whose element type is similar to the type of the argument. Putting it here for others to comment on.

Suggested change
- cv_gpu = CellValuesContainer(backend, n_workers, cv)
- cc_gpu = CellCacheContainer(backend, n_workers, dh_gpu)
+ cv_gpu = distribute_to_workers(backend, n_workers, cv)
+ cc_gpu = distribute_to_workers(backend, n_workers, CellCache(dh_gpu))
"""
    distribute_to_workers([backend], n_workers, x::Tx)

Create an `AbstractArray{T}` with length `n_workers` where each item can be mutated without race conditions between different workers/tasks. While `T == Tx` is not guaranteed, instances of `T` will typically support the same functionality as the original `x`.

Ferrite provides support for the following `x`s:

* [`CellCache`](@ref)
* [`CellValues`](@ref)
* [`FacetValues`](@ref)
* `AbstractAssembler` (I assume it will be needed due to the `# FIXME buffers` in the code; also no ref, as there are no docs for this one I think)

If no `backend` is provided, `T == Tx`, which is suitable for standard multithreading. Distributing for GPUs requires loading the relevant GPU package that provides a `backend`. Currently, the following backends are implemented:

* `CUDABackend()` (via the `CUDA.jl` extension)

"""
function distribute_to_workers end

xref: #1070 (CC @fredrikekre) for the non-GPU case, which would be solved by this. Of course the same naming issue as there applies; some alternatives to the above could be

  • distribute
  • distribute_to_tasks
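
For concreteness, a runnable CPU-only sketch of the no-backend fallback described in the suggested docstring (distribute_to_workers and the deepcopy strategy are part of the proposal here, not existing Ferrite API):

# No backend given: T == Tx, each worker gets an independent copy
# (the standard multithreading pattern).
distribute_to_workers(n_workers::Int, x) = [deepcopy(x) for _ in 1:n_workers]

# Usage: every task mutates only its own entry, so there are no race conditions.
buffers = distribute_to_workers(4, zeros(3))
buffers[2][1] = 1.0   # does not affect buffers[1]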

Member Author

@termi-official termi-official Mar 6, 2026


I am still not sure about this design; I also started with it, but so far it just hasn't turned out nicely.

For more context:

  1. cc_gpu = distribute_to_workers(backend, n_workers, CellCache(dh_gpu)) does not work so easily. First, the Vector types are hardcoded, so the CellCache(gpu_sdh) constructor would already return something that is not a CellCache, or we transfer the data back and forth, or the CellCache is just some other placeholder type.
  2. if we would widen the type here, it is still mutable and we cannot make the struct immutable without breaking public API (reinit! in CellCache).
  3. I cannot guarantee this signature will work for all that we need. How would you, for example, handle the assembler, as the existing ones are also not GPU compatible? And if we make it GPU compatible, why would I not directly construct them with the correct buffer sizes?

Member


  1. Minimal GPU example which preserves the element routine #1291 (comment)

In general, the way I read the tutorial, the approach for many objects is to (1) set up as usual (e.g. Grid, DofHandler, ConstraintHandler, etc.) and (2) adapt to the GPU device. So the distribute-to-workers interface is doing exactly that. Compared to the alternative, having a new public type for each item (CellValues, FacetValues, MultiFieldCellValues, CellCache, FacetCache, CSCAssembler, CSRAssembler), potentially different between backends, abstracting this away seems more attractive to me wrt. what we define and the number of objects the user has to deal with.

  1. ... or we transfer the data back and forth, or the CellCache is just some other placeholder type.

I think this transfer is OK, since it happens only during setup. And this follows the same logic as adapt in my view.

  3. I cannot guarantee this signature will work for all that we need. How would you, for example, handle the assembler, as the existing ones are also not GPU compatible?

The new assembler should be documented as thread-safe (as long as coloring is used), so there is no need to distribute it to workers; it can be used by all workers.

  3. ... And if we make it GPU compatible, why would I not directly construct them with the correct buffer sizes?

I would argue for the simplicity of the user interface, with the same pattern everywhere: create as standard, adapt to the GPU backend. Items with buffers/caches that are modified must be distributed to workers, just as with multithreading, and we provide a unified infrastructure to do this via distribute_to_workers.

Comment thread docs/src/howto/gpu_assembly.md
@termi-official
Member Author

Thanks for the review, Knut! Let me answer your questions in order first, before I get to the comments.

  1. adapt comes from Adapt.jl and is the recommended way to transfer data structures into GPU kernels. Essentially it does some switching like CuVector -> CuDeviceVector under the hood. You add this to the structs which will be used inside kernels, so this type of transformation is applied to all fields of your structs recursively (such that we also reach inner structs). So this is not really a new API which we introduce here, but more a way to add compatibility with the standard Julia GPU workflow (see the sketch after this list). A goal here was specifically to enable users to build their own kernels with KernelAbstractions.
  2. Technically speaking, the assembly into a CSC matrix would work out of the box with what we have. The blocker here is that the assembler structs are not thread-safe, so I added a hotfix for now: a function which directly assembles without the additional sort step. Note that I can also add the CSR format in this PR, as it is significantly faster for parallel applications (i.e. in solvers).
  3. Yes, these three structs are essentially the workhorses here. Given the refactoring that we (and especially you, Knut, for the CellValues) did over the last year, porting most parts became significantly easier than in previous attempts, and we can reuse many existing element routines out of the box.
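
As a minimal sketch of the Adapt.jl pattern from point 1 (ElementData here is an illustrative struct, not Ferrite API):

using Adapt

struct ElementData{V <: AbstractVector}
    weights::V
end

# Generates adapt_structure(to, x) = ElementData(adapt(to, x.weights)),
# i.e. the adaption is applied field by field.
Adapt.@adapt_structure ElementData

# CPU round trip: the range is converted to a Vector.
adapt(Array, ElementData(1:3))   # ElementData{Vector{Int64}}([1, 2, 3])
# With CUDA loaded, adapt(CuArray, ...) would wrap a CuArray instead, and
# inside a kernel the field would become a CuDeviceArray.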

That's it for the user-facing API. Then there is, right now, the internal API with get_substruct and as_structure_of_arrays powering the new user-facing structs. This API is subject to change, but the pattern should be clear. For the GPU it is really beneficial to have struct-of-arrays data types, while on the CPU we usually prefer arrays of structs, as we already propose in the threaded assembly tutorial. The difference in "optimal" data structures stems from the significant differences in hardware architecture. Does that make sense?

@KnutAM
Member

KnutAM commented Mar 6, 2026

(1) So this is not really a new API which we introduce here, but more a way to add compatibility with the standard Julia GPU workflow.

Yes, I think that makes perfect sense! (Just API in the sense that the Ferrite dispatches are part of the documented API.)

(2) Technically speaking the assembly into a CSC matrix would work out of the box with what we have.

Aha, so in that case shouldn't we have the same logic here as for the cell values and cache (in my notation, distribute_to_workers(backend, n_workers, assembler), though no need to add it here)? This PR then adds an assembler that doesn't need to be distributed to workers, since it doesn't have caches. So not necessarily a dispatch on start_assemble for the matrix type, but rather it could be explicitly constructed, something like start_assemble(CacheFreeAssembler, K, [f])?

(3) The difference here in "optimal" data structures stems from the significant differences in hardware architecture. Does that make sense?

Yes, but with the proposed distribute function that is what we would get. I think we didn't mention this in our Slack conversation, so just to have it here:

struct DistributedVals{CV, CV_internal} <: AbstractVector{CV}
    num::Int
    cv::CV_internal
    function DistributedVals{CV}(cv_internal::CV_internal, num::Int) where {CV, CV_internal}
        return new{CV, CV_internal}(num, cv_internal)
    end
end

function distribute_to_workers(backend, n_workers, cv::CellValues)
    cva = as_structure_of_arrays(backend, n_workers, cv)
    CV = typeof(get_substruct(cva, 1))
    return DistributedVals{CV}(cva, n_workers)
end

function Base.getindex(d::DistributedVals{CV}, i) where {CV}
    checkbounds(1:d.num, i)
    return get_substruct(d.cv, i)::CV
end

The reason I'm against exposing cvs = as_structure_of_arrays(backend, n_workers, cv) to users is that we then have e.g. a CellValues object, cvs, that doesn't support the methods defined for CellValues (or, even worse, in some cases supports them but silently gives the wrong result). As a pragmatic approach to avoid code duplication, I like the idea to (mis)use the structures to create structs of arrays, but users should never work with such adapted structures IMO.

Comment on lines +1 to +11
# NOTE CellCache is mutable and hence inherently incompatible with GPU. So here is the
# immutable variant. Making the CellCache immutable is considered breaking due to the reinit! API integration.
struct ImmutableCellCache{G <: AbstractGrid, SDH, IVT, VX}
    flags::UpdateFlags
    grid::G
    cellid::Int
    nodes::IVT
    coords::VX
    sdh::SDH
    dofs::IVT
end
Member


NOTE CellCache is mutable and hence inherently incompatible with GPU. So here is the immutable variant. Making the CellCache immutable is considered breaking due to the reinit! API integration.

I think we can, by changing

mutable struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}}
    const flags::UpdateFlags
    const grid::G
    # Pretty useless to store this since you have it already for the reinit! call, but
    # needed for the CellIterator(...) workflow since the user doesn't necessarily control
    # the loop order in the cell subset.
    cellid::Int
    const nodes::Vector{Int}
    const coords::Vector{X}
    const dh::DH
    const dofs::Vector{Int}
end
to

mutable struct Mutable{T}
    v::T
end
struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}, CID, NodeV <: AbstractVector{<:Integer}, CoordV <: AbstractVector{X}, DofV <: AbstractVector{<:Integer}}
    flags::UpdateFlags
    grid::G
    # Pretty useless to store this since you have it already for the reinit! call, but
    # needed for the CellIterator(...) workflow since the user doesn't necessarily control
    # the loop order in the cell subset.
    cellid::CID
    nodes::NodeV
    coords::CoordV
    dh::DH
    dofs::DofV
end

where CID is normally a Mutable{Int}, but for GPU this can be adapted to a view into a CuVector?

This basically re-introduces the old design, which I think makes sense here instead of duplicating?
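
A minimal sketch of how that Mutable{Int} field would behave (Mutable as defined in the snippet above; the GPU adapt rule is only described, not implemented):

mutable struct Mutable{T}
    v::T
end

cid = Mutable(-1)   # CPU default: a tiny mutable box
cid.v = 42          # reinit! mutates the box; the outer struct stays immutable

# On the GPU, an adapt rule would swap the box for a one-element view into a
# CuVector, so kernels write cellid_view[1] = 42 instead; both variants store
# one Int per cache without making the whole CellCache mutable.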

Member Author


I would go for the duplication here. Having everything funneled into a single data structure seems quite painful to maintain. My working patch here is:

diff --git a/ext/FerriteKAExt/adapt_core.jl b/ext/FerriteKAExt/adapt_core.jl
index cbccc539a..969a9ed75 100644
--- a/ext/FerriteKAExt/adapt_core.jl
+++ b/ext/FerriteKAExt/adapt_core.jl
@@ -1,6 +1,7 @@
 # This file contains adapt rules for all relevant data structures in Ferrite.jl which do
 # not need customized GPU data structures.
 
+Adapt.@adapt_structure CellCache
 Adapt.@adapt_structure CellValues
 Adapt.@adapt_structure Ferrite.GeometryMapping
 Adapt.@adapt_structure Ferrite.FunctionValues
diff --git a/ext/FerriteKAExt/iterator.jl b/ext/FerriteKAExt/iterator.jl
index 3f89ecc4a..d20f7f00f 100644
--- a/ext/FerriteKAExt/iterator.jl
+++ b/ext/FerriteKAExt/iterator.jl
@@ -34,9 +34,23 @@ function as_structure_of_arrays(backend, outer_dim, ::Type{CellCache}, sdh::Devi
         N = Ferrite.nnodes_per_cell(sdh)
         nodes = KA.zeros(backend, Int, outer_dim, N)
         coords = KA.zeros(backend, Vec{dim, get_coordinate_eltype(grid)}, outer_dim, N)
+        cellids = KA.zeros(backend, Int, outer_dim, 1)
         dofs = KA.zeros(backend, Int, outer_dim, n)
     end
-    return ImmutableCellCache(flags, grid, -1, nodes, coords, sdh, dofs)
+    return CellCache(flags, grid, cellids, nodes, coords, sdh, dofs)
+end
+function get_substruct(i, cc::CellCache, cellid)
+    return CellCache(
+        cc.flags, cc.grid, view(cc.cellid, i, :),
+        view(cc.nodes, i, :), view(cc.coords, i, :), cc.dh, view(cc.dofs, i, :)
+    )
+end
+function Ferrite.reinit!(cc::CellCache{<:Any,<:DeviceGrid}, cellid::Int)
+    cc.cellid[1] = cellid # TODO remove this in future versions
+    cc.flags.nodes  && Ferrite.cellnodes!(cc.nodes, cc.grid, cellid)
+    cc.flags.coords && Ferrite.getcoordinates!(cc.coords, cc.grid, cellid)
+    cc.dh !== nothing && cc.flags.dofs && Ferrite.celldofs!(cc.dofs, cc.dh, cellid)
+    return cc
 end
 
 function Ferrite.CellCache(backend, dh::HostDofHandler{dim}, flags::UpdateFlags = UpdateFlags()) where {dim}

diff --git a/src/iterators.jl b/src/iterators.jl
index f432455fa..7eb3f244a 100644
--- a/src/iterators.jl
+++ b/src/iterators.jl
@@ -32,24 +32,24 @@ cell. The cache is updated for a new cell by calling `reinit!(cache, cellid)` wh
 
 See also [`CellIterator`](@ref).
 """
-mutable struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}}
-    const flags::UpdateFlags
-    const grid::G
+struct CellCache{X, G <: AbstractGrid, DH <: Union{AbstractDofHandler, Nothing}, CIDType, IVType, CVType <: AbstractArray{X}}
+    flags::UpdateFlags
+    grid::G
     # Pretty useless to store this since you have it already for the reinit! call, but
     # needed for the CellIterator(...) workflow since the user doesn't necessarily control
     # the loop order in the cell subset.
-    cellid::Int
-    const nodes::Vector{Int}
-    const coords::Vector{X}
-    const dh::DH
-    const dofs::Vector{Int}
+    cellid::CIDType
+    nodes::IVType
+    coords::CVType
+    dh::DH
+    dofs::IVType
 end
 
 function CellCache(grid::Grid{dim, C, T}, flags::UpdateFlags = UpdateFlags()) where {dim, C, T}
     N = nnodes_per_cell(grid, 1) # nodes and coords will be resized in `reinit!`
     nodes = zeros(Int, N)
     coords = zeros(Vec{dim, T}, N)
-    return CellCache(flags, grid, -1, nodes, coords, nothing, Int[])
+    return CellCache(flags, grid, [-1], nodes, coords, nothing, Int[])
 end
 
 function CellCache(dh::DofHandler{dim}, flags::UpdateFlags = UpdateFlags()) where {dim}
@@ -58,7 +58,7 @@ function CellCache(dh::DofHandler{dim}, flags::UpdateFlags = UpdateFlags()) wher
     nodes = zeros(Int, N)
     coords = zeros(Vec{dim, get_coordinate_eltype(get_grid(dh))}, N)
     celldofs = zeros(Int, n)
-    return CellCache(flags, get_grid(dh), -1, nodes, coords, dh, celldofs)
+    return CellCache(flags, get_grid(dh), [-1], nodes, coords, dh, celldofs)
 end
 
 function CellCache(sdh::SubDofHandler, flags::UpdateFlags = UpdateFlags())
@@ -67,7 +67,7 @@ function CellCache(sdh::SubDofHandler, flags::UpdateFlags = UpdateFlags())
 end
 
 function reinit!(cc::CellCache, i::Int)
-    cc.cellid = i
+    cc.cellid[1] = i # TODO remove this in future versions
     if cc.flags.nodes
         resize!(cc.nodes, nnodes_per_cell(cc.grid, i))
         cellnodes!(cc.nodes, cc.grid, i)
@@ -97,10 +97,10 @@ end
 getnodes(cc::CellCache) = cc.nodes
 getcoordinates(cc::CellCache) = cc.coords
 celldofs(cc::CellCache) = cc.dofs
-cellid(cc::CellCache) = cc.cellid
+cellid(cc::CellCache) = cc.cellid[1]
 
 # TODO: These should really be replaced with something better...
-nfacets(cc::CellCache) = nfacets(getcells(cc.grid, cc.cellid))
+nfacets(cc::CellCache) = nfacets(getcells(cc.grid, cellid(cc)))
 
 
 """
@@ -121,20 +121,20 @@ calling `reinit!(cache, fi::FacetIndex)`.
 
 See also [`FacetIterator`](@ref).
 """
-mutable struct FacetCache{CC <: CellCache}
-    const cc::CC  # const for julia > 1.8
-    const dofs::Vector{Int} # aliasing cc.dofs
-    current_facet_id::Int
+struct FacetCache{CC <: CellCache, DVType, CFType}
+    cc::CC
+    dofs::DVType # aliasing cc.dofs
+    current_facet_id::CFType
 end
 function FacetCache(args...)
     cc = CellCache(args...)
-    return FacetCache(cc, cc.dofs, 0)
+    return FacetCache(cc, cc.dofs, [0])
 end
 
 function reinit!(fc::FacetCache, facet::BoundaryIndex)
     cellid, facetid = facet
     reinit!(fc.cc, cellid)
-    fc.current_facet_id = facetid
+    fc.current_facet_id[1] = facetid
     return nothing
 end
 
@@ -148,7 +148,7 @@ for op in (:getnodes, :getcoordinates, :cellid, :celldofs)
 end
 
 @inline function reinit!(fv::FacetValues, fc::FacetCache)
-    return reinit!(fv, fc.cc, fc.current_facet_id)
+    return reinit!(fv, fc.cc, fc.current_facet_id[1])
 end
 
 """
 
diff --git a/docs/src/howto/gpu_heat_howto_literate.jl b/docs/src/howto/gpu_heat_howto_literate.jl
index 6efd99cdb..6931d0b25 100644
--- a/docs/src/howto/gpu_heat_howto_literate.jl
+++ b/docs/src/howto/gpu_heat_howto_literate.jl
@@ -68,7 +68,8 @@ end
         cv_i = cv[task_index]
 
         ## Query work item cell cache. The call on the item initializes replaces the reinit! call.
-        cc_i = cc[task_index](cellid)
+        cc_i = cc[task_index]
+        Ferrite.reinit!(cc_i, cellid)
 
         ## Query assembly buffer.
         Ke = view(Kes, i, :, :)

If this is more about keeping the reinit! workflow on the GPU, I can add it, but it will increase memory pressure, because we store the cellid in global memory just to eventually load it from global memory later instead of keeping it locally.

Member Author


Note that we cannot do any shenanigans like resizing on the GPU.

Member

@KnutAM KnutAM left a comment


This is really cool, but it took some time to get into :)

After going through this PR and making changes in #1319, I have the following overall thoughts.

  1. I still think one container type should be enough, e.g. SoAContainer. Whether we should have distribute_to_tasks I'm not sure about, but it is nice not to be bound to a given type but rather have a function that creates the suitable container.

  2. In general I think we need to provide new types for those that we cannot reinit! as usual then (if the additional memory pressure of having a global ref-value in a shared array degrades performance), which means we need the ImmutableCellCache and equivalently ImmutableFacetValues. For the naming I'm not a real fan, since their contents are mutated, so it gives a false sense of safety that they are not mutated, i.e. that given cc::ImmutableCellCache then cc(1) and cc(2) are indeed different types. But the SoAContainer should then still return the same container-type as what is provided, i.e. we shouldn't give it a CellCache and get an ImmutableCellCache since these have different usages.

  3. While the cc(cellid) is neat, I think it would be better to have cc_i = reinit(cc, cellid) instead (see the sketch after this list). I also don't think we should define reinit!(cc, cellid), since this only does a partial reinitialization, which can cause hard-to-catch bugs for users.

  4. Seems like dof_range cannot be used at the moment, so an element routine for coupled problems would need to change. However, that can be fixed by a named tuple, see #1319.

  5. It is not clear to me how this would work for a grid with mixed cell types, since we need type-stability on the GPU. I think we should consider this from the beginning, as this might imply that we need a SubGrid structure as well?

  6. Seems like the assembler code and the definition of GPUConstraintHandler should be moved to the KA extension? But I haven't had time to go through those parts in detail yet.

  7. For the howto, I think this should be a tutorial, and then limited to a single method. Alternatives, such as a CUDA-specific kernel, can be put in a howto, or alternatively in a collapsible block. We should really provide an MWE for the users as the first tutorial. The same applies to the matrix-free approach: since we don't use it, we shouldn't introduce it here IMO. But we could add a how-to for solving with iterative solvers instead.
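
Regarding point 3, a hedged sketch of what a non-mutating reinit could look like (hypothetical; it assumes the view-backed ImmutableCellCache from this PR and Ferrite's cellnodes!/getcoordinates!/celldofs! helpers, and omits the UpdateFlags checks for brevity):

function reinit(cc::ImmutableCellCache, cellid::Integer)
    # Fill the view-backed per-worker buffers for this cell ...
    Ferrite.cellnodes!(cc.nodes, cc.grid, cellid)
    Ferrite.getcoordinates!(cc.coords, cc.grid, cellid)
    Ferrite.celldofs!(cc.dofs, cc.sdh, cellid)
    # ... and return a new immutable instance carrying the cellid, avoiding a
    # global-memory round trip to recover it later.
    return ImmutableCellCache(cc.flags, cc.grid, cellid, cc.nodes, cc.coords, cc.sdh, cc.dofs)
end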

Comment thread src/soa_utils.jl
Comment thread docs/src/howto/gpu_heat_howto_literate.jl
Comment thread ext/FerriteCudaExt.jl
Comment thread ext/FerriteKAExt/soa_core.jl Outdated
Comment thread ext/FerriteKAExt/adapt_core.jl Outdated
@termi-official
Member Author

termi-official commented Apr 16, 2026

Thanks for taking the time for a detailed review!

1. I still think one container type should be enough, e.g. `SoAContainer`. Whether we should have `distribute_to_tasks` I'm not sure about, but it is nice not to be bound to a given type but rather have a function that creates the suitable container.

But this is not how Ferrite is designed, so I stuck with the existing style to minimize surprise and stay consistent in the design. Note that we are always explicit about the types: we have QuadratureRule, not create_quadrature_rule; CellValues and FacetValues, not create_fevalues; etc.

2. In general I think we need to provide new types for those that we cannot `reinit!` as usual then (if the additional memory pressure of having a global ref-value in a shared array degrades performance), which means we need the `ImmutableCellCache` and equivalently `ImmutableFacetValues`. For the naming I'm not a real fan, since their contents are mutated, so it gives a false sense of safety that they are not mutated, i.e. that given `cc::ImmutableCellCache` then `cc(1)` and `cc(2)` are indeed different types. [...]

While I agree that the name is indeed not optimal, the last statement is not true. cc(1) returns an ImmutableCellCache. as_structure_of_arrays returns an ImmutableCellCache, and in the SoAContainer design it would return an SoAContainer. This is indeed different from CellValues, for the technical reasons discussed before. However, I do not understand how using the SoAContainer fixes the reinit! issue.

3. While the `cc(cellid)` is neat, I think it would be better to have `cc_i = reinit(cc, cellid)` instead. I also don't think we should define `reinit!(cc, cellid)`, since this only does a partial reinitialization which can cause hard-to-catch bugs for users.

Agreed on the latter. However, note that your proposal for reinit is a variant of the current design, where I decided to overload the function call instead. We can go for the reinit function call to be more consistent, though.

4. Seems like `dof_range` cannot be used at the moment, so an element routine for coupled problems would need to change. However, that can be fixed by a named tuple, see #1319.

If I didn't mess up, it works. But I get your point. I have adopted a variant where we preserve both dispatches, by index and by symbol.

5. It is not clear to me how this would work for a grid with mixed cell types, since we need type-stability on the GPU. I think we should consider this from the beginning, as this might imply that we need a `SubGrid` structure as well?

I have no good idea yet how this can be resolved without diverging too much from the original Ferrite design. This part has always been a PITA for me. I always partition the grid upfront (see https://github.com/JuliaHealth/Thunderbolt.jl/blob/001c324fb492422253deef9978a59255c1c94b8a/src/mesh/simple_meshes.jl#L58-L60) to sidestep the issue that the cells array is a potential junk box.

6. Seems like the assembler code and definition of `GPUConstraintHandler` should be moved to the KA-extension? But I haven't had time to go through those parts in details yet.

Yes, but I still need to port these to KernelAbstractions (note that they use CUDA kernels right now).

7. For the howto, I think this should be a tutorial, and then limited to a single method. Alternatives, such as a CUDA-specific kernel, can be put in a howto, or alternatively in a collapsible block. We should really provide an MWE for the users as the first tutorial. The same applies to the matrix-free approach: since we don't use it, we shouldn't introduce it here IMO. But we could add a how-to for solving with iterative solvers instead.

I do not agree on this point. https://diataxis.fr/how-to-guides/ states

Every how-to guide should answer to a human project, in other words. It should show what the human needs to do, with the tools at hand, to obtain the result they need.

and below

A tutorial’s purpose is to help the pupil acquire basic competence. A how-to guide’s purpose is to help the already-competent user perform a particular task correctly.

The heat equation tutorial gives the basic competence, while the newly added piece teaches them how to do GPU assembly correctly. Does that make sense?

@termi-official
Member Author

I played around a bit with subdomains, but these are really tricky.

If we say each subdofhandler has one subgrid, then we either lose the global indexing, or we need to use a sparse vector, or we have dense vectors with dead elements (e.g. all nodes have index zero). The sparse vector seems to introduce some unnecessary overhead for the potential bounds checks. If we go with some local indexing instead, this could also be slightly confusing for users, as the reinit! call will then run on the local index, but the cellid call should still return the global index for correctness reasons (as users might use the cell index for per-element data).

@KnutAM
Member

KnutAM commented Apr 21, 2026

  • SoAContainer: The idea of having a SoAContainer is to have a single type that acts as the AbstractVector, with the element type being the structure with views into the arrays, instead of introducing a new type for every item type. So I think of this as having Vector{T} instead of VectorOfCellValues, VectorOfFacetValues, etc.

  • distribute_to_tasks(backend, item, n_workers) is about taking an initial item and distributing it to each task in a suitable way. This is separate from the SoAContainer, but it would return an SoAContainer if we use that. As noted above, I think this is optional (but it provides a nice infrastructure for distributing the items in the "right" way).

  • cc(cellid) - sorry, there was an important typo here :D

    given cc::ImmutableCellCache then cc(1) and cc(2) are indeed different types.

    I meant to write that cc(1) and cc(2) return different objects, but that these have references to the same underlying arrays, which are modified by the indexing. Which to me seems very surprising for a user. I think it is natural to assume that cc1 = cc(1) and cc2 = cc(2) are independent. Sorry about that confusion!

as_structure_of_arrays returns an ImmutableCellCache, and in the SoAContainer design it would return an SoAContainer.

This is how I see the diff, so the SoAContainer doesn't affect the reinit; at that point you've already indexed into the container to get the view-representation for the ImmutableCellCache.

- caches = CellCacheContainer(backend, dh, nworkers)
+ caches = SoAContainer(backend, ImmutableCellCache(dh), nworkers)

cc_task = caches[taskid]::ImmutableCellCache

- cc_i = cc_task(cellid)
+ cc_i = reinit(cc_task, cellid)

I don't know a better name though.


Development

Successfully merging this pull request may close these issues.

GPU assembly example
