GPU-native manifold operations via CUDA extension #856

Closed
zazabap wants to merge 3 commits into JuliaManifolds:master from zazabap:cuda-gpu-extension

Conversation

@zazabap

@zazabap zazabap commented Feb 20, 2026

Summary

Add GPU-native implementations for matrix manifold operations (keeping all computation on the GPU with zero CPU fallbacks) and a comprehensive test suite (47 tests, all verified passing on an RTX 3090).

Extension: ext/ManifoldsCUDAExt.jl

GPU-native overrides

| Override | Why needed | GPU implementation |
| --- | --- | --- |
| `exp!(GeneralUnitaryMatrices, q, p, X)` | stdlib `exp!` calls LAPACK (CPU-only) | Hermitian eigendecomposition via cuSOLVER `heevd!`: for skew-Hermitian X, exp(X) = V·Diag(exp(-iλ))·V' |
| `log_safe!(Y, A::CuArray)` | `schur()` is CPU-only | General eigendecomposition via cuSOLVER `geev!`: log(A) = V·Diag(log(λ))·V⁻¹ |
| `retract_qr_fused!(GeneralUnitaryMatrices, ...)` | QR wrapper types cause scalar indexing | Explicit `CuArray(qr.Q)` materialization + diagonal sign correction |
| `retract_qr_fused!(Grassmann, ...)` | Same QR wrapper issue | Same approach |
| `rand!(UnitaryMatrices, ...)` | `randn` → QR on GPU needs explicit materialization | `CUDA.randn` + QR with sign correction |

All computation stays entirely on GPU — zero CPU fallbacks.
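
The QR materialization and sign-correction pattern referred to above looks roughly like this (a minimal sketch with illustrative names, assuming a square input as for GeneralUnitaryMatrices; not the extension's exact code):

```julia
using CUDA, LinearAlgebra

function qr_with_sign_correction(A::CuMatrix)
    F = qr(A)
    Q = CuArray(F.Q)                   # materialize Q explicitly; leaving the
                                       # wrapper type triggers scalar indexing
    s = sign.(diag(F.R))               # diagonal signs of R, computed on the GPU
    s = ifelse.(iszero.(s), one(eltype(s)), s)  # map zero diagonal entries to +1
    return Q .* transpose(s)           # broadcasted column-wise sign flip,
                                       # instead of Q * Diagonal(s)
end
```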

Manifold GPU compatibility

| Manifold | GPU-ready operations | Known limitations |
| --- | --- | --- |
| Euclidean | ✅ `exp!`, `project!`, `inner`, `norm`, `retract!` | `distance()` uses `@simd` → scalar indexing |
| Sphere | ✅ `exp!`, `project!`, `inner`, `norm` | `log!` special cases use scalar indexing |
| UnitaryMatrices | ✅ `exp!`, `log`, `project!`, QR retraction, `rand!` | Polar retraction not supported |
| Grassmann | ✅ `exp!`, `project!`, QR retraction | |
| PowerManifold | ✅ All ops via `NestedPowerRepresentation` | `ArrayPowerRepresentation` fails |
| Rotations | ❌ | `schur()` CPU-only |
| SPD | ❌ | `eigen()` in `SPDPoint` constructor CPU-only |
| Hyperbolic | ❌ | `minkowski_metric` scalar indexing |

GPU-native implementation details

Matrix exponential (exp! for GeneralUnitaryMatrices)

For skew-Hermitian tangent vector X (where X' = -X):

```
iX = im · X  →  Hermitian
F = eigen(Hermitian(iX))  →  cuSOLVER heevd! (entirely on GPU)
exp(X) = F.vectors · Diag(exp(-im · F.values)) · F.vectors'
```

For real matrices: promote to complex, compute, take real part.
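
A minimal sketch of this recipe (assuming, as stated above, that `eigen` on a Hermitian CuMatrix dispatches to cuSOLVER's `heevd!`; the function names are illustrative):

```julia
using CUDA, LinearAlgebra

# exp(X) for skew-Hermitian X (X' == -X): iX is Hermitian, so its
# eigendecomposition has real eigenvalues and a unitary eigenbasis.
function exp_skewhermitian(X::CuMatrix{<:Complex})
    F = eigen(Hermitian(im .* X))       # Hermitian eigendecomposition on the GPU
    V, λ = F.vectors, F.values          # λ is real and stays on the device
    return V * Diagonal(exp.(-im .* λ)) * V'
end

# Real input: promote to complex, compute, take the real part.
exp_skewsymmetric(X::CuMatrix{<:Real}) = real.(exp_skewhermitian(complex.(X)))
```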

Matrix logarithm (log_safe!)

```
F = eigen(A)  →  cuSOLVER geev! (entirely on GPU)
log(A) = F.vectors · Diag(log.(F.values)) · inv(F.vectors)
```

For real matrices: promote to complex, compute, take real part.
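
A corresponding sketch (illustrative names; it assumes A is diagonalizable with eigenvalues off the negative real axis, and solves against V rather than forming the inverse explicitly):

```julia
using CUDA, LinearAlgebra

function log_via_eigen(A::CuMatrix{<:Complex})
    F = eigen(A)                        # general eigendecomposition (geev! per the PR)
    V, λ = F.vectors, F.values
    return (V * Diagonal(log.(λ))) / V  # … · V⁻¹ via a linear solve
end

# Real input: promote to complex, compute, take the real part.
log_via_eigen(A::CuMatrix{<:Real}) = real.(log_via_eigen(complex.(A)))
```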

Tests: 47/47 verified passing

| Test group | Count | What's tested |
| --- | --- | --- |
| Euclidean Float64 (vector) | 6 | `exp!`, `retract!`, `project!`, `inner`, `norm` |
| Euclidean Float32 | 3 | Same ops with `CuArray{Float32}` |
| Euclidean matrix (n×n) | 4 | Matrix-shaped Euclidean |
| Sphere | 5 | `exp!`, `project!` (point + vector), `inner`, `norm` |
| UnitaryMatrices(2) | 8 | GPU-native `exp!`, `project!`, QR retraction, `log` |
| Grassmann(4,2) | 5 | `exp!`, `project!` (point + vector), QR retraction |
| PowerManifold | 3 | Nested CuArray allocation + `exp!` |
| CPU-vs-GPU equivalence | 13 | Every GPU result compared to CPU: `isapprox(Array(gpu), cpu)` |

All tests gracefully skip when CUDA is not available.
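
The gating and the equivalence pattern from the last table row look roughly like this (a sketch with an illustrative manifold and points, not the suite's exact code):

```julia
using Test, CUDA, Manifolds

if CUDA.functional()
    M = Sphere(2)
    p = [1.0, 0.0, 0.0]
    X = [0.0, 0.5, 0.0]

    q_cpu = exp(M, p, X)                    # CPU reference result
    q_gpu = exp(M, CuArray(p), CuArray(X))  # same operation on the GPU
    @test isapprox(Array(q_gpu), q_cpu)     # CPU-vs-GPU equivalence check
else
    @info "CUDA not functional, skipping GPU tests"
end
```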

Related PRs

zazabap and others added 3 commits February 16, 2026 03:33
Add ManifoldsCUDAExt that provides GPU-compatible overrides for:
- QR retraction on GeneralUnitaryMatrices: avoids scalar indexing by
  extracting Q/R as dense CuArrays and using broadcasting for sign
  correction instead of Diagonal matrix multiply
- rand! for UnitaryMatrices: uses CUDA.randn for on-GPU random generation
  instead of CPU randn + transfer
- log_safe! for real matrices on GPU: routes through complex path since
  base implementation calls convert(Matrix, ...) forcing CPU

Primary targets are UnitaryMatrices(n), PowerManifold, and ProductManifold
as used by ParametricDFT.jl for quantum circuit optimization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test GPU operations on Euclidean, Sphere, and UnitaryMatrices(2):
- exp, retract (polar, QR), log, project, inner, norm
- Float32 and ComplexF64 element types
- PowerManifold with nested CuArrays
Tests are conditional on CUDA.functional().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace CPU-fallback exp! with GPU-native Hermitian eigendecomposition
(for skew-Hermitian X: exp(X) = V*Diag(exp(-i*lambda))*V' via heevd!).
Replace CPU-fallback log_safe! with general eigendecomposition via geev!.
Add Grassmann QR retraction override for GPU. Trim docstrings.

47 tests: Euclidean (Float64/Float32, vector/matrix), Sphere,
UnitaryMatrices (exp, project, QR retraction, log), Grassmann
(exp, project, QR retraction), PowerManifold nested.
Zero CPU fallbacks — all computation stays on GPU.
@codecov

codecov Bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 0% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.13%. Comparing base (da714af) to head (11e2992).
⚠️ Report is 8 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ext/ManifoldsCUDAExt.jl | 0.00% | 54 Missing ⚠️ |

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master     #856      +/-   ##
==========================================
- Coverage   99.96%   99.13%   -0.84%     
==========================================
  Files          98       99       +1     
  Lines        9681     9735      +54     
==========================================
- Hits         9678     9651      -27     
- Misses          3       84      +81     
```
@mateuszbaran
Member

Thanks, this is really useful. For log_safe! and exp on GeneralUnitaryMatrices I think this is the right approach. I'd prefer rand! and retract_qr_fused! to be done a bit differently -- the problem seems to be Array(qrfac.Q); I will make a separate PR for that myself.

@mateuszbaran
Member

I don't really have a good solution for CI and code coverage here. We could probably `# COV_EXCL_LINE` the entire CUDA extension until we find a place to host test runners with GPU. @kellertuer, what do you think?

@mateuszbaran
Member

> Rotations ❌ schur() CPU-only

IIRC Rotations only uses schur for get_basis_diagonalizing? Everything else should work fine on GPU.

> SPD ❌ eigen() in SPDPoint constructor CPU-only

What's the problem here? eigen is supported on GPU.

> Hyperbolic ❌ minkowski_metric scalar indexing

Yes, this is going to be problematic. There are workarounds for this pattern of scalar indexing but they might be too slow to bring any real benefit.
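
One such workaround (hypothetical, not from this PR) folds the sign of the last coordinate into a broadcast mask so no scalar indexing occurs; as noted, the extra temporary and full reduction may eat most of the benefit:

```julia
using CUDA

# ⟨a, b⟩ = a₁b₁ + … + aₙ₋₁bₙ₋₁ − aₙbₙ without reading a[end]/b[end] as scalars
function minkowski_metric_broadcast(a::CuVector{T}, b::CuVector{T}) where {T}
    s = CUDA.ones(T, length(a))
    s[end:end] .= -one(T)       # vectorized write to a 1-element view, no scalar indexing
    return sum(s .* a .* b)     # one fused broadcast plus a device-side reduction
end
```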

@zazabap
Author

zazabap commented Feb 21, 2026

> > Rotations ❌ schur() CPU-only
>
> IIRC Rotations only uses schur for get_basis_diagonalizing? Everything else should work fine on GPU.

I will double-check this point. CPU-GPU fallback would be costly and may even slow down the process.

> > SPD ❌ eigen() in SPDPoint constructor CPU-only
>
> What's the problem here? eigen is supported on GPU.

Thanks for pointing this out; I will implement it shortly.

> > Hyperbolic ❌ minkowski_metric scalar indexing
>
> Yes, this is going to be problematic. There are workarounds for this pattern of scalar indexing but they might be too slow to bring any real benefit.

For minkowski_metric I am also investigating possible solutions.

@kellertuer
Member

> I don't really have a good solution for CI and code coverage here. We could probably `# COV_EXCL_LINE` the entire CUDA extension until we find a place to host test runners with GPU. @kellertuer, what do you think?

No, that would be cheating. Total cheating. Then we could also just do that on all our code and call it a day.

What do other packages with CUDA stuff do? We are surely not the first package doing CUDA stuff in the Julia ecosystem.

@mateuszbaran
Member

mateuszbaran commented Feb 23, 2026

> What do other packages with CUDA stuff do? We are surely not the first package doing CUDA stuff in the Julia ecosystem.

Well, there are many packages that run CI with CUDA, but they either use paid or self-hosted solutions, or simply don't care about properly covering CUDA code.

After a discussion on Slack, it seems that for now the solution would be to put the CUDA code in a separate package.

@kellertuer
Member

Yes, sorry that we cannot find a better solution for now that we both feel comfortable with.

One other solution that I do see (but which is beyond my knowledge and capacity) is: if we base this approach on a general interface (providing support for CUDA/AMD/Metal/...?) that has a "CPUDummyType" to test against, that would again be fine with me, because then testing and validity would be on the interface side (as long as our code works on the CPU dummy).

But again, I do not have the capacity myself to read up on this or to discover and learn all the necessary details.

@mateuszbaran
Member

Yes, there are existing array backends that behave like GPU arrays but actually run on the CPU. Part of the problem is that different backends vary in capabilities to a degree, so it's not equivalent to having proper coverage.

@mateuszbaran
Member

@zazabap could you contribute the new methods for log_safe! and exp on GeneralUnitaryMatrices to https://github.com/JuliaManifolds/ManifoldsGPU.jl? I think keeping the CUDA stuff there is the best approach short-term. It will be much easier to quickly iterate on the code in a new repository 🙂

@mateuszbaran
Member

mateuszbaran commented Feb 23, 2026

> ArrayPowerRepresentation fails

I think this is roughly one of the most important parts of the entire GPU-enabling work. PowerManifold is exactly the use case where GPU parallelism has the largest potential for benefits. Unfortunately, I don't see simple solutions here. Python and Matlab libraries tend to bake "power manifold" support into specific manifolds, which is not ideal when you just need one copy, but scales better than our approach.

Probably the right approach would be roughly:

  1. Introduce CUDAPowerRepresentation
  2. For each relevant manifold `M`, implement `log!`, `exp!`, etc. on `PowerManifold(M, CUDAPowerRepresentation(), dim)`.

Probably the number of manifolds relevant to (2) is fairly low (just Stiefel, Grassmann, UnitaryMatrices should cover the most common use cases).
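
A hypothetical sketch of what steps (1)-(2) could look like (the name CUDAPowerRepresentation and the storage layout are assumptions for illustration, not existing ManifoldsGPU.jl code):

```julia
using Manifolds, CUDA

# (1) a representation tag meaning "all copies live in one contiguous 3-D CuArray"
struct CUDAPowerRepresentation <: ManifoldsBase.AbstractPowerRepresentation end

# 1000 copies of Stiefel(5, 2); points would be stored as 5×2×1000 CuArrays
M = PowerManifold(Stiefel(5, 2), CUDAPowerRepresentation(), 1000)

# (2) would then add, per manifold, batched methods along the lines of
#     exp!(::PowerManifold{…,<:Stiefel,…}, q, p, X) for this representation,
# replacing the per-copy Julia loop with strided-batched cuBLAS/cuSOLVER calls.
```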

@mateuszbaran
Member

> I think this is roughly one of the most important parts of the entire GPU-enabling work. PowerManifold is exactly the use case where GPU parallelism has the largest potential for benefits. Unfortunately, I don't see simple solutions here. Python and Matlab libraries tend to bake "power manifold" support into specific manifolds, which is not ideal when you just need one copy, but scales better than our approach.

I've sketched this approach for exp! on Stiefel and GPT-5.3-Codex handled it quite well (well, the third version was fine): JuliaManifolds/ManifoldsGPU.jl@15da7e3 . I've added the CPU implementation and CUDA.jl sources for context.

@mateuszbaran
Member

I've also added two retractions on the power manifold of Stiefel -- QR is largely a failure due to the lack of orgqr! batching, but polar works well. I think this should illustrate how to handle harder cases of GPU parallelization.

@zazabap
Author

zazabap commented Mar 6, 2026

@mateuszbaran Thanks for the example code! I tried to update the Euclidean one, which should be the most straightforward baseline. I also added JLArray for a numerical correctness check. Please let me know whether the addition is appropriate and what I should change.

I think the manifolds in Manifolds.jl and ManifoldsGPU.jl could have a one-to-one mapping, which would make it easy to check the correctness of the GPU overrides. I am also referring to the code here for adding other manifold structures later: https://github.com/JuliaManifolds/Manifolds.jl/tree/master/src/manifolds

If necessary, I can also make a checklist to track implementation progress. JuliaManifolds/ManifoldsGPU.jl#2
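
For reference, the JLArray check mentioned above can be as simple as this sketch (illustrative manifold and data; JLArray from JLArrays.jl is a CPU-backed reference GPU array, so it exercises the generic GPU code paths without requiring a device):

```julia
using JLArrays, Manifolds, Test

M = Euclidean(3)
p = [1.0, 2.0, 3.0]
X = [0.1, 0.2, 0.3]

q_ref = exp(M, p, X)                     # plain-Array reference result
q_jl  = exp(M, JLArray(p), JLArray(X))   # takes the same generic path a CuArray would
@test isapprox(Array(q_jl), q_ref)
```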

@mateuszbaran
Member

Yes, Euclidean is the most straightforward, but also the least useful to have, as explained in my comment on the new PR. I think going over each manifold is premature at this point, and sometimes even unnecessary. Let's focus on a few of the most useful ones first.

@zazabap
Author

zazabap commented Mar 6, 2026

OK, I will add the CUDA override implementations based on UnitaryMatrices, Stiefel, and Grassmann.

@zazabap
Author

zazabap commented Mar 19, 2026

The implementation and discussion have shifted to ManifoldsGPU.jl; JuliaManifolds/ManifoldsGPU.jl#5 will record the progress, missing pieces, and potential algorithms to be integrated.

@zazabap zazabap closed this Mar 19, 2026