GPU-native manifold operations via CUDA extension #856

Closed
zazabap wants to merge 3 commits into JuliaManifolds:master from zazabap:cuda-gpu-extension

Conversation

@zazabap

@zazabap zazabap commented Feb 20, 2026

Summary

Add GPU-native implementations for matrix manifold operations (keeping all computation on the GPU with zero CPU fallbacks) and a comprehensive test suite (47 tests, all verified passing on an RTX 3090).

Extension: ext/ManifoldsCUDAExt.jl

GPU-native overrides

| Override | Why needed | GPU implementation |
| --- | --- | --- |
| `exp!(GeneralUnitaryMatrices, q, p, X)` | stdlib `exp!` calls LAPACK (CPU-only) | Hermitian eigendecomposition via cuSOLVER `heevd!`: for skew-Hermitian X, exp(X) = V·Diag(exp(-iλ))·V' |
| `log_safe!(Y, A::CuArray)` | `schur()` is CPU-only | General eigendecomposition via cuSOLVER `geev!`: log(A) = V·Diag(log(λ))·V⁻¹ |
| `retract_qr_fused!(GeneralUnitaryMatrices, ...)` | QR wrapper types cause scalar indexing | Explicit `CuArray(qr.Q)` materialization + diagonal sign correction |
| `retract_qr_fused!(Grassmann, ...)` | Same QR wrapper issue | Same approach |
| `rand!(UnitaryMatrices, ...)` | `randn` → QR on GPU needs explicit materialization | `CUDA.randn` + QR with sign correction |

All computation stays entirely on GPU — zero CPU fallbacks.
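
The QR materialization and sign-correction pattern referred to above looks roughly like this (a minimal sketch with illustrative names, assuming a square input as for GeneralUnitaryMatrices; not the extension's exact code):

```julia
using CUDA, LinearAlgebra

function qr_with_sign_correction(A::CuMatrix)
    F = qr(A)
    Q = CuArray(F.Q)                   # materialize Q explicitly; leaving the
                                       # wrapper type triggers scalar indexing
    s = sign.(diag(F.R))               # diagonal signs of R, computed on the GPU
    s = ifelse.(iszero.(s), one(eltype(s)), s)  # map zero diagonal entries to +1
    return Q .* transpose(s)           # broadcasted column-wise sign flip,
                                       # instead of Q * Diagonal(s)
end
```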

Manifold GPU compatibility

| Manifold | GPU-ready operations | Known limitations |
| --- | --- | --- |
| Euclidean | ✅ `exp!`, `project!`, `inner`, `norm`, `retract!` | `distance()` uses `@simd` → scalar indexing |
| Sphere | ✅ `exp!`, `project!`, `inner`, `norm` | `log!` special cases use scalar indexing |
| UnitaryMatrices | ✅ `exp!`, `log`, `project!`, QR retraction, `rand!` | Polar retraction not supported |
| Grassmann | ✅ `exp!`, `project!`, QR retraction | |
| PowerManifold | ✅ All ops via `NestedPowerRepresentation` | `ArrayPowerRepresentation` fails |
| Rotations | ❌ | `schur()` CPU-only |
| SPD | ❌ | `eigen()` in `SPDPoint` constructor CPU-only |
| Hyperbolic | ❌ | `minkowski_metric` scalar indexing |

GPU-native implementation details

Matrix exponential (exp! for GeneralUnitaryMatrices)

For skew-Hermitian tangent vector X (where X' = -X):

```
iX = im · X  →  Hermitian
F = eigen(Hermitian(iX))  →  cuSOLVER heevd! (entirely on GPU)
exp(X) = F.vectors · Diag(exp(-im · F.values)) · F.vectors'
```

For real matrices: promote to complex, compute, take real part.
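
A minimal sketch of this recipe (assuming, as stated above, that `eigen` on a Hermitian CuMatrix dispatches to cuSOLVER's `heevd!`; the function names are illustrative):

```julia
using CUDA, LinearAlgebra

# exp(X) for skew-Hermitian X (X' == -X): iX is Hermitian, so its
# eigendecomposition has real eigenvalues and a unitary eigenbasis.
function exp_skewhermitian(X::CuMatrix{<:Complex})
    F = eigen(Hermitian(im .* X))       # Hermitian eigendecomposition on the GPU
    V, λ = F.vectors, F.values          # λ is real and stays on the device
    return V * Diagonal(exp.(-im .* λ)) * V'
end

# Real input: promote to complex, compute, take the real part.
exp_skewsymmetric(X::CuMatrix{<:Real}) = real.(exp_skewhermitian(complex.(X)))
```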

Matrix logarithm (log_safe!)

```
F = eigen(A)  →  cuSOLVER geev! (entirely on GPU)
log(A) = F.vectors · Diag(log.(F.values)) · inv(F.vectors)
```

For real matrices: promote to complex, compute, take real part.
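
A corresponding sketch (illustrative names; it assumes A is diagonalizable with eigenvalues off the negative real axis, and solves against V rather than forming the inverse explicitly):

```julia
using CUDA, LinearAlgebra

function log_via_eigen(A::CuMatrix{<:Complex})
    F = eigen(A)                        # general eigendecomposition (geev! per the PR)
    V, λ = F.vectors, F.values
    return (V * Diagonal(log.(λ))) / V  # … · V⁻¹ via a linear solve
end

# Real input: promote to complex, compute, take the real part.
log_via_eigen(A::CuMatrix{<:Real}) = real.(log_via_eigen(complex.(A)))
```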

Tests: 47/47 verified passing

| Test group | Count | What's tested |
| --- | --- | --- |
| Euclidean Float64 (vector) | 6 | `exp!`, `retract!`, `project!`, `inner`, `norm` |
| Euclidean Float32 | 3 | Same ops with `CuArray{Float32}` |
| Euclidean matrix (n×n) | 4 | Matrix-shaped Euclidean |
| Sphere | 5 | `exp!`, `project!` (point + vector), `inner`, `norm` |
| UnitaryMatrices(2) | 8 | GPU-native `exp!`, `project!`, QR retraction, `log` |
| Grassmann(4,2) | 5 | `exp!`, `project!` (point + vector), QR retraction |
| PowerManifold | 3 | Nested CuArray allocation + `exp!` |
| CPU-vs-GPU equivalence | 13 | Every GPU result compared to CPU: `isapprox(Array(gpu), cpu)` |

All tests gracefully skip when CUDA is not available.
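
The gating and the equivalence pattern from the last table row look roughly like this (a sketch with an illustrative manifold and points, not the suite's exact code):

```julia
using Test, CUDA, Manifolds

if CUDA.functional()
    M = Sphere(2)
    p = [1.0, 0.0, 0.0]
    X = [0.0, 0.5, 0.0]

    q_cpu = exp(M, p, X)                    # CPU reference result
    q_gpu = exp(M, CuArray(p), CuArray(X))  # same operation on the GPU
    @test isapprox(Array(q_gpu), q_cpu)     # CPU-vs-GPU equivalence check
else
    @info "CUDA not functional, skipping GPU tests"
end
```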

Related PRs

zazabap and others added 3 commits February 16, 2026 03:33
Add ManifoldsCUDAExt that provides GPU-compatible overrides for:
- QR retraction on GeneralUnitaryMatrices: avoids scalar indexing by
  extracting Q/R as dense CuArrays and using broadcasting for sign
  correction instead of Diagonal matrix multiply
- rand! for UnitaryMatrices: uses CUDA.randn for on-GPU random generation
  instead of CPU randn + transfer
- log_safe! for real matrices on GPU: routes through complex path since
  base implementation calls convert(Matrix, ...) forcing CPU

Primary targets are UnitaryMatrices(n), PowerManifold, and ProductManifold
as used by ParametricDFT.jl for quantum circuit optimization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test GPU operations on Euclidean, Sphere, and UnitaryMatrices(2):
- exp, retract (polar, QR), log, project, inner, norm
- Float32 and ComplexF64 element types
- PowerManifold with nested CuArrays
Tests are conditional on CUDA.functional().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace CPU-fallback exp! with GPU-native Hermitian eigendecomposition
(for skew-Hermitian X: exp(X) = V*Diag(exp(-i*lambda))*V' via heevd!).
Replace CPU-fallback log_safe! with general eigendecomposition via geev!.
Add Grassmann QR retraction override for GPU. Trim docstrings.

47 tests: Euclidean (Float64/Float32, vector/matrix), Sphere,
UnitaryMatrices (exp, project, QR retraction, log), Grassmann
(exp, project, QR retraction), PowerManifold nested.
Zero CPU fallbacks — all computation stays on GPU.
@codecov

codecov Bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 0% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 99.13%. Comparing base (da714af) to head (11e2992).
⚠️ Report is 8 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ext/ManifoldsCUDAExt.jl | 0.00% | 54 Missing ⚠️ |

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master     #856      +/-   ##
==========================================
- Coverage   99.96%   99.13%   -0.84%     
==========================================
  Files          98       99       +1     
  Lines        9681     9735      +54     
==========================================
- Hits         9678     9651      -27     
- Misses          3       84      +81     
```
@mateuszbaran
Member

Thanks, this is really useful. For log_safe! and exp on GeneralUnitaryMatrices I think this is the right approach. I'd prefer rand! and retract_qr_fused! to be done a bit differently -- the problem seems to be Array(qrfac.Q); I will make a separate PR for that myself.

@mateuszbaran
Member

I don't really have a good solution for CI and code coverage here. We could probably `# COV_EXCL_LINE` the entire CUDA extension until we find a place to host test runners with GPU. @kellertuer, what do you think?

@mateuszbaran
Member

> Rotations ❌ schur() CPU-only

IIRC Rotations only uses schur for get_basis_diagonalizing? Everything else should work fine on GPU.

> SPD ❌ eigen() in SPDPoint constructor CPU-only

What's the problem here? eigen is supported on GPU.

> Hyperbolic ❌ minkowski_metric scalar indexing

Yes, this is going to be problematic. There are workarounds for this pattern of scalar indexing but they might be too slow to bring any real benefit.
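
One such workaround (hypothetical, not from this PR) folds the sign of the last coordinate into a broadcast mask so no scalar indexing occurs; as noted, the extra temporary and full reduction may eat most of the benefit:

```julia
using CUDA

# ⟨a, b⟩ = a₁b₁ + … + aₙ₋₁bₙ₋₁ − aₙbₙ without reading a[end]/b[end] as scalars
function minkowski_metric_broadcast(a::CuVector{T}, b::CuVector{T}) where {T}
    s = CUDA.ones(T, length(a))
    s[end:end] .= -one(T)       # vectorized write to a 1-element view, no scalar indexing
    return sum(s .* a .* b)     # one fused broadcast plus a device-side reduction
end
```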

@zazabap
Author

zazabap commented Feb 21, 2026

> > Rotations ❌ schur() CPU-only
>
> IIRC Rotations only uses schur for get_basis_diagonalizing? Everything else should work fine on GPU.

I will double-check this point. CPU-GPU fallback would be costly and may even slow down the process.

> > SPD ❌ eigen() in SPDPoint constructor CPU-only
>
> What's the problem here? eigen is supported on GPU.

Thanks for pointing this out; I will implement it shortly.

> > Hyperbolic ❌ minkowski_metric scalar indexing
>
> Yes, this is going to be problematic. There are workarounds for this pattern of scalar indexing but they might be too slow to bring any real benefit.

For minkowski_metric I am also investigating possible solutions.

@kellertuer
Member

> I don't really have a good solution for CI and code coverage here. We could probably `# COV_EXCL_LINE` the entire CUDA extension until we find a place to host test runners with GPU. @kellertuer, what do you think?

No, that would be cheating. Total cheating. Then we could also just do that on all our code and call it a day.

What do other packages with CUDA stuff do? We are surely not the first package doing CUDA stuff in the Julia ecosystem.

@mateuszbaran
Member

mateuszbaran commented Feb 23, 2026

> What do other packages with CUDA stuff do? We are surely not the first package doing CUDA stuff in the Julia ecosystem.

Well, there are many packages that run CI with CUDA, but they either use paid or self-hosted solutions, or simply don't care about properly covering CUDA code.

After a discussion on Slack, it seems that for now the solution would be to put the CUDA code in a separate package.

@kellertuer
Member

Yes, sorry that we cannot find a better solution for now that we both feel comfortable with.

One other solution that I do see (but which is beyond my knowledge and capacity) is: if we base this approach on a general interface (providing support for CUDA/AMD/Metal/...?) that has a "CPUDummyType" to test against, that would again be fine with me, because then testing and validity would be on the interface side (as long as our code works on the CPU dummy).

But again, I do not have the capacity myself to read up on this or to discover and learn all the necessary details.

@mateuszbaran
Member

Yes, there are existing array backends that behave like GPU arrays but actually run on the CPU. Part of the problem is that different backends vary in capabilities to a degree, so it's not equivalent to having proper coverage.

@mateuszbaran
Member

@zazabap could you contribute the new methods for log_safe! and exp on GeneralUnitaryMatrices to https://github.com/JuliaManifolds/ManifoldsGPU.jl? I think keeping the CUDA stuff there is the best approach short-term. It will be much easier to quickly iterate on the code in a new repository 🙂

@mateuszbaran
Member

mateuszbaran commented Feb 23, 2026

> ArrayPowerRepresentation fails

I think this is roughly one of the most important parts of the entire GPU-enabling work. PowerManifold is exactly the use case where GPU parallelism has the largest potential for benefits. Unfortunately, I don't see simple solutions here. Python and Matlab libraries tend to bake "power manifold" support into specific manifolds, which is not ideal when you just need one copy, but scales better than our approach.

Probably the right approach would be roughly:

  1. Introduce CUDAPowerRepresentation
  2. For each relevant manifold `M`, implement `log!`, `exp!`, etc. on `PowerManifold(M, CUDAPowerRepresentation(), dim)`.

Probably the number of manifolds relevant to (2) is fairly low (just Stiefel, Grassmann, UnitaryMatrices should cover the most common use cases).
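
A hypothetical sketch of what steps (1)-(2) could look like (the name CUDAPowerRepresentation and the storage layout are assumptions for illustration, not existing ManifoldsGPU.jl code):

```julia
using Manifolds, CUDA

# (1) a representation tag meaning "all copies live in one contiguous 3-D CuArray"
struct CUDAPowerRepresentation <: ManifoldsBase.AbstractPowerRepresentation end

# 1000 copies of Stiefel(5, 2); points would be stored as 5×2×1000 CuArrays
M = PowerManifold(Stiefel(5, 2), CUDAPowerRepresentation(), 1000)

# (2) would then add, per manifold, batched methods along the lines of
#     exp!(::PowerManifold{…,<:Stiefel,…}, q, p, X) for this representation,
# replacing the per-copy Julia loop with strided-batched cuBLAS/cuSOLVER calls.
```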

@mateuszbaran
Member

> I think this is roughly one of the most important parts of the entire GPU-enabling work. PowerManifold is exactly the use case where GPU parallelism has the largest potential for benefits. Unfortunately, I don't see simple solutions here. Python and Matlab libraries tend to bake "power manifold" support into specific manifolds, which is not ideal when you just need one copy, but scales better than our approach.

I've sketched this approach for exp! on Stiefel and GPT-5.3-Codex handled it quite well (well, the third version was fine): JuliaManifolds/ManifoldsGPU.jl@15da7e3 . I've added the CPU implementation and CUDA.jl sources for context.

@mateuszbaran
Member

I've also added two retractions on the power manifold of Stiefel -- QR is largely a failure due to the lack of orgqr! batching, but polar works well. I think this should illustrate how to handle harder cases of GPU parallelization.

@zazabap
Author

zazabap commented Mar 6, 2026

@mateuszbaran Thanks for the example code! I tried to update the Euclidean one, which should be the most straightforward baseline. I also added JLArray for a numerical correctness check. Please let me know whether the addition is appropriate and what I should change.

I think the manifolds in Manifolds.jl and ManifoldsGPU.jl could have a one-to-one mapping, which would make it easy to check the correctness of the GPU overrides. I am also referring to the code here for adding other manifold structures later: https://github.com/JuliaManifolds/Manifolds.jl/tree/master/src/manifolds

If necessary, I can also make a checklist to track implementation progress. JuliaManifolds/ManifoldsGPU.jl#2
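
For reference, the JLArray check mentioned above can be as simple as this sketch (illustrative manifold and data; JLArray from JLArrays.jl is a CPU-backed reference GPU array, so it exercises the generic GPU code paths without requiring a device):

```julia
using JLArrays, Manifolds, Test

M = Euclidean(3)
p = [1.0, 2.0, 3.0]
X = [0.1, 0.2, 0.3]

q_ref = exp(M, p, X)                     # plain-Array reference result
q_jl  = exp(M, JLArray(p), JLArray(X))   # takes the same generic path a CuArray would
@test isapprox(Array(q_jl), q_ref)
```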

@mateuszbaran
Member

Yes, Euclidean is the most straightforward, but also the least useful to have, as explained in my comment on the new PR. I think going over each manifold is premature at this point, and sometimes even unnecessary. Let's focus on a few of the most useful ones first.

@zazabap
Author

zazabap commented Mar 6, 2026

OK, I will add the CUDA override implementations based on UnitaryMatrices, Stiefel, and Grassmann.

@zazabap
Author

zazabap commented Mar 19, 2026

The implementation and discussion have shifted to ManifoldsGPU.jl; JuliaManifolds/ManifoldsGPU.jl#5 will record the progress, missing pieces, and potential algorithms to be integrated.

@zazabap zazabap closed this Mar 19, 2026