GPU-native manifold operations via CUDA extension#856
zazabap wants to merge 3 commits into JuliaManifolds:master
Conversation
Add ManifoldsCUDAExt that provides GPU-compatible overrides for:
- QR retraction on GeneralUnitaryMatrices: avoids scalar indexing by extracting Q/R as dense CuArrays and using broadcasting for the sign correction instead of a Diagonal matrix multiply
- rand! for UnitaryMatrices: uses CUDA.randn for on-GPU random generation instead of CPU randn plus transfer
- log_safe! for real matrices on GPU: routes through the complex path, since the base implementation calls convert(Matrix, ...), forcing a CPU round-trip

Primary targets are UnitaryMatrices(n), PowerManifold, and ProductManifold as used by ParametricDFT.jl for quantum circuit optimization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
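The broadcast-based sign correction can be illustrated without CUDA. Below is a minimal NumPy sketch (the name `qr_retract` is illustrative, not the PR's actual method): after factoring p + X, the signs of diag(R) are broadcast over the columns of Q, which replaces the Diagonal-matrix multiply that would trigger scalar indexing on a device array.

```python
import numpy as np

def qr_retract(p, X):
    """QR retraction sketch: factor p + X, then flip column signs of Q
    so that diag(R) becomes positive. The sign vector is broadcast over
    Q's columns instead of multiplying by a Diagonal matrix.
    Illustrative NumPy version, not the PR's CUDA code."""
    Q, R = np.linalg.qr(p + X)
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0   # guard against zero diagonal entries
    return Q * s      # column-wise sign flip via broadcasting
```

On GPU arrays the analogous broadcast stays on-device as a single elementwise kernel, which is the point of the override.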
Test GPU operations on Euclidean, Sphere, and UnitaryMatrices(2):
- exp, retract (polar, QR), log, project, inner, norm
- Float32 and ComplexF64 element types
- PowerManifold with nested CuArrays

Tests are conditional on CUDA.functional().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the CPU-fallback exp! with a GPU-native Hermitian eigendecomposition (for skew-Hermitian X: exp(X) = V·Diag(exp(-iλ))·V' via heevd!). Replace the CPU-fallback log_safe! with a general eigendecomposition via geev!. Add a Grassmann QR retraction override for GPU. Trim docstrings.

47 tests: Euclidean (Float64/Float32, vector/matrix), Sphere, UnitaryMatrices (exp, project, QR retraction, log), Grassmann (exp, project, QR retraction), nested PowerManifold. Zero CPU fallbacks; all computation stays on GPU.
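The heevd!-based exponential rests on a standard identity: if X is skew-Hermitian, then iX is Hermitian, so a Hermitian eigendecomposition suffices. A hedged NumPy sketch of the math (the PR itself calls CUSOLVER's heevd! on CuArrays; `exp_skew_hermitian` is an illustrative name):

```python
import numpy as np

def exp_skew_hermitian(X):
    """Matrix exponential of skew-Hermitian X via a Hermitian
    eigendecomposition. H = i*X is Hermitian (H' = -i*X' = i*X = H),
    so H = V diag(lam) V' with real lam, and
    exp(X) = exp(-i*H) = V diag(exp(-i*lam)) V'.
    CPU sketch of the heevd!-based approach described in the PR."""
    lam, V = np.linalg.eigh(1j * X)              # real eigenvalues, unitary V
    return (V * np.exp(-1j * lam)) @ V.conj().T  # V diag(e^{-i lam}) V'
```

Since |exp(-iλ)| = 1 and V is unitary, the result is unitary, which is exactly what exp! on GeneralUnitaryMatrices must produce.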
Codecov Report
Additional details and impacted files

```
@@            Coverage Diff             @@
##           master     #856      +/-   ##
==========================================
- Coverage   99.96%   99.13%   -0.84%
==========================================
  Files          98       99       +1
  Lines        9681     9735      +54
==========================================
- Hits         9678     9651      -27
- Misses          3       84      +81
```
Thanks, this is really useful.
I don't really have a good solution for CI and code coverage here. We could probably
IIRC Rotations only uses
What's the problem here?
Yes, this is going to be problematic. There are workarounds for this pattern of scalar indexing but they might be too slow to bring any real benefit.
I will double-check this point. A CPU-GPU fallback will be costly and may even slow down the process.
Thanks for pointing this out; I will implement it shortly.
For minkowski_metric I am also investigating a possible solution.
No, that would be cheating. Total cheating. Then we could also just do that on all our code and call it a day. What do other packages with CUDA stuff do? We are surely not the first package doing CUDA stuff in the Julia ecosystem.
Well, there are many packages that run CI with CUDA, but they either use paid or self-hosted solutions, or they don't properly cover their CUDA code. After a discussion on Slack, it seems that for now the solution would be to put the CUDA code in a separate package.
Yes, sorry that we have not found a better solution for now that we both feel comfortable with. One other solution that I do see (but which is beyond my knowledge and capacity) is this: if we base this approach on a general interface (providing support for CUDA/AMD/Metal/...?) that has a "CPUDummyType" to test against, that would again be fine with me, because then testing and validity would be on the interface side (as long as our code works on the CPU dummy). But again, I do not have the capacity myself to read up on this or to discover and learn all the necessary details.
Yes, there are existing GPU-like but actually CPU array backends. Part of the problem is that different backends vary in capabilities to a degree, so it's not equivalent to having proper coverage.
@zazabap could you contribute the new methods of |
I think this is one of the most important parts of the entire GPU-enabling work. PowerManifold is exactly the use case where GPU parallelism has the largest potential for benefits. Unfortunately, I don't see simple solutions here. Python and Matlab libraries tend to bake in "power manifold" support in specific manifolds, which is not ideal when you just need one copy, but scales better than our approach. Probably the right approach would be roughly:
Probably the number of manifolds relevant to (2) is fairly low (Stiefel, Grassmann, and UnitaryMatrices should cover the most common use cases).
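As a toy illustration of batching one manifold operation over a leading axis so that a single kernel serves the whole power manifold, here is a NumPy sketch on the sphere (chosen for brevity; the discussion above targets Stiefel/Grassmann/UnitaryMatrices, and `sphere_exp_batched` is a hypothetical name):

```python
import numpy as np

def sphere_exp_batched(p, X, eps=1e-12):
    """Batched exponential map on the unit sphere, vectorized over the
    leading batch axis. p, X have shape (batch, n), each X[i] tangent
    at p[i]; exp_p(X) = cos(|X|) p + sin(|X|) X / |X|.
    CPU sketch of the 'one batched kernel per power manifold' idea."""
    nrm = np.linalg.norm(X, axis=-1, keepdims=True)
    # sinc-style safe division so zero tangent vectors map back to p
    scale = np.where(nrm > eps, np.sin(nrm) / np.maximum(nrm, eps), 1.0)
    return np.cos(nrm) * p + scale * X
```

The same pattern on a CuArray would keep all batch elements in one fused broadcast instead of looping over points.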
I've sketched this approach for |
I've also added two retractions on the power manifold of Stiefel -- QR is largely a failure due to lack of
@mateuszbaran Thanks for the example code! I tried to update the Euclidean one, which should be the most straightforward baseline. I also added JLArray for numerical correctness checks. Please let me know whether the addition is appropriate and what I should change. I think the manifolds in Manifolds.jl and ManifoldsGPU.jl could have a one-to-one mapping, which would make it easy to check the correctness of the GPU overrides. I am also referring to the code here when adding other manifold structures later: https://github.com/JuliaManifolds/Manifolds.jl/tree/master/src/manifolds If necessary, I can also make a checklist for the implementation progress. JuliaManifolds/ManifoldsGPU.jl#2
Yes, |
Ok, I will add the CUDA override implementations for UnitaryMatrices, Stiefel, and Grassmannian.
The implementation and discussion have shifted to ManifoldsGPU.jl; JuliaManifolds/ManifoldsGPU.jl#5 will record the progress, missing points, and potential algorithms to be integrated.
Summary
Add GPU-native implementations for matrix manifold operations (keeping all computation on the GPU with zero CPU fallbacks) and a comprehensive test suite (47 tests, all verified passing on an RTX 3090).
Extension: ext/ManifoldsCUDAExt.jl

GPU-native overrides:
- `exp!(GeneralUnitaryMatrices, q, p, X)`: base `exp!` calls LAPACK (CPU-only); replaced with `heevd!`: for skew-Hermitian X, exp(X) = V·Diag(exp(-iλ))·V'
- `log_safe!(Y, A::CuArray)`: `schur()` is CPU-only; replaced with `geev!`: log(A) = V·Diag(log(λ))·V⁻¹
- `retract_qr_fused!(GeneralUnitaryMatrices, ...)`: `CuArray(qr.Q)` materialization plus diagonal sign correction
- `retract_qr_fused!(Grassmann, ...)`: same QR approach for Grassmann
- `rand!(UnitaryMatrices, ...)`: `randn` followed by QR on GPU needs explicit materialization; replaced with `CUDA.randn` plus QR with sign correction

All computation stays entirely on the GPU; zero CPU fallbacks.
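For the rand! override, the standard recipe is: draw a Gaussian matrix, take its QR, and correct the phases of diag(R) so the result is Haar-distributed (without the correction, the QR sign convention biases the distribution). A NumPy sketch (the PR's version uses CUDA.randn and on-device QR; `rand_unitary` is an illustrative name):

```python
import numpy as np

def rand_unitary(n, rng):
    """Random unitary sketch: complex Gaussian -> QR, then rescale
    columns of Q by the phases of diag(R). Writing A = (Q*Lam)(inv(Lam)*R)
    with Lam = diag(d/|d|) gives an R factor with positive real diagonal,
    which makes the Q factor Haar-uniform.
    CPU illustration of the rand!(UnitaryMatrices, ...) override."""
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    d = np.diag(R)
    return Q * (d / np.abs(d))   # broadcast phase correction over columns
```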
Manifold GPU compatibility

- Euclidean: `exp!`, `project!`, `inner`, `norm`, `retract!` work; `distance()` uses `@simd`, causing scalar indexing
- Sphere: `exp!`, `project!`, `inner`, `norm` work; `log!` special cases use scalar indexing
- UnitaryMatrices: `exp!`, `log`, `project!`, QR retraction, `rand!` work
- Grassmann: `exp!`, `project!`, QR retraction work
- PowerManifold: `NestedPowerRepresentation` works; `ArrayPowerRepresentation` fails
- Matrix-logarithm paths: `schur()` is CPU-only
- SymmetricPositiveDefinite: `eigen()` in the SPDPoint constructor is CPU-only
- Hyperbolic: `minkowski_metric` uses scalar indexing

GPU-native implementation details
Matrix exponential (`exp!` for GeneralUnitaryMatrices)

For a skew-Hermitian tangent vector X (where X' = -X): exp(X) = V·Diag(exp(-iλ))·V' via `heevd!`. For real matrices: promote to complex, compute, take the real part.
Matrix logarithm (`log_safe!`)

log(A) = V·Diag(log(λ))·V⁻¹ via general eigendecomposition (`geev!`). For real matrices: promote to complex, compute, take the real part.
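A NumPy sketch of the geev!-style logarithm, valid under the assumption that A is diagonalizable with no eigenvalues on the closed negative real axis (`log_via_eig` is an illustrative name, not the PR's method):

```python
import numpy as np

def log_via_eig(A):
    """Principal matrix logarithm via a general (geev-style)
    eigendecomposition: log(A) = V diag(log(lam)) inv(V).
    Real inputs are promoted to complex and the real part is
    taken at the end, mirroring the PR's log_safe! approach."""
    lam, V = np.linalg.eig(A.astype(complex))
    L = (V * np.log(lam)) @ np.linalg.inv(V)   # V diag(log lam) V^{-1}
    return L.real if np.isrealobj(A) else L
```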
Tests: 47/47 verified passing

- Euclidean: `exp!`, `retract!`, `project!`, `inner`, `norm` (including `CuArray{Float32}`)
- Sphere: `exp!`, `project!` (point + vector), `inner`, `norm`
- UnitaryMatrices: `exp!`, `project!`, QR retraction, `log`
- Grassmann: `exp!`, `project!` (point + vector), QR retraction
- PowerManifold (nested): `CuArray` allocation + `exp!`
- Correctness checked against CPU via `isapprox(Array(gpu), cpu)`

All tests gracefully skip when CUDA is not available.
Related PRs