
Add GPU solver tests on Gradient Descent and Conjugate Gradient Descent #579

Closed
zazabap wants to merge 1 commit into JuliaManifolds:master from zazabap:cuda-gpu-tests-v2

Conversation


@zazabap zazabap commented Feb 20, 2026

Summary

Add a comprehensive GPU test suite (17 tests, all verified passing on RTX 3090) verifying that Manopt.jl solvers work transparently with CuArray-backed manifold points. No ManoptCUDAExt.jl is needed — the _produce_type fix from #577 handles GPU allocation natively.

This supersedes PR #574 (which used a CUDA extension approach, now unnecessary).

Which solvers work on GPU

Verified working (tested in this PR)

| Solver | Stepsize | Manifold | Precision |
| --- | --- | --- | --- |
| gradient_descent | ConstantLength | Euclidean(3) | Float64 |
| gradient_descent | ArmijoLinesearch(M, p) | Euclidean(3) | Float64 |
| gradient_descent | ConstantLength | Euclidean(3) | Float32 |
| gradient_descent | ConstantLength | Euclidean(3,3) (matrix) | Float64 |
| gradient_descent | ConstantLength | Sphere(2) | Float64 |
| gradient_descent | ArmijoLinesearch + record=[:Cost] | Euclidean(3) | Float64 |
| conjugate_gradient_descent | ConstantLength | Euclidean(5) | Float64 |

Expected to work (same code path, not individually tested)

Any first-order solver with user-provided gradient works if the manifold retraction and gradient are GPU-compatible (quasi_Newton, subgradient_method, etc.).
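As a hedged illustration of the pattern these tests exercise (this is a sketch, not code from the PR's test file; the target values and step length are made up for the example, and a functional CUDA GPU is required):

```julia
using CUDA, Manifolds, Manopt

M = Euclidean(3)
target = CuArray([1.0, 2.0, 3.0])        # known minimizer, GPU-resident
f(M, p) = sum(abs2, p .- target) / 2      # cost evaluates on the GPU
grad_f(M, p) = p .- target                # gradient returns a CuArray

p0 = CUDA.zeros(Float64, 3)
q = gradient_descent(M, f, grad_f, p0;
    stepsize=Manopt.ConstantLength(0.5),
    stopping_criterion=StopAfterIteration(100))

@assert q isa CuArray  # the iterate never leaves the device
```

The key point is that no CUDA-specific code appears in the call: the solver simply inherits the array type of `p0`.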

Requirements for GPU solvers

  1. Stepsize: ConstantStepsize/ConstantLength always work. ArmijoLinesearch(M, p) works when constructed with a GPU point p (allocates GPU candidate buffer).
  2. Gradient: User-provided gradients must return CuArray. Zygote AD works (see Add CUDA GPU tests for AD gradient computation ManifoldDiff.jl#84); ForwardDiff does not (scalar indexing in seed!).
  3. Manifold: Must have GPU-compatible operations (see GPU-native manifold operations via CUDA extension Manifolds.jl#856 for which manifolds work).
  4. Recording: record=[:Cost] works transparently.

Why no ManoptCUDAExt.jl?

PR #577 added _produce_type(factory, M, p) which passes the user's point type to stepsize/direction-update constructors. This means ArmijoLinesearchStepsize(M, p) automatically does candidate_point = allocate(p), producing a CuArray when p is a CuArray. The solver loop uses broadcasting and ManifoldsBase operations — all GPU-compatible. No Manopt-side overrides are needed.
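A minimal sketch of the allocation mechanism (simplified; the real `_produce_type` lives in Manopt.jl after #577, and this requires CUDA hardware): `ManifoldsBase.allocate` produces storage of the same array type as its argument, so buffers created from a GPU point stay on the GPU.

```julia
using CUDA, ManifoldsBase

p_cpu = zeros(3)
p_gpu = CuArray(p_cpu)

@assert allocate(p_cpu) isa Vector{Float64}
@assert allocate(p_gpu) isa CuArray   # array type preserved, no host copy
```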

GPU benchmark results (RTX 3090, 300 iterations)

| Problem | CPU | GPU | Speedup |
| --- | --- | --- | --- |
| Euclidean dense quadratic ½p'Ap, n=8192 | 2.2 s | 0.19 s | 11.4x |
| Euclidean Zygote AD gradient, n=8192 | 4.1 s | 0.43 s | 9.6x |
| Euclidean Armijo linesearch, n=262144 | 0.89 s | 0.11 s | 7.9x |
| Grassmann Gr(2048,512) Rayleigh quotient | 12.5 s | 8.4 s | 1.5x |
| Sphere S^32767 data fitting (CG) | 0.14 s | 0.10 s | 1.4x |

GPU speedup scales with per-iteration compute intensity: O(n²) matrix operations benefit most.
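A back-of-envelope arithmetic-intensity comparison illustrates this (illustrative numbers only, not measurements from the PR): an O(n) element-wise update does very little work per byte moved, while a dense matrix-vector product amortizes kernel-launch and memory-traffic overheads over far more floating-point work per iteration.

```julia
# Float64 = 8 bytes; counts are the standard flop/byte estimates.
n = 8192

flops_axpy   = 2n              # y .= a .* x .+ y
bytes_axpy   = 3n * 8          # read x, read y, write y
flops_matvec = 2n^2            # A * p
bytes_matvec = (n^2 + 2n) * 8  # read A, read p, write result

intensity_axpy   = flops_axpy / bytes_axpy      # ~0.083 flops/byte
intensity_matvec = flops_matvec / bytes_matvec  # ~0.25 flops/byte, plus
                                                # launch overhead amortized
                                                # over n^2 work
```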

Tests: 17/17 verified passing

| Test group | Count | What's verified |
| --- | --- | --- |
| GD + ConstantLength | 2 | result isa CuArray, converges to known target |
| GD + ArmijoLinesearch | 2 | Armijo backtracking works on GPU |
| Float32 | 2 | no Float64 promotion, atol=1e-3 |
| Matrix Euclidean | 3 | 3×3 matrix points, Frobenius norm |
| Conjugate GD | 2 | CG with Fletcher-Reeves default |
| Sphere | 3 | converges to known closest point, result on sphere |
| Recording | 2 | record=[:Cost] produces decreasing cost sequence |
| CPU-vs-GPU equivalence | 1 | isapprox(Array(gpu_result), cpu_result) |

All tests gracefully skip when CUDA is not available.
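The skip guard presumably looks something like the following sketch (the actual test file may differ); `CUDA.functional()` reports whether a usable CUDA device and driver are present:

```julia
using Test, CUDA

@testset "GPU solver tests" begin
    if CUDA.functional()
        # ... the 17 GPU solver tests run here ...
    else
        @info "CUDA not functional, skipping GPU solver tests"
    end
end
```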

Changes

  • test/test_cuda_ext.jl — new test file (17 tests)
  • test/runtests.jl — added include("test_cuda_ext.jl")
  • test/Project.toml — added CUDA to [deps] and [compat]
  • Changelog.md — added entry about GPU test coverage

Related PRs

17 tests verifying gradient_descent and conjugate_gradient_descent work
transparently with CuArray-backed manifold points. No ManoptCUDAExt
needed — the _produce_type fix from JuliaManifolds#577 handles GPU allocation natively.

Tests cover: ConstantLength/ArmijoLinesearch stepsizes, Float32/Float64,
matrix Euclidean, Sphere, recording with :Cost, CPU-vs-GPU equivalence.
All tests use known closed-form solutions for verification.
@zazabap zazabap marked this pull request as draft February 20, 2026 16:01

codecov Bot commented Feb 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (2bd767a) to head (33d325a).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##            master      #579   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           91        91           
  Lines        10013     10013           
=========================================
  Hits         10013     10013           


zazabap commented Feb 20, 2026

Benchmark: CPU vs GPU — Compute-Intensive Objectives

The benchmark script below tests GPU speedup on three manifolds with O(n²) or O(n·N) compute per iteration:

| Panel | Manifold | Objective | Compute/iter |
| --- | --- | --- | --- |
| 1 | Sphere S^(n-1) | 500-point data fitting: f(p) = -p'·D·1_N | O(n·N) |
| 2 | Grassmann Gr(n,k) | Rayleigh quotient: f(P) = -tr(P'AP) with dense A | O(n²k) |
| 3 | Euclidean R^n | Dense quadratic: f(p) = ½p'Ap - b'p | O(n²) |

Results (RTX 3090)

Peak speedups: 11.4x (Euclidean dense quadratic, n=8192), 9.6x (Zygote AD, n=8192), 4.5x (Sphere, n=32768), 1.5x (Grassmann, 2048×512)

Note: GPU advantage requires sufficient per-iteration compute. Element-wise O(n) operations are memory-bound and won't see meaningful speedup.

[Figure: benchmark_cpu_vs_gpu.png, CPU vs GPU timing plots for the three panels]
Benchmark script (examples/benchmark_cpu_vs_gpu.jl)
#!/usr/bin/env julia
#
# benchmark_cpu_vs_gpu.jl — CPU vs GPU benchmark with compute-intensive objectives
#
# Uses objectives with O(n²) or O(n·N) compute per iteration to stress GPU:
#   Panel 1: Sphere — data fitting with N data points (matrix-vector products)
#   Panel 2: Grassmann — Rayleigh quotient with dense A (matrix-matrix products)
#   Panel 3: Euclidean — quadratic f(p) = ½p'Ap - b'p with dense SPD matrix A
#   Panel 4: Summary speedup comparison
#
# Usage:
#   julia --project=test examples/benchmark_cpu_vs_gpu.jl

using LinearAlgebra
using Random
using Printf
using Statistics

using CUDA
using CairoMakie
using ManifoldsBase
using ManifoldsBase: exp!, inner, retract!
import Manifolds
using Manifolds: Euclidean, Grassmann, QRRetraction, project
using ManifoldDiff
using Manopt
using Zygote
using ADTypes: AutoZygote

const OUTDIR = joinpath(@__DIR__, "results")
mkpath(OUTDIR)

# ── Utilities ────────────────────────────────────────────────────────

function bench(f, nruns=5)
    f()  # warmup
    CUDA.synchronize()
    local result
    times = Float64[]
    for _ in 1:nruns
        CUDA.synchronize()
        t0 = time()
        result = f()
        CUDA.synchronize()
        push!(times, time() - t0)
    end
    return median(times), result
end

# ── Panel 1: Sphere — data fitting with N points ────────────────────
#
# Given N data points d_i on S^(n-1), stored as columns of matrix D (n × N):
#   f(p)      = -p' D 1_N  =  -sum(D' p)  (maximize total alignment)
#   grad_f(p) = -(D 1_N) + dot(p, D 1_N) p  (Riemannian gradient)
#
# Each cost/gradient evaluation does an O(n·N) matrix-vector product.

function bench_sphere(; n_data=500, n_iters=300)
    dims = [128, 512, 2048, 8192, 32768]

    cpu_gd = Float64[]
    gpu_gd = Float64[]
    cpu_cg = Float64[]
    gpu_cg = Float64[]

    println("\n── Panel 1: Sphere S^(n-1) — Data fitting ($n_data points, $n_iters iters) ──")
    for n in dims
        M = Manifolds.Sphere(n - 1)
        rng = MersenneTwister(42)

        D_c = randn(rng, n, n_data)
        for j in 1:n_data
            D_c[:, j] ./= LinearAlgebra.norm(D_c[:, j])
        end
        D_g = CuArray(D_c)

        Dsum_c = D_c * ones(n_data)
        Dsum_g = CuArray(Dsum_c)

        f_cpu(M, p) = -dot(p, Dsum_c)
        f_gpu(M, p) = -dot(p, Dsum_g)

        grad_f_cpu(M, p) = -Dsum_c .+ dot(p, Dsum_c) .* p
        grad_f_gpu(M, p) = -Dsum_g .+ dot(p, Dsum_g) .* p

        p0_c = randn(rng, n); p0_c ./= LinearAlgebra.norm(p0_c)
        p0_g = CuArray(p0_c)
        stop = StopAfterIteration(n_iters)
        ss = Manopt.ConstantStepsize(M, 0.005)

        t_c, _ = bench(() -> gradient_descent(M, f_cpu, grad_f_cpu, copy(p0_c);
            stepsize=ss, stopping_criterion=stop))
        t_g, _ = bench(() -> gradient_descent(M, f_gpu, grad_f_gpu, copy(p0_g);
            stepsize=ss, stopping_criterion=stop))
        push!(cpu_gd, t_c); push!(gpu_gd, t_g)

        t_c2, _ = bench(() -> conjugate_gradient_descent(M, f_cpu, grad_f_cpu, copy(p0_c);
            stepsize=ss, stopping_criterion=stop))
        t_g2, _ = bench(() -> conjugate_gradient_descent(M, f_gpu, grad_f_gpu, copy(p0_g);
            stepsize=ss, stopping_criterion=stop))
        push!(cpu_cg, t_c2); push!(gpu_cg, t_g2)

        @printf("  n=%6d  GD: CPU %.3e GPU %.3e (%5.1fx)  CG: CPU %.3e GPU %.3e (%5.1fx)\n",
            n, t_c, t_g, t_c / t_g, t_c2, t_g2, t_c2 / t_g2)
    end
    return dims, cpu_gd, gpu_gd, cpu_cg, gpu_cg
end

# ── Panel 2: Grassmann — Rayleigh quotient with dense matrix ────────
#
# Given dense symmetric A ∈ R^(n×n), find the k-dim dominant subspace:
#   f(P)      = -tr(P' A P)          O(n²k) per cost evaluation
#   grad_f(P) = -2(AP - P(P'AP))     O(n²k) per gradient evaluation

function bench_grassmann(; n_iters=300)
    configs = [
        (64,   16),
        (128,  32),
        (256,  64),
        (512,  128),
        (1024, 256),
        (2048, 512),
    ]

    cpu_times = Float64[]
    gpu_times = Float64[]

    println("\n── Panel 2: Grassmann Gr(n,k) — Rayleigh quotient ($n_iters iters) ──")
    for (n, k) in configs
        M = Grassmann(n, k)
        rng = MersenneTwister(42)

        B = randn(rng, n, n)
        A_c = Symmetric((B' * B) ./ n)
        A_g = CuArray(Matrix(A_c))

        f_cpu(M, P) = -tr(P' * A_c * P)
        f_gpu(M, P) = -tr(P' * A_g * P)

        function grad_f_cpu(M, P)
            AP = A_c * P
            return -2.0 .* (AP .- P * (P' * AP))
        end
        function grad_f_gpu(M, P)
            AP = A_g * P
            return -2.0 .* (AP .- P * (P' * AP))
        end

        p0_c = Matrix(qr(randn(rng, n, k)).Q)
        p0_g = CuArray(p0_c)

        stop = StopAfterIteration(n_iters)
        ss = Manopt.ConstantStepsize(M, 0.002)
        retraction = QRRetraction()

        t_c, _ = bench(() -> gradient_descent(M, f_cpu, grad_f_cpu, copy(p0_c);
            stepsize=ss, retraction_method=retraction, stopping_criterion=stop), 3)
        t_g, _ = bench(() -> gradient_descent(M, f_gpu, grad_f_gpu, copy(p0_g);
            stepsize=ss, retraction_method=retraction, stopping_criterion=stop), 3)

        push!(cpu_times, t_c); push!(gpu_times, t_g)

        @printf("  Gr(%4d,%3d)  CPU: %.3e s  GPU: %.3e s  speedup: %5.1fx\n",
            n, k, t_c, t_g, t_c / t_g)
    end
    return configs, cpu_times, gpu_times
end

# ── Panel 3: Euclidean — Dense quadratic f(p) = ½p'Ap - b'p ────────
#
# With dense SPD matrix A (n×n), each gradient evaluation is O(n²):
#   f(p)      = ½ p' A p - b' p
#   grad_f(p) = A p - b

function bench_euclidean_dense(; n_iters=300)
    dims = [128, 512, 2048, 8192]

    cpu_gd = Float64[]
    gpu_gd = Float64[]
    cpu_ad = Float64[]
    gpu_ad = Float64[]

    backend = ManifoldDiff.RiemannianProjectionBackend(AutoZygote())

    println("\n── Panel 3: Euclidean — Dense quadratic ½p'Ap ($n_iters iters) ──")
    for n in dims
        M = Euclidean(n)
        rng = MersenneTwister(42)

        B = randn(rng, n, n)
        A_c = (B' * B) ./ n + I
        b_c = randn(rng, n)
        A_g = CuArray(Matrix(A_c))
        b_g = CuArray(b_c)

        f_cpu(M, p) = 0.5 * dot(p, A_c * p) - dot(b_c, p)
        f_gpu(M, p) = 0.5 * dot(p, A_g * p) - dot(b_g, p)

        grad_f_cpu(M, p) = A_c * p .- b_c
        grad_f_gpu(M, p) = A_g * p .- b_g

        p0_c = zeros(n)
        p0_g = CuArray(p0_c)
        stop = StopAfterIteration(n_iters)
        ss = Manopt.ConstantStepsize(M, 0.001)

        t_c, _ = bench(() -> gradient_descent(M, f_cpu, grad_f_cpu, copy(p0_c);
            stepsize=ss, stopping_criterion=stop), 3)
        t_g, _ = bench(() -> gradient_descent(M, f_gpu, grad_f_gpu, copy(p0_g);
            stepsize=ss, stopping_criterion=stop), 3)
        push!(cpu_gd, t_c); push!(gpu_gd, t_g)

        f_ad_cpu(p) = 0.5 * dot(p, A_c * p) - dot(b_c, p)
        f_ad_gpu(p) = 0.5 * dot(p, A_g * p) - dot(b_g, p)
        grad_ad_cpu(M, p) = ManifoldDiff.gradient(M, f_ad_cpu, p, backend)
        grad_ad_gpu(M, p) = ManifoldDiff.gradient(M, f_ad_gpu, p, backend)

        t_c2, _ = bench(() -> gradient_descent(M, f_cpu, grad_ad_cpu, copy(p0_c);
            stepsize=ss, stopping_criterion=stop), 3)
        t_g2, _ = bench(() -> gradient_descent(M, f_gpu, grad_ad_gpu, copy(p0_g);
            stepsize=ss, stopping_criterion=stop), 3)
        push!(cpu_ad, t_c2); push!(gpu_ad, t_g2)

        @printf("  n=%5d  GD: CPU %.3e GPU %.3e (%5.1fx)  AD: CPU %.3e GPU %.3e (%5.1fx)\n",
            n, t_c, t_g, t_c / t_g, t_c2, t_g2, t_c2 / t_g2)
    end
    return dims, cpu_gd, gpu_gd, cpu_ad, gpu_ad
end

# ── Plotting ─────────────────────────────────────────────────────────

function make_plots(r_sphere, r_grass, r_euclid)
    fig = Figure(size=(1400, 1000), fontsize=13)

    cpu_color = :steelblue
    gpu_color = :coral
    speedup_color = :forestgreen

    dims1, cpu_gd, gpu_gd, cpu_cg, gpu_cg = r_sphere
    ax1 = Axis(fig[1, 1];
        title="Sphere S^(n-1) — 500-point Data Fitting",
        xlabel="Dimension n", ylabel="Time (seconds)",
        xscale=log2, yscale=log10,
    )
    scatterlines!(ax1, dims1, cpu_gd; label="CPU GD", color=cpu_color,
        marker=:circle, markersize=10, linewidth=2)
    scatterlines!(ax1, dims1, gpu_gd; label="GPU GD", color=gpu_color,
        marker=:circle, markersize=10, linewidth=2)
    scatterlines!(ax1, dims1, cpu_cg; label="CPU CG", color=cpu_color,
        marker=:diamond, markersize=10, linewidth=2, linestyle=:dash)
    scatterlines!(ax1, dims1, gpu_cg; label="GPU CG", color=gpu_color,
        marker=:diamond, markersize=10, linewidth=2, linestyle=:dash)
    axislegend(ax1; position=:lt, framevisible=true, labelsize=11)

    configs2, cpu_gr, gpu_gr = r_grass
    dims2 = [c[1] for c in configs2]
    ax2 = Axis(fig[1, 2];
        title="Grassmann Gr(n,k) — Rayleigh Quotient (dense A)",
        xlabel="Matrix size n", ylabel="Time (seconds)",
        xscale=log2, yscale=log10,
    )
    scatterlines!(ax2, dims2, cpu_gr; label="CPU", color=cpu_color,
        marker=:circle, markersize=10, linewidth=2)
    scatterlines!(ax2, dims2, gpu_gr; label="GPU", color=gpu_color,
        marker=:rect, markersize=10, linewidth=2)
    ax2b = Axis(fig[1, 2]; ylabel="Speedup", yaxisposition=:right, xscale=log2,
        yticklabelcolor=speedup_color, ylabelcolor=speedup_color)
    hidexdecorations!(ax2b); hidespines!(ax2b)
    scatterlines!(ax2b, dims2, cpu_gr ./ gpu_gr; color=speedup_color,
        marker=:diamond, markersize=10, linewidth=2, linestyle=:dot)
    hlines!(ax2b, [1.0]; color=:gray, linestyle=:dash, linewidth=1)
    axislegend(ax2; position=:lt, framevisible=true, labelsize=11)

    dims3, cpu_gd3, gpu_gd3, cpu_ad3, gpu_ad3 = r_euclid
    ax3 = Axis(fig[2, 1];
        title="Euclidean — Dense Quadratic ½p'Ap",
        xlabel="Dimension n", ylabel="Time (seconds)",
        xscale=log2, yscale=log10,
    )
    scatterlines!(ax3, dims3, cpu_gd3; label="CPU GD", color=cpu_color,
        marker=:circle, markersize=10, linewidth=2)
    scatterlines!(ax3, dims3, gpu_gd3; label="GPU GD", color=gpu_color,
        marker=:circle, markersize=10, linewidth=2)
    scatterlines!(ax3, dims3, cpu_ad3; label="CPU Zygote AD", color=cpu_color,
        marker=:utriangle, markersize=10, linewidth=2, linestyle=:dash)
    scatterlines!(ax3, dims3, gpu_ad3; label="GPU Zygote AD", color=gpu_color,
        marker=:utriangle, markersize=10, linewidth=2, linestyle=:dash)
    axislegend(ax3; position=:lt, framevisible=true, labelsize=11)

    ax4 = Axis(fig[2, 2];
        title="GPU Speedup Summary",
        xlabel="Problem size (increasing →)",
        ylabel="Speedup (CPU / GPU)",
    )
    sp_data = [
        ("Sphere GD",     cpu_gd ./ gpu_gd,     :circle),
        ("Sphere CG",     cpu_cg ./ gpu_cg,     :diamond),
        ("Grassmann GD",  cpu_gr ./ gpu_gr,      :rect),
        ("Euclid GD",     cpu_gd3 ./ gpu_gd3,   :utriangle),
        ("Euclid AD",     cpu_ad3 ./ gpu_ad3,    :star5),
    ]
    for (label, sp, marker) in sp_data
        scatterlines!(ax4, 1:length(sp), sp; label=label,
            marker=marker, markersize=10, linewidth=2)
    end
    hlines!(ax4, [1.0]; color=:gray, linestyle=:dash, linewidth=1, label="GPU = CPU")
    axislegend(ax4; position=:lt, framevisible=true, labelsize=10)

    Label(fig[0, :],
        "Manopt.jl Solver Benchmarks: CPU vs GPU — Compute-Intensive Objectives (RTX 3090)";
        fontsize=17, font=:bold)

    path = joinpath(OUTDIR, "benchmark_cpu_vs_gpu.png")
    save(path, fig; px_per_unit=2)
    println("\nSaved: $path")
    return fig
end

function main()
    println("=" ^ 72)
    println("  Manopt.jl Solver Benchmark — Compute-Intensive Objectives")
    println("  GPU:  $(CUDA.name(CUDA.device()))")
    println("  CUDA: $(CUDA.runtime_version())")
    println("  Each objective involves O(n²) or O(n·N) compute per iteration")
    println("=" ^ 72)

    r_sphere = bench_sphere()
    r_grass  = bench_grassmann()
    r_euclid = bench_euclidean_dense()

    make_plots(r_sphere, r_grass, r_euclid)
    println("\nDone.")
end

main()

How to run

julia --project=test examples/benchmark_cpu_vs_gpu.jl

Requires: CUDA, CairoMakie, Manifolds, ManifoldsBase, ManifoldDiff, Manopt, Zygote, ADTypes


zazabap commented Feb 20, 2026

How to review these cross-repo changes

Four PRs, recommended review order (bottom-up by dependency):

  1. ManifoldsBase.jl #253 — CUDA extension for allocate/allocate_on to preserve CuArray type
  2. Manifolds.jl #856 — CUDA extension for GPU-native QR retraction, random points, matrix log on GeneralUnitaryMatrices
  3. ManifoldDiff.jl #84 — Tests only: Zygote AD gradients on CuArrays
  4. Manopt.jl #579 — Tests only: solvers (GD, CG, Armijo, recording) on CuArrays

PRs 1-2 add extension modules + tests. PRs 3-4 add test files only — no runtime code changes. Each PR is independently mergeable and all CUDA tests skip gracefully without a GPU.

No ManoptCUDAExt is needed because #577 (_produce_type) already handles GPU-aware allocation in stepsize constructors.

@zazabap zazabap changed the title Add comprehensive GPU solver test suite (no extension needed) Add comprehensive GPU solver test suite Feb 20, 2026
@kellertuer
Member

Thanks for that start. As a first comment, I have two remarks on wording/phrasing:

  1. A “test suite” is a general function that is able to test something generically, cf. e.g. the (relatively new) Manifolds test suite https://github.com/JuliaManifolds/Manifolds.jl/blob/master/src/test.jl (defined in the test extension), which allows one to test the functionality of a manifold, that is, all functions a manifold can have. What you do here is add a test case.
  2. “Comprehensive” (cf. https://www.merriam-webster.com/dictionary/comprehensive) means “covering completely”. On a conservative estimate, Manopt.jl provides 32 algorithms. Of these you test one (GD) a bit more and a second one (CG) exactly once.

I would suppose this is either AI slop (then please be more careful with AI) or a language barrier (then never mind, but take my comment into account).

I am not yet sure how to review this in general, since I work solely on macOS, where I would only have Metal (see Metal.jl) available. See also Mateusz's comment asking whether it might be better to use https://github.com/JuliaGPU/GPUArrays.jl instead of a direct dependency on CUDA.jl.


zazabap commented Feb 21, 2026

> Thanks for that start. As a first comment, I have two remarks on wording/phrasing:
>
> 1. A “test suite” is a general function that is able to test something generically, cf. e.g. the (relatively new) Manifolds test suite https://github.com/JuliaManifolds/Manifolds.jl/blob/master/src/test.jl (defined in the test extension), which allows one to test the functionality of a manifold, that is, all functions a manifold can have. What you do here is add a test case.
>
> 2. “Comprehensive” (cf. https://www.merriam-webster.com/dictionary/comprehensive) means “covering completely”. On a conservative estimate, Manopt.jl provides 32 algorithms. Of these you test one (GD) a bit more and a second one (CG) exactly once.
>
> I would suppose this is either AI slop (then please be more careful with AI) or a language barrier (then never mind, but take my comment into account).
>
> I am not yet sure how to review this in general, since I work solely on macOS, where I would only have Metal (see Metal.jl) available. See also Mateusz's comment asking whether it might be better to use https://github.com/JuliaGPU/GPUArrays.jl instead of a direct dependency on CUDA.jl.

Thanks for pointing out the mistake! Manopt.jl does provide plenty of algorithms, of which I check only the simplest ones here. For the other algorithms, I also consider the current GPU test coverage insufficient. Since the update to Manifolds.jl indicates that some overrides are not yet implemented, I will gradually add tests once the implementations there are finished.

@kellertuer
Member

For me it is also fine if you “start somewhere” and we first have a docs page that describes which solvers/stepsizes/... are already tested/verified. Then just call it neither a test suite nor comprehensive; it would still be a very welcome contribution.

@zazabap zazabap changed the title Add comprehensive GPU solver test suite Add GPU solver tests on Gradient Descent and Conjugate Gradient Descent Feb 21, 2026

zazabap commented Feb 21, 2026

> For me it is also fine if you “start somewhere” and we first have a docs page that describes which solvers/stepsizes/... are already tested/verified. Then just call it neither a test suite nor comprehensive; it would still be a very welcome contribution.

I see; I would contribute tests in the following directions:

  1. Get the extension PR for Manifolds.jl tested and merged.
  2. Add the Gradient Descent and Conjugate Gradient Descent tests based on the CUDA.jl integration in Manifolds.jl.
  3. Contribute to a docs page listing which solvers and stepsizes have complete or missing GPU tests.

I have also updated the title of this PR to match the contribution.

@kellertuer
Member

As discussed already here

JuliaManifolds/Manifolds.jl#856 (comment)

for Manifolds.jl I think a major challenge here is to make sure the code works correctly and is tested. For now, I do not see how that could be done. Since I solely own two MacBooks, I cannot even test the code myself.

From the current state of this PR, with no actual changes to the package besides a strong dependency on CUDA (why?), I am not yet sure how this continues.
If all necessary parts are actually resolved in the new https://github.com/JuliaManifolds/ManifoldsGPU.jl, then this PR would be obsolete. We can still check whether it then makes sense to write a tutorial as a Quarto notebook that is “prerendered” to Markdown (and that someone should check and re-render from time to time to make sure the support still works).
Maybe that is a good way to go :)

By the way, I saw you are (at least partly?) based in Tokyo? I am currently a guest scientist at RIKEN.


zazabap commented Feb 26, 2026

@kellertuer Welcome to Tokyo! I am a part-time research associate at RIKEN AIP. If you and the collaborators think it would be better to work in the new ManifoldsGPU.jl package, then contributing on that side might be the better solution :). I will be back in Tokyo at the end of March.

If you lack access to a CUDA environment, I could provide one from April: our lab will purchase a new workstation with an RTX 5060 Ti / RTX 5070 Ti, and I will try to obtain remote access for collaborators after officially discussing it with the PI in my lab.

@kellertuer
Member

Yes, working on that in ManifoldsGPU.jl would be best.
We could afterwards add a tutorial here.

The simple reason is that I would like to keep Manopt.jl and Manifolds.jl at a very high test coverage, and for now there is no way to test GPU code on CI.
So the problem is not that I personally do not have access, but that the CI that runs the tests and analyses and displays code coverage does not have such access either.

If at any time later such a CI becomes available, we can still think about migrating the ManifoldsGPU code over.

I am actually also still in Tokyo end of March.

@mateuszbaran
Member

Yes, ManifoldsGPU.jl is the right place for all GPU-related work for now. We will link to that repository from Manopt.jl and Manifolds.jl when it's usable enough to promote GPU support. We will also need a short tutorial explaining the caveats of working with GPUs. For example, on the Stiefel manifold the QR retraction is generally the fastest on CPU, but on GPU it can currently only be partially batched. SVD, on the other hand, works fully batched, so the polar retraction is preferable there.

Also, one note for benchmarking: you should use in-place gradients in your benchmark scripts for better performance: https://manoptjl.org/stable/tutorials/InplaceGradient/#Speedup-using-in-place-evaluation .
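The in-place pattern referred to here can be sketched as follows (a hedged example with a made-up problem, not code from the benchmark script; see the linked tutorial for the authoritative version). The gradient mutates a preallocated tangent vector `X` instead of allocating a new array on every call, and the solver is told via `evaluation=InplaceEvaluation()`:

```julia
using Manopt, Manifolds
using LinearAlgebra: I, dot, mul!

n = 1000
B = randn(n, n); A = (B' * B) / n + I     # dense SPD matrix
M = Euclidean(n)
f(M, p) = dot(p, A * p) / 2
grad_f!(M, X, p) = mul!(X, A, p)          # writes A*p into X, no allocation

q = gradient_descent(M, f, grad_f!, zeros(n);
    evaluation=InplaceEvaluation(),
    stepsize=Manopt.ConstantStepsize(M, 1e-3),
    stopping_criterion=StopAfterIteration(100))
```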

@kellertuer
Member

Since we started the separate GPU package, I'll close this one for now. Feel free to reopen it and move it to the other repository, if that feels fitting.

@kellertuer kellertuer closed this Mar 16, 2026

zazabap commented Mar 19, 2026

Once the implementation in ManifoldsGPU.jl is finished, we could think of a way to test the CUDA support of Manopt.jl. I will mark this as a possible future contribution.

@kellertuer
Member

The challenge there is that currently no CI is available (free of charge) that has GPUs activated.

Since all the work on these packages is done voluntarily, that is, I am paid nothing for developing, supporting, and bug-fixing them, I also have no money to spend on paying for CI.
So for now I do not see how such a test could work.

I also hope that the main work will have to be done in Manifolds.jl; here maybe only a few small bugfixes are required where the algorithms are not generic enough.

One thing we can think about is a small tutorial that we can pre-render (see for example how that is done in ManoptExamples.jl) and then include in the docs, where all current examples are re-run on every documentation generation.

@mateuszbaran
Member

Currently the only realistic option is to periodically run GPU tests locally and manually. It's still worth having those tests to sometimes check if GPU support still works, and maybe in the future we somehow get GPU-enabled CI.

A pre-rendered example in the tutorial would also be nice, sure. IIRC the hyperparameter optimization example is currently pre-rendered?

@kellertuer
Member

I am absolutely not a fan of a manual CI; that was already the reason for the separate package. But sure, manually run docs are fine; some are already like that.
