
[CUSPARSE] create slices of sparse matrices using boolean masks #3032

Merged
maleadt merged 8 commits into JuliaGPU:master from hexaeder:master
Apr 14, 2026
Conversation

@hexaeder
Contributor

Slicing into a sparse matrix by providing a Vector{Bool} per axis to select a subset of rows and columns is quite useful. It is already implemented for several types, for example "normal" dense matrices and CPU sparse matrices:

N = 10
A = sprand(N, N, 0.5)
rowmask = [rand([true, false]) for _ in 1:size(A, 1)]
colmask = [rand([true, false]) for _ in 1:size(A, 2)]
S = A[rowmask, colmask]

This PR implements this functionality for CuSparseMatrixCSR and CuSparseMatrixCSC.
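For illustration, boolean-mask slicing of a CSR matrix boils down to a compaction over the stored entries: remap surviving columns via a prefix count of the mask, then copy each kept row's surviving entries. A CPU sketch in plain Python (this only shows the semantics; the PR implements the equivalent as CUDA kernels for CuSparseMatrixCSR and CuSparseMatrixCSC, and the helper name here is hypothetical):

```python
# Hypothetical CPU sketch of A[rowmask, colmask] for a CSR matrix given as
# (rowptr, colval, nzval) with 0-based indices. Illustration only -- the PR
# does this on the GPU with CUDA kernels.

def csr_mask_slice(rowptr, colval, nzval, rowmask, colmask):
    """Return (rowptr, colval, nzval) of the masked slice in CSR form."""
    # New column index for each kept column: running count of kept columns
    # before it (only meaningful where colmask is True).
    colmap, count = [], 0
    for keep in colmask:
        colmap.append(count)
        if keep:
            count += 1
    new_rowptr, new_colval, new_nzval = [0], [], []
    for row, keep in enumerate(rowmask):
        if not keep:
            continue                      # dropped rows vanish entirely
        for k in range(rowptr[row], rowptr[row + 1]):
            if colmask[colval[k]]:        # entry survives if its column does
                new_colval.append(colmap[colval[k]])
                new_nzval.append(nzval[k])
        new_rowptr.append(len(new_colval))
    return new_rowptr, new_colval, new_nzval
```

Slicing with an all-false row mask collapses the result to a 0×k matrix (`rowptr == [0]`), one of the edge cases the PR's tests cover.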

AI disclaimer: Claude Code and ChatGPT helped me implement the kernels, but I think they are straightforward, and the results are tested against the slicing results of the CPU code.

@github-actions
Contributor

github-actions Bot commented Feb 20, 2026

Your PR no longer requires formatting changes. Thank you for your contribution!

@github-actions github-actions Bot left a comment

CUDA.jl Benchmarks

Details
Benchmark suite Current: aabf559 Previous: 01a0795 Ratio
array/accumulate/Float32/1d 100441 ns 101155 ns 0.99
array/accumulate/Float32/dims=1 76384 ns 76811 ns 0.99
array/accumulate/Float32/dims=1L 1583662.5 ns 1585181 ns 1.00
array/accumulate/Float32/dims=2 143158.5 ns 143761.5 ns 1.00
array/accumulate/Float32/dims=2L 656653.5 ns 657146 ns 1.00
array/accumulate/Int64/1d 118378 ns 118686 ns 1.00
array/accumulate/Int64/dims=1 79220 ns 79898.5 ns 0.99
array/accumulate/Int64/dims=1L 1694073 ns 1693836 ns 1.00
array/accumulate/Int64/dims=2 155363 ns 155970.5 ns 1.00
array/accumulate/Int64/dims=2L 961384 ns 961435.5 ns 1.00
array/broadcast 20447 ns 20595 ns 0.99
array/construct 1300.35 ns 1360 ns 0.96
array/copy 18749 ns 18770 ns 1.00
array/copyto!/cpu_to_gpu 213448 ns 214074.5 ns 1.00
array/copyto!/gpu_to_cpu 281725 ns 282538.5 ns 1.00
array/copyto!/gpu_to_gpu 11352 ns 11352 ns 1.00
array/iteration/findall/bool 131262 ns 131993 ns 0.99
array/iteration/findall/int 147834 ns 148953 ns 0.99
array/iteration/findfirst/bool 81393 ns 81399 ns 1.00
array/iteration/findfirst/int 83708 ns 83388 ns 1.00
array/iteration/findmin/1d 86996 ns 87034 ns 1.00
array/iteration/findmin/2d 117120 ns 116982 ns 1.00
array/iteration/logical 199699 ns 200300.5 ns 1.00
array/iteration/scalar 67377 ns 66902 ns 1.01
array/permutedims/2d 52720 ns 52557 ns 1.00
array/permutedims/3d 53191.5 ns 52858 ns 1.01
array/permutedims/4d 52327 ns 51891 ns 1.01
array/random/rand/Float32 12947 ns 12938 ns 1.00
array/random/rand/Int64 37236 ns 30232 ns 1.23
array/random/rand!/Float32 8517 ns 8486.333333333334 ns 1.00
array/random/rand!/Int64 26764 ns 34141 ns 0.78
array/random/randn/Float32 41380 ns 37762.5 ns 1.10
array/random/randn!/Float32 26645 ns 31561 ns 0.84
array/reductions/mapreduce/Float32/1d 35411 ns 34932 ns 1.01
array/reductions/mapreduce/Float32/dims=1 40167 ns 39395 ns 1.02
array/reductions/mapreduce/Float32/dims=1L 51692 ns 52264 ns 0.99
array/reductions/mapreduce/Float32/dims=2 56570 ns 56993 ns 0.99
array/reductions/mapreduce/Float32/dims=2L 69206 ns 69821.5 ns 0.99
array/reductions/mapreduce/Int64/1d 42851 ns 42986 ns 1.00
array/reductions/mapreduce/Int64/dims=1 48347 ns 42873 ns 1.13
array/reductions/mapreduce/Int64/dims=1L 87721 ns 87741 ns 1.00
array/reductions/mapreduce/Int64/dims=2 59691 ns 60047 ns 0.99
array/reductions/mapreduce/Int64/dims=2L 84689.5 ns 85349 ns 0.99
array/reductions/reduce/Float32/1d 35715.5 ns 35002.5 ns 1.02
array/reductions/reduce/Float32/dims=1 49584 ns 39743 ns 1.25
array/reductions/reduce/Float32/dims=1L 52037 ns 51984 ns 1.00
array/reductions/reduce/Float32/dims=2 56907 ns 57086 ns 1.00
array/reductions/reduce/Float32/dims=2L 70059 ns 69836.5 ns 1.00
array/reductions/reduce/Int64/1d 42850.5 ns 43090 ns 0.99
array/reductions/reduce/Int64/dims=1 42436 ns 42912 ns 0.99
array/reductions/reduce/Int64/dims=1L 87624 ns 87746 ns 1.00
array/reductions/reduce/Int64/dims=2 59926 ns 59769 ns 1.00
array/reductions/reduce/Int64/dims=2L 84711 ns 84712 ns 1.00
array/reverse/1d 18236 ns 18240 ns 1.00
array/reverse/1dL 68886 ns 68915 ns 1.00
array/reverse/1dL_inplace 65913 ns 65875 ns 1.00
array/reverse/1d_inplace 8529 ns 10243 ns 0.83
array/reverse/2d 20748 ns 20594 ns 1.01
array/reverse/2dL 73039 ns 72723 ns 1.00
array/reverse/2dL_inplace 65994 ns 65965 ns 1.00
array/reverse/2d_inplace 10220 ns 10404 ns 0.98
array/sorting/1d 2735313 ns 2734392 ns 1.00
array/sorting/2d 1072679 ns 1075205 ns 1.00
array/sorting/by 3305049 ns 3315166 ns 1.00
cuda/synchronization/context/auto 1174.7 ns 1168.9 ns 1.00
cuda/synchronization/context/blocking 930.6888888888889 ns 934.3225806451613 ns 1.00
cuda/synchronization/context/nonblocking 7234.700000000001 ns 8363.400000000001 ns 0.87
cuda/synchronization/stream/auto 1021.5882352941177 ns 1040.1333333333334 ns 0.98
cuda/synchronization/stream/blocking 787.747572815534 ns 841.1473684210526 ns 0.94
cuda/synchronization/stream/nonblocking 7399.6 ns 8156.4 ns 0.91
integration/byval/reference 143949 ns 144104 ns 1.00
integration/byval/slices=1 145864 ns 145896 ns 1.00
integration/byval/slices=2 284558 ns 284543 ns 1.00
integration/byval/slices=3 423124 ns 423283 ns 1.00
integration/cudadevrt 102623 ns 102484 ns 1.00
integration/volumerhs 23467460.5 ns 23421671.5 ns 1.00
kernel/indexing 13254 ns 13338 ns 0.99
kernel/indexing_checked 14088.5 ns 14091 ns 1.00
kernel/launch 2191.1111111111113 ns 2181.166666666667 ns 1.00
kernel/occupancy 722.158940397351 ns 705.1118421052631 ns 1.02
kernel/rand 17967 ns 15255 ns 1.18
latency/import 3812112212 ns 3822749737.5 ns 1.00
latency/precompile 4595809169 ns 4597359574 ns 1.00
latency/ttfp 4400844606.5 ns 4411959980 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@hexaeder
Contributor Author

hexaeder commented Mar 4, 2026

The new tests uncovered a CSC-to-CSR conversion error in CUDA runtime 12.0 (fixed in 12.1 and above).
The colVal stored in the CSR matrix is simply wrong, which breaks printing, indexing, and everything else.
Should this be tracked, or is that not useful since it is an upstream bug of that CUDA version?

I updated the tests to go through the COO format which hopefully avoids the buggy codepath.

MWE on Runtime 12.0

# version 12.0
julia> using CUDA, SparseArrays, CUDA.CUSPARSE

julia> A = sparse([1,1,1,1,2], [1,2,4,5,3], [1.0,2.0,3.0,4.0,5.0], 2, 5)
2×5 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
 1.0  2.0   ⋅   3.0  4.0
  ⋅    ⋅   5.0   ⋅    ⋅

julia> dA = CuSparseMatrixCSR(A) # wrong!
2×5 CuSparseMatrixCSR{Float64, Int32} with 5 stored entries:
 1.0  2.0  5.0  3.0   ⋅
  ⋅    ⋅    ⋅    ⋅   4.0

julia> CUDA.@allowscalar dA[1,1] # wrong!
0.0

MWE on Runtime 12.1

# version 12.1
julia> using CUDA, SparseArrays, CUDA.CUSPARSE

julia> A = sparse([1,1,1,1,2], [1,2,4,5,3], [1.0,2.0,3.0,4.0,5.0], 2, 5)
2×5 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
 1.0  2.0   ⋅   3.0  4.0
  ⋅    ⋅   5.0   ⋅    ⋅

julia> dA = CuSparseMatrixCSR(A) # correct!
2×5 CuSparseMatrixCSR{Float64, Int32} with 5 stored entries:
 1.0  2.0   ⋅   3.0  4.0
  ⋅    ⋅   5.0   ⋅    ⋅

julia> CUDA.@allowscalar dA[1,1] # correct!
1.0

@maleadt maleadt requested review from amontoison and kshyatt March 4, 2026 11:31
@codecov

codecov Bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 80.71429% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.32%. Comparing base (01a0795) to head (aabf559).
⚠️ Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
lib/cusparse/src/array.jl 49.05% 27 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3032      +/-   ##
==========================================
- Coverage   90.43%   90.32%   -0.12%     
==========================================
  Files         141      141              
  Lines       12025    12165     +140     
==========================================
+ Hits        10875    10988     +113     
- Misses       1150     1177      +27     

☔ View full report in Codecov by Sentry.

@hexaeder
Contributor Author

hexaeder commented Apr 7, 2026

Gentle bump on this. Is there something I can do to support review?

EDIT: I rebased; the GitHub UI was showing some apparent merge conflicts without showing me where the problems lay.

@kshyatt
Member

kshyatt commented Apr 7, 2026

Looks like this might need some TLC to get up to date with the (unreleased) CUDA.jl 6 and package splitting

@hexaeder
Contributor Author

hexaeder commented Apr 8, 2026

I think the problems related to the library split are fixed (just some CUDA -> CUDACore renamings). There are three remaining test failures I don't quite understand, but they seem unrelated?

@maleadt
Member

maleadt commented Apr 9, 2026

Nightly and Enzyme are allowed to fail, and the multiGPU one is intermittent. So no relevant CI failures.

Comment thread lib/cusparse/src/array.jl Outdated
CSC -> COO -> CSR and collect(CSR) are all broken on 12.0.
Therefore, we now go the fully manual route, constructing
CSR from CSC(Transpose(A)) and comparing element-wise.
- guard against accessing the last element of an empty vector
- use `checkbounds(A..)` to create proper error
- add tests:
  - check for BoundsError thrown
  - zero mask (full collapse of dimension)
  - matrices of size (0,n) and (m,0)
Member

@maleadt maleadt left a comment


LGTM. Too bad about the scalar getindex calls that essentially synchronize the GPU before doing any work, but I'm not familiar enough with the domain to suggest alternatives here.

@maleadt maleadt merged commit 5065018 into JuliaGPU:master Apr 14, 2026
2 checks passed