[CUSPARSE] create slices of sparse matrices using boolean masks #3032

maleadt merged 8 commits into JuliaGPU:master
Conversation
---

Your PR no longer requires formatting changes. Thank you for your contribution!

---

CUDA.jl Benchmarks
| Benchmark suite | Current: aabf559 | Previous: 01a0795 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 100441 ns | 101155 ns | 0.99 |
| array/accumulate/Float32/dims=1 | 76384 ns | 76811 ns | 0.99 |
| array/accumulate/Float32/dims=1L | 1583662.5 ns | 1585181 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143158.5 ns | 143761.5 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 656653.5 ns | 657146 ns | 1.00 |
| array/accumulate/Int64/1d | 118378 ns | 118686 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79220 ns | 79898.5 ns | 0.99 |
| array/accumulate/Int64/dims=1L | 1694073 ns | 1693836 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 155363 ns | 155970.5 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 961384 ns | 961435.5 ns | 1.00 |
| array/broadcast | 20447 ns | 20595 ns | 0.99 |
| array/construct | 1300.35 ns | 1360 ns | 0.96 |
| array/copy | 18749 ns | 18770 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 213448 ns | 214074.5 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 281725 ns | 282538.5 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11352 ns | 11352 ns | 1 |
| array/iteration/findall/bool | 131262 ns | 131993 ns | 0.99 |
| array/iteration/findall/int | 147834 ns | 148953 ns | 0.99 |
| array/iteration/findfirst/bool | 81393 ns | 81399 ns | 1.00 |
| array/iteration/findfirst/int | 83708 ns | 83388 ns | 1.00 |
| array/iteration/findmin/1d | 86996 ns | 87034 ns | 1.00 |
| array/iteration/findmin/2d | 117120 ns | 116982 ns | 1.00 |
| array/iteration/logical | 199699 ns | 200300.5 ns | 1.00 |
| array/iteration/scalar | 67377 ns | 66902 ns | 1.01 |
| array/permutedims/2d | 52720 ns | 52557 ns | 1.00 |
| array/permutedims/3d | 53191.5 ns | 52858 ns | 1.01 |
| array/permutedims/4d | 52327 ns | 51891 ns | 1.01 |
| array/random/rand/Float32 | 12947 ns | 12938 ns | 1.00 |
| array/random/rand/Int64 | 37236 ns | 30232 ns | 1.23 |
| array/random/rand!/Float32 | 8517 ns | 8486.333333333334 ns | 1.00 |
| array/random/rand!/Int64 | 26764 ns | 34141 ns | 0.78 |
| array/random/randn/Float32 | 41380 ns | 37762.5 ns | 1.10 |
| array/random/randn!/Float32 | 26645 ns | 31561 ns | 0.84 |
| array/reductions/mapreduce/Float32/1d | 35411 ns | 34932 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 40167 ns | 39395 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=1L | 51692 ns | 52264 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2 | 56570 ns | 56993 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2L | 69206 ns | 69821.5 ns | 0.99 |
| array/reductions/mapreduce/Int64/1d | 42851 ns | 42986 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 48347 ns | 42873 ns | 1.13 |
| array/reductions/mapreduce/Int64/dims=1L | 87721 ns | 87741 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 59691 ns | 60047 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=2L | 84689.5 ns | 85349 ns | 0.99 |
| array/reductions/reduce/Float32/1d | 35715.5 ns | 35002.5 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1 | 49584 ns | 39743 ns | 1.25 |
| array/reductions/reduce/Float32/dims=1L | 52037 ns | 51984 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 56907 ns | 57086 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 70059 ns | 69836.5 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 42850.5 ns | 43090 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1 | 42436 ns | 42912 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1L | 87624 ns | 87746 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 59926 ns | 59769 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84711 ns | 84712 ns | 1.00 |
| array/reverse/1d | 18236 ns | 18240 ns | 1.00 |
| array/reverse/1dL | 68886 ns | 68915 ns | 1.00 |
| array/reverse/1dL_inplace | 65913 ns | 65875 ns | 1.00 |
| array/reverse/1d_inplace | 8529 ns | 10243 ns | 0.83 |
| array/reverse/2d | 20748 ns | 20594 ns | 1.01 |
| array/reverse/2dL | 73039 ns | 72723 ns | 1.00 |
| array/reverse/2dL_inplace | 65994 ns | 65965 ns | 1.00 |
| array/reverse/2d_inplace | 10220 ns | 10404 ns | 0.98 |
| array/sorting/1d | 2735313 ns | 2734392 ns | 1.00 |
| array/sorting/2d | 1072679 ns | 1075205 ns | 1.00 |
| array/sorting/by | 3305049 ns | 3315166 ns | 1.00 |
| cuda/synchronization/context/auto | 1174.7 ns | 1168.9 ns | 1.00 |
| cuda/synchronization/context/blocking | 930.6888888888889 ns | 934.3225806451613 ns | 1.00 |
| cuda/synchronization/context/nonblocking | 7234.700000000001 ns | 8363.400000000001 ns | 0.87 |
| cuda/synchronization/stream/auto | 1021.5882352941177 ns | 1040.1333333333334 ns | 0.98 |
| cuda/synchronization/stream/blocking | 787.747572815534 ns | 841.1473684210526 ns | 0.94 |
| cuda/synchronization/stream/nonblocking | 7399.6 ns | 8156.4 ns | 0.91 |
| integration/byval/reference | 143949 ns | 144104 ns | 1.00 |
| integration/byval/slices=1 | 145864 ns | 145896 ns | 1.00 |
| integration/byval/slices=2 | 284558 ns | 284543 ns | 1.00 |
| integration/byval/slices=3 | 423124 ns | 423283 ns | 1.00 |
| integration/cudadevrt | 102623 ns | 102484 ns | 1.00 |
| integration/volumerhs | 23467460.5 ns | 23421671.5 ns | 1.00 |
| kernel/indexing | 13254 ns | 13338 ns | 0.99 |
| kernel/indexing_checked | 14088.5 ns | 14091 ns | 1.00 |
| kernel/launch | 2191.1111111111113 ns | 2181.166666666667 ns | 1.00 |
| kernel/occupancy | 722.158940397351 ns | 705.1118421052631 ns | 1.02 |
| kernel/rand | 17967 ns | 15255 ns | 1.18 |
| latency/import | 3812112212 ns | 3822749737.5 ns | 1.00 |
| latency/precompile | 4595809169 ns | 4597359574 ns | 1.00 |
| latency/ttfp | 4400844606.5 ns | 4411959980 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
---
The new tests uncovered a CSC-to-CSR conversion error in CUDA runtime 12.0 (fixed in 12.1 and above). I updated the tests to go through the COO format, which hopefully avoids the buggy code path.

MWE on runtime 12.0:

```julia
# version 12.0
julia> using CUDA, SparseArrays, CUDA.CUSPARSE

julia> A = sparse([1,1,1,1,2], [1,2,4,5,3], [1.0,2.0,3.0,4.0,5.0], 2, 5)
2×5 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
 1.0  2.0   ⋅   3.0  4.0
  ⋅    ⋅   5.0   ⋅    ⋅

julia> dA = CuSparseMatrixCSR(A) # wrong!
2×5 CuSparseMatrixCSR{Float64, Int32} with 5 stored entries:
 1.0  2.0  5.0  3.0   ⋅
  ⋅    ⋅    ⋅    ⋅   4.0

julia> CUDA.@allowscalar dA[1,1] # wrong!
0.0
```

MWE on runtime 12.1:

```julia
# version 12.1
julia> using CUDA, SparseArrays, CUDA.CUSPARSE

julia> A = sparse([1,1,1,1,2], [1,2,4,5,3], [1.0,2.0,3.0,4.0,5.0], 2, 5)
2×5 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
 1.0  2.0   ⋅   3.0  4.0
  ⋅    ⋅   5.0   ⋅    ⋅

julia> dA = CuSparseMatrixCSR(A) # correct!
2×5 CuSparseMatrixCSR{Float64, Int32} with 5 stored entries:
 1.0  2.0   ⋅   3.0  4.0
  ⋅    ⋅   5.0   ⋅    ⋅

julia> CUDA.@allowscalar dA[1,1] # correct!
1.0
```
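For reference, a minimal sketch of the COO detour (assuming the usual CUDA.jl conversion constructors between host and device sparse formats):

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays

A = sparse([1,1,1,1,2], [1,2,4,5,3], [1.0,2.0,3.0,4.0,5.0], 2, 5)

# Hop through COO instead of converting CSC -> CSR directly
# (the direct conversion is the buggy code path on runtime 12.0).
dA_coo = CuSparseMatrixCOO(A)       # host CSC -> device COO
dA_csr = CuSparseMatrixCSR(dA_coo)  # device COO -> device CSR
```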
---

Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master    #3032      +/-   ##
==========================================
- Coverage   90.43%   90.32%   -0.12%
==========================================
  Files         141      141
  Lines       12025    12165     +140
==========================================
+ Hits        10875    10988     +113
- Misses       1150     1177      +27
```

☔ View full report in Codecov by Sentry.
---
Gentle bump on this. Is there something I can do to support the review?

EDIT: I rebased; the GitHub UI was showing some apparent merge conflicts without showing me where the problems lie.
---
Looks like this might need some TLC to get up to date with the (unreleased) CUDA.jl 6 and the package splitting.
---
I think the problems related to the library split are fixed (just some CUDA -> CUDACore renamings). There are three remaining test failures that I don't quite understand, but they seem unrelated?

---
Nightly and Enzyme are allowed to fail, and the multi-GPU one is intermittent. So no relevant CI failures.

---
CSC -> COO -> CSR and collect(CSR) are all broken on 12.0. Therefore, we now go the fully manual route, constructing the CSR from CSC(Transpose(A)) and comparing element-wise; see the sketch below.
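A minimal sketch of that manual route, assuming plain SparseArrays on the host and CUDA.jl's array-based `CuSparseMatrixCSR` constructor; variable names are illustrative:

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays

A = sparse([1,1,1,1,2], [1,2,4,5,3], [1.0,2.0,3.0,4.0,5.0], 2, 5)

# The CSC storage of transpose(A) is exactly the CSR storage of A,
# so the device CSR can be built without any cusparse conversion call.
At = sparse(transpose(A))   # materialize Aᵀ in CSC form
dA = CuSparseMatrixCSR(CuVector{Int32}(At.colptr),  # row pointers of A
                       CuVector{Int32}(At.rowval),  # column indices of A
                       CuVector(At.nzval),          # nonzero values
                       size(A))
```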
- guard against accessing the last element of an empty vector
- use `checkbounds(A..)` to raise a proper error (see the sketch after this list)
- add tests:
  - check that a `BoundsError` is thrown
  - zero mask (full collapse of a dimension)
  - matrices of size (0,n) and (m,0)
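A hypothetical CPU-side sketch of these guards (not the PR's actual kernel code): Base's `checkbounds` already validates logical masks, so delegating to it produces the standard `BoundsError`, and the collapsed-dimension case is handled before any pointer access:

```julia
using SparseArrays

# Hypothetical sketch of the guards described above.
function masked_slice_sketch(A::SparseMatrixCSC, I::AbstractVector{Bool}, J::AbstractVector{Bool})
    # Throws a proper BoundsError when mask lengths don't match the axes.
    checkbounds(A, I, J)
    nrows = count(I)
    ncols = count(J)
    # Guard: with a zero mask the sliced dimension collapses; never index
    # into the last element of an empty pointer vector in that case.
    (nrows == 0 || ncols == 0) && return spzeros(eltype(A), nrows, ncols)
    return A[I, J]   # fall back to the generic path for this sketch
end
```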
maleadt left a comment:

LGTM. Too bad about the scalar getindexes that essentially synchronize the GPU before doing any work, but I'm not familiar enough with the domain to suggest alternatives here.
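To illustrate the concern (an illustrative snippet, not code from the PR): fetching even one device-side value, e.g. the last row pointer to learn a slice's nnz, forces a blocking host round trip before any kernel can launch:

```julia
using CUDA

rowPtr = CuVector{Int32}(1:11)   # stand-in for a CSR row-pointer array

# This single scalar read synchronizes the stream and copies one value
# back to the host, stalling the pipeline before the real work starts.
nnz_estimate = CUDA.@allowscalar rowPtr[end] - 1
```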
---

Slicing into a sparse matrix by providing a `Vector{Bool}` per axis to select a subset of rows and columns is quite useful. It is implemented for several types, for example "normal" (dense) matrices and sparse matrices. This PR implements this functionality for `CuSparseMatrixCSR` and `CuSparseMatrixCSC`. A usage sketch follows below.

AI disclaimer: Claude Code and ChatGPT helped me implement the kernels, but I think they are straightforward, and they are tested against the slicing results of the CPU code.
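A usage sketch of the new indexing path, assuming host-side `Vector{Bool}` masks as described and the standard host/device sparse conversions:

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays

A  = sprand(100, 50, 0.1)          # CPU reference matrix
dA = CuSparseMatrixCSR(A)

rmask = rand(Bool, 100)            # select a subset of rows
cmask = rand(Bool, 50)             # select a subset of columns

dB = dA[rmask, cmask]              # boolean-mask slicing from this PR
B  = A[rmask, cmask]               # same slicing on the CPU

# Round-trip to the host and compare against the CPU result.
@assert SparseMatrixCSC(dB) == B
```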