[CUSPARSE] create slices of sparse matrices using boolean masks #3032

maleadt merged 8 commits into JuliaGPU:master
Conversation
---

Your PR no longer requires formatting changes. Thank you for your contribution!

---

CUDA.jl Benchmarks
| Benchmark suite | Current: aabf559 | Previous: 01a0795 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 100441 ns | 101155 ns | 0.99 |
| array/accumulate/Float32/dims=1 | 76384 ns | 76811 ns | 0.99 |
| array/accumulate/Float32/dims=1L | 1583662.5 ns | 1585181 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143158.5 ns | 143761.5 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 656653.5 ns | 657146 ns | 1.00 |
| array/accumulate/Int64/1d | 118378 ns | 118686 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79220 ns | 79898.5 ns | 0.99 |
| array/accumulate/Int64/dims=1L | 1694073 ns | 1693836 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 155363 ns | 155970.5 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 961384 ns | 961435.5 ns | 1.00 |
| array/broadcast | 20447 ns | 20595 ns | 0.99 |
| array/construct | 1300.35 ns | 1360 ns | 0.96 |
| array/copy | 18749 ns | 18770 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 213448 ns | 214074.5 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 281725 ns | 282538.5 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 11352 ns | 11352 ns | 1 |
| array/iteration/findall/bool | 131262 ns | 131993 ns | 0.99 |
| array/iteration/findall/int | 147834 ns | 148953 ns | 0.99 |
| array/iteration/findfirst/bool | 81393 ns | 81399 ns | 1.00 |
| array/iteration/findfirst/int | 83708 ns | 83388 ns | 1.00 |
| array/iteration/findmin/1d | 86996 ns | 87034 ns | 1.00 |
| array/iteration/findmin/2d | 117120 ns | 116982 ns | 1.00 |
| array/iteration/logical | 199699 ns | 200300.5 ns | 1.00 |
| array/iteration/scalar | 67377 ns | 66902 ns | 1.01 |
| array/permutedims/2d | 52720 ns | 52557 ns | 1.00 |
| array/permutedims/3d | 53191.5 ns | 52858 ns | 1.01 |
| array/permutedims/4d | 52327 ns | 51891 ns | 1.01 |
| array/random/rand/Float32 | 12947 ns | 12938 ns | 1.00 |
| array/random/rand/Int64 | 37236 ns | 30232 ns | 1.23 |
| array/random/rand!/Float32 | 8517 ns | 8486.333333333334 ns | 1.00 |
| array/random/rand!/Int64 | 26764 ns | 34141 ns | 0.78 |
| array/random/randn/Float32 | 41380 ns | 37762.5 ns | 1.10 |
| array/random/randn!/Float32 | 26645 ns | 31561 ns | 0.84 |
| array/reductions/mapreduce/Float32/1d | 35411 ns | 34932 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 40167 ns | 39395 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=1L | 51692 ns | 52264 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2 | 56570 ns | 56993 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2L | 69206 ns | 69821.5 ns | 0.99 |
| array/reductions/mapreduce/Int64/1d | 42851 ns | 42986 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 48347 ns | 42873 ns | 1.13 |
| array/reductions/mapreduce/Int64/dims=1L | 87721 ns | 87741 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 59691 ns | 60047 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=2L | 84689.5 ns | 85349 ns | 0.99 |
| array/reductions/reduce/Float32/1d | 35715.5 ns | 35002.5 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1 | 49584 ns | 39743 ns | 1.25 |
| array/reductions/reduce/Float32/dims=1L | 52037 ns | 51984 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 56907 ns | 57086 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 70059 ns | 69836.5 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 42850.5 ns | 43090 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1 | 42436 ns | 42912 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1L | 87624 ns | 87746 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 59926 ns | 59769 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84711 ns | 84712 ns | 1.00 |
| array/reverse/1d | 18236 ns | 18240 ns | 1.00 |
| array/reverse/1dL | 68886 ns | 68915 ns | 1.00 |
| array/reverse/1dL_inplace | 65913 ns | 65875 ns | 1.00 |
| array/reverse/1d_inplace | 8529 ns | 10243 ns | 0.83 |
| array/reverse/2d | 20748 ns | 20594 ns | 1.01 |
| array/reverse/2dL | 73039 ns | 72723 ns | 1.00 |
| array/reverse/2dL_inplace | 65994 ns | 65965 ns | 1.00 |
| array/reverse/2d_inplace | 10220 ns | 10404 ns | 0.98 |
| array/sorting/1d | 2735313 ns | 2734392 ns | 1.00 |
| array/sorting/2d | 1072679 ns | 1075205 ns | 1.00 |
| array/sorting/by | 3305049 ns | 3315166 ns | 1.00 |
| cuda/synchronization/context/auto | 1174.7 ns | 1168.9 ns | 1.00 |
| cuda/synchronization/context/blocking | 930.6888888888889 ns | 934.3225806451613 ns | 1.00 |
| cuda/synchronization/context/nonblocking | 7234.700000000001 ns | 8363.400000000001 ns | 0.87 |
| cuda/synchronization/stream/auto | 1021.5882352941177 ns | 1040.1333333333334 ns | 0.98 |
| cuda/synchronization/stream/blocking | 787.747572815534 ns | 841.1473684210526 ns | 0.94 |
| cuda/synchronization/stream/nonblocking | 7399.6 ns | 8156.4 ns | 0.91 |
| integration/byval/reference | 143949 ns | 144104 ns | 1.00 |
| integration/byval/slices=1 | 145864 ns | 145896 ns | 1.00 |
| integration/byval/slices=2 | 284558 ns | 284543 ns | 1.00 |
| integration/byval/slices=3 | 423124 ns | 423283 ns | 1.00 |
| integration/cudadevrt | 102623 ns | 102484 ns | 1.00 |
| integration/volumerhs | 23467460.5 ns | 23421671.5 ns | 1.00 |
| kernel/indexing | 13254 ns | 13338 ns | 0.99 |
| kernel/indexing_checked | 14088.5 ns | 14091 ns | 1.00 |
| kernel/launch | 2191.1111111111113 ns | 2181.166666666667 ns | 1.00 |
| kernel/occupancy | 722.158940397351 ns | 705.1118421052631 ns | 1.02 |
| kernel/rand | 17967 ns | 15255 ns | 1.18 |
| latency/import | 3812112212 ns | 3822749737.5 ns | 1.00 |
| latency/precompile | 4595809169 ns | 4597359574 ns | 1.00 |
| latency/ttfp | 4400844606.5 ns | 4411959980 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
---
The new tests uncovered a CSC-to-CSR conversion error in CUDA runtime 12.0 (fixed in 12.1 and above). I updated the tests to go through the COO format, which hopefully avoids the buggy code path.

MWE on runtime 12.0:

```julia
# version 12.0
julia> using CUDA, SparseArrays, CUDA.CUSPARSE

julia> A = sparse([1,1,1,1,2], [1,2,4,5,3], [1.0,2.0,3.0,4.0,5.0], 2, 5)
2×5 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
 1.0  2.0   ⋅   3.0  4.0
  ⋅    ⋅   5.0   ⋅    ⋅

julia> dA = CuSparseMatrixCSR(A) # wrong!
2×5 CuSparseMatrixCSR{Float64, Int32} with 5 stored entries:
 1.0  2.0  5.0  3.0   ⋅
  ⋅    ⋅    ⋅    ⋅   4.0

julia> CUDA.@allowscalar dA[1,1] # wrong!
0.0
```

MWE on runtime 12.1:

```julia
# version 12.1
julia> using CUDA, SparseArrays, CUDA.CUSPARSE

julia> A = sparse([1,1,1,1,2], [1,2,4,5,3], [1.0,2.0,3.0,4.0,5.0], 2, 5)
2×5 SparseMatrixCSC{Float64, Int64} with 5 stored entries:
 1.0  2.0   ⋅   3.0  4.0
  ⋅    ⋅   5.0   ⋅    ⋅

julia> dA = CuSparseMatrixCSR(A) # correct!
2×5 CuSparseMatrixCSR{Float64, Int32} with 5 stored entries:
 1.0  2.0   ⋅   3.0  4.0
  ⋅    ⋅   5.0   ⋅    ⋅

julia> CUDA.@allowscalar dA[1,1] # correct!
1.0
```
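For reference, a minimal sketch of the COO detour (assuming the usual CUDA.jl conversion constructors between host and device sparse formats):

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays

A = sparse([1,1,1,1,2], [1,2,4,5,3], [1.0,2.0,3.0,4.0,5.0], 2, 5)

# Hop through COO instead of converting CSC -> CSR directly
# (the direct conversion is the buggy code path on runtime 12.0).
dA_coo = CuSparseMatrixCOO(A)       # host CSC -> device COO
dA_csr = CuSparseMatrixCSR(dA_coo)  # device COO -> device CSR
```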
---

Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master    #3032      +/-   ##
==========================================
- Coverage   90.43%   90.32%   -0.12%
==========================================
  Files         141      141
  Lines       12025    12165     +140
==========================================
+ Hits        10875    10988     +113
- Misses       1150     1177      +27
```

☔ View full report in Codecov by Sentry.
---
Gentle bump on this. Is there something I can do to support the review?

EDIT: I rebased; the GitHub UI was showing some apparent merge conflicts without showing me where the problems lie.
---
Looks like this might need some TLC to get up to date with the (unreleased) CUDA.jl 6 and the package splitting.
---
I think the problems related to the library split are fixed (just some CUDA -> CUDACore renamings). There are three remaining test failures that I don't quite understand, but they seem unrelated?

---
Nightly and Enzyme are allowed to fail, and the multi-GPU one is intermittent. So no relevant CI failures.

---
CSC -> COO -> CSR and collect(CSR) are all broken on 12.0. Therefore, we now go the fully manual route, constructing the CSR from CSC(Transpose(A)) and comparing element-wise; see the sketch below.
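A minimal sketch of that manual route, assuming plain SparseArrays on the host and CUDA.jl's array-based `CuSparseMatrixCSR` constructor; variable names are illustrative:

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays

A = sparse([1,1,1,1,2], [1,2,4,5,3], [1.0,2.0,3.0,4.0,5.0], 2, 5)

# The CSC storage of transpose(A) is exactly the CSR storage of A,
# so the device CSR can be built without any cusparse conversion call.
At = sparse(transpose(A))   # materialize Aᵀ in CSC form
dA = CuSparseMatrixCSR(CuVector{Int32}(At.colptr),  # row pointers of A
                       CuVector{Int32}(At.rowval),  # column indices of A
                       CuVector(At.nzval),          # nonzero values
                       size(A))
```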
- guard against accessing the last element of an empty vector
- use `checkbounds(A..)` to raise a proper error (see the sketch after this list)
- add tests:
  - check that a `BoundsError` is thrown
  - zero mask (full collapse of a dimension)
  - matrices of size (0,n) and (m,0)
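A hypothetical CPU-side sketch of these guards (not the PR's actual kernel code): Base's `checkbounds` already validates logical masks, so delegating to it produces the standard `BoundsError`, and the collapsed-dimension case is handled before any pointer access:

```julia
using SparseArrays

# Hypothetical sketch of the guards described above.
function masked_slice_sketch(A::SparseMatrixCSC, I::AbstractVector{Bool}, J::AbstractVector{Bool})
    # Throws a proper BoundsError when mask lengths don't match the axes.
    checkbounds(A, I, J)
    nrows = count(I)
    ncols = count(J)
    # Guard: with a zero mask the sliced dimension collapses; never index
    # into the last element of an empty pointer vector in that case.
    (nrows == 0 || ncols == 0) && return spzeros(eltype(A), nrows, ncols)
    return A[I, J]   # fall back to the generic path for this sketch
end
```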
maleadt left a comment:

LGTM. Too bad about the scalar getindexes that essentially synchronize the GPU before doing any work, but I'm not familiar enough with the domain to suggest alternatives here.
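To illustrate the concern (an illustrative snippet, not code from the PR): fetching even one device-side value, e.g. the last row pointer to learn a slice's nnz, forces a blocking host round trip before any kernel can launch:

```julia
using CUDA

rowPtr = CuVector{Int32}(1:11)   # stand-in for a CSR row-pointer array

# This single scalar read synchronizes the stream and copies one value
# back to the host, stalling the pipeline before the real work starts.
nnz_estimate = CUDA.@allowscalar rowPtr[end] - 1
```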
---

Slicing into a sparse matrix by providing a `Vector{Bool}` per axis to select a subset of rows and columns is quite useful. It is implemented for several types, for example "normal" (dense) matrices and sparse matrices. This PR implements this functionality for `CuSparseMatrixCSR` and `CuSparseMatrixCSC`. A usage sketch follows below.

AI disclaimer: Claude Code and ChatGPT helped me implement the kernels, but I think they are straightforward, and they are tested against the slicing results of the CPU code.
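A usage sketch of the new indexing path, assuming host-side `Vector{Bool}` masks as described and the standard host/device sparse conversions:

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays

A  = sprand(100, 50, 0.1)          # CPU reference matrix
dA = CuSparseMatrixCSR(A)

rmask = rand(Bool, 100)            # select a subset of rows
cmask = rand(Bool, 50)             # select a subset of columns

dB = dA[rmask, cmask]              # boolean-mask slicing from this PR
B  = A[rmask, cmask]               # same slicing on the CPU

# Round-trip to the host and compare against the CPU result.
@assert SparseMatrixCSC(dB) == B
```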