Fix GC race in CuRef getindex causing intermittent CUDA errors #3087
Merged
The `getindex` methods for `CuRefValue` and `CuRefArray` only preserved the CPU `Ref` with `GC.@preserve`, but not the GPU reference. After extracting the raw device pointer via `unsafe_convert`, the GC could collect the `CuRefValue` (running its `pool_free` finalizer) before the `unsafe_copyto!` memcpy completed, resulting in use-after-free.

This manifested as intermittent `CUDA error: invalid argument` or segfaults under GC pressure, particularly in multi-threaded workloads performing many CUBLAS operations (e.g., dot products in Arnoldi iteration).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
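The pattern behind the fix can be illustrated without a GPU. The following is a minimal CPU-only sketch, not the actual diff from this PR: `ManagedBuf` and `read_value` are hypothetical names, `Libc.malloc`/`Libc.free` stand in for the device allocation and its `pool_free` finalizer, and the copy is an ordinary host memcpy. The point it shows is that every object whose raw pointer is in use must be listed in `GC.@preserve`, not only the destination `Ref`.

```julia
# Hypothetical stand-in for a GPU reference: owns raw memory and frees it
# in a finalizer, much like a device buffer released on collection.
mutable struct ManagedBuf
    ptr::Ptr{Float64}
    function ManagedBuf(x::Float64)
        p = Ptr{Float64}(Libc.malloc(sizeof(Float64)))
        unsafe_store!(p, x)
        buf = new(p)
        # Assumption: this plays the role of the `pool_free` finalizer.
        finalizer(b -> Libc.free(b.ptr), buf)
        return buf
    end
end

# Read the stored value back through raw pointers.
function read_value(buf::ManagedBuf)
    dst = Ref{Float64}()
    # Root BOTH objects for the duration of the copy. Preserving only `dst`
    # would let the GC finalize `buf` once its raw pointer has been
    # extracted, freeing the memory before the copy completes.
    GC.@preserve dst buf begin
        src = buf.ptr  # a raw pointer does not keep `buf` alive by itself
        unsafe_copyto!(Base.unsafe_convert(Ptr{Float64}, dst), src, 1)
    end
    return dst[]
end

result = read_value(ManagedBuf(3.5))
```

Dropping `buf` from the `GC.@preserve` list would reproduce the shape of the bug described above: the code would usually still work, and fail only when a collection happens to run between the pointer extraction and the copy.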
Codecov Report

✅ All modified and coverable lines are covered by tests.

| Metric | master | #3087 | +/- |
|---|---|---|---|
| Coverage | 90.41% | 90.42% | |
| Files | 141 | 141 | |
| Lines | 11993 | 11993 | |
| Hits | 10844 | 10845 | +1 |
| Misses | 1149 | 1148 | -1 |
CUDA.jl Benchmarks
| Benchmark suite | Current: e49bd1b | Previous: 5f45772 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 101132 ns | 101495 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 76479 ns | 76898 ns | 0.99 |
| array/accumulate/Float32/dims=1L | 1583539 ns | 1585143.5 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143853 ns | 143801 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 657727 ns | 657240 ns | 1.00 |
| array/accumulate/Int64/1d | 118411 ns | 118623 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 80131 ns | 80572.5 ns | 0.99 |
| array/accumulate/Int64/dims=1L | 1695177 ns | 1693852 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 155949 ns | 156484 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 961597.5 ns | 961603 ns | 1.00 |
| array/broadcast | 20514 ns | 20294 ns | 1.01 |
| array/construct | 1331.2 ns | 1320.4 ns | 1.01 |
| array/copy | 18720 ns | 18780 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 215223.5 ns | 214684 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 283574 ns | 282072 ns | 1.01 |
| array/copyto!/gpu_to_gpu | 11408 ns | 11361 ns | 1.00 |
| array/iteration/findall/bool | 131383 ns | 131719.5 ns | 1.00 |
| array/iteration/findall/int | 148780 ns | 148883 ns | 1.00 |
| array/iteration/findfirst/bool | 80906 ns | 81470.5 ns | 0.99 |
| array/iteration/findfirst/int | 83533.5 ns | 83414 ns | 1.00 |
| array/iteration/findmin/1d | 88432.5 ns | 89419 ns | 0.99 |
| array/iteration/findmin/2d | 117090.5 ns | 117365 ns | 1.00 |
| array/iteration/logical | 199596 ns | 207612 ns | 0.96 |
| array/iteration/scalar | 67301 ns | 66780 ns | 1.01 |
| array/permutedims/2d | 52399 ns | 52471.5 ns | 1.00 |
| array/permutedims/3d | 52919 ns | 53137 ns | 1.00 |
| array/permutedims/4d | 52303 ns | 52429 ns | 1.00 |
| array/random/rand/Float32 | 13180 ns | 13089 ns | 1.01 |
| array/random/rand/Int64 | 37361 ns | 37236 ns | 1.00 |
| array/random/rand!/Float32 | 8520.333333333334 ns | 8527.666666666666 ns | 1.00 |
| array/random/rand!/Int64 | 34437.5 ns | 34109.5 ns | 1.01 |
| array/random/randn/Float32 | 44084 ns | 38147 ns | 1.16 |
| array/random/randn!/Float32 | 31665 ns | 31640 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 35194 ns | 34735.5 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 40837 ns | 40760 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 51944 ns | 51917 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 56712 ns | 56503.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 69408 ns | 69496.5 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 42852 ns | 42820 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1 | 42338.5 ns | 44181 ns | 0.96 |
| array/reductions/mapreduce/Int64/dims=1L | 87836 ns | 87798 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 59872 ns | 59808 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 85157 ns | 85232 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 35522 ns | 34883 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1 | 43353.5 ns | 39758 ns | 1.09 |
| array/reductions/reduce/Float32/dims=1L | 52272 ns | 52166 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 57289 ns | 56925 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 70052 ns | 69909 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 43171 ns | 42673 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 51441.5 ns | 42123 ns | 1.22 |
| array/reductions/reduce/Int64/dims=1L | 87779 ns | 87782 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 59682 ns | 59551 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84985.5 ns | 84796 ns | 1.00 |
| array/reverse/1d | 18621 ns | 18432.5 ns | 1.01 |
| array/reverse/1dL | 69138 ns | 69025 ns | 1.00 |
| array/reverse/1dL_inplace | 65918 ns | 65968 ns | 1.00 |
| array/reverse/1d_inplace | 8552.333333333334 ns | 10240.666666666666 ns | 0.84 |
| array/reverse/2d | 20669 ns | 20709 ns | 1.00 |
| array/reverse/2dL | 72745 ns | 72815 ns | 1.00 |
| array/reverse/2dL_inplace | 66016 ns | 65992 ns | 1.00 |
| array/reverse/2d_inplace | 10186 ns | 11117.5 ns | 0.92 |
| array/sorting/1d | 2734661 ns | 2754859 ns | 0.99 |
| array/sorting/2d | 1068455 ns | 1075967 ns | 0.99 |
| array/sorting/by | 3303203 ns | 3328240 ns | 0.99 |
| cuda/synchronization/context/auto | 1158.4 ns | 1192.4 ns | 0.97 |
| cuda/synchronization/context/blocking | 947.8181818181819 ns | 947.7391304347826 ns | 1.00 |
| cuda/synchronization/context/nonblocking | 7681.6 ns | 7660.1 ns | 1.00 |
| cuda/synchronization/stream/auto | 1018.4545454545455 ns | 1032.5 ns | 0.99 |
| cuda/synchronization/stream/blocking | 824.8139534883721 ns | 841.5588235294117 ns | 0.98 |
| cuda/synchronization/stream/nonblocking | 7679.9 ns | 7189.6 ns | 1.07 |
| integration/byval/reference | 144062 ns | 143997 ns | 1.00 |
| integration/byval/slices=1 | 145976 ns | 145776 ns | 1.00 |
| integration/byval/slices=2 | 284694 ns | 284427 ns | 1.00 |
| integration/byval/slices=3 | 423386 ns | 423129 ns | 1.00 |
| integration/cudadevrt | 102770 ns | 102598 ns | 1.00 |
| integration/volumerhs | 9440456 ns | 9429742.5 ns | 1.00 |
| kernel/indexing | 13404 ns | 13331 ns | 1.01 |
| kernel/indexing_checked | 14139 ns | 14116 ns | 1.00 |
| kernel/launch | 2199.222222222222 ns | 2147 ns | 1.02 |
| kernel/occupancy | 679.9019607843137 ns | 660.5723270440252 ns | 1.03 |
| kernel/rand | 16496 ns | 15598 ns | 1.06 |
| latency/import | 3837634365 ns | 3807044359.5 ns | 1.01 |
| latency/precompile | 4602411854.5 ns | 4590923492 ns | 1.00 |
| latency/ttfp | 4415248089.5 ns | 4392969126 ns | 1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
Fixes #3012