`PerDevice.get!` cached `(context(), value)`, where `context()` is the currently-active context, but subsequent lookups compared against `device_context(id)` — the target device's context. Whenever `get!(x, dev)` was invoked from a context belonging to a *different* device (e.g., `pool_create(other_dev)` called from inside `context!(context(src))`), the comparison mismatched on every later lookup and the constructor ran again, creating a fresh value per call and leaking the previous one. Store `context(dev)` instead, so the cache key matches the lookup key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
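The mismatch is easiest to see in a stripped-down model. The sketch below is a hypothetical Python stand-in, not CUDA.jl's actual implementation: `active_context` models `context()`, `device_contexts[dev]` models `context(dev)`/`device_context(id)`, and the cache stores `(key, value)` pairs exactly as described above.

```python
# Hypothetical model of the cache-key bug (names are illustrative, not CUDA.jl's).
active_context = "ctx_dev0"                        # stand-in for context(): device 0's context is active
device_contexts = {0: "ctx_dev0", 1: "ctx_dev1"}   # stand-in for context(dev)

class PerDevice:
    def __init__(self):
        self.cache = {}          # dev -> (context_key, value)
        self.constructions = 0   # how many times the "constructor" ran

    def get(self, dev, buggy=False):
        # Buggy version stores the *currently active* context as the key;
        # fixed version stores the target device's context.
        key = active_context if buggy else device_contexts[dev]
        entry = self.cache.get(dev)
        # Lookup always compares against the target device's context.
        if entry is None or entry[0] != device_contexts[dev]:
            self.constructions += 1          # re-runs every call in the buggy case: leak
            entry = (key, object())
            self.cache[dev] = entry
        return entry[1]

# Buggy: get!(x, dev=1) invoked while device 0's context is active.
buggy = PerDevice()
buggy.get(1, buggy=True)
buggy.get(1, buggy=True)
assert buggy.constructions == 2   # cache never hits; a fresh value per call

# Fixed: key on context(dev), so the cached key matches the lookup key.
fixed = PerDevice()
fixed.get(1)
fixed.get(1)
assert fixed.constructions == 1   # second call is a cache hit
```

With the buggy keying, every later lookup sees `"ctx_dev0" != "ctx_dev1"` and reconstructs; keying on the device's own context makes the second call a hit.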
`Base.convert(::Type{CuPtr{T}}, ::Managed)` granted pool-level peer
access via `cuMemPoolSetAccess` on every cross-device pointer
conversion. This worked around an apparent CUDA driver bug by making it
much worse: a minimal C reproducer confirms that a single
`cuMemPoolSetAccess` call on a stream-ordered pool — even the
documented once-at-creation pattern, done before any allocations come
out of the pool — causes subsequent peer-direction data writes into
allocations from that pool (whether via `cuMemcpyPeerAsync` or via a
kernel on the peer device) to silently write zeros on driver 590.48.01
/ CUDA 13.2 / Turing sm_75. The API returns `CUDA_SUCCESS` and
`cuMemPoolGetAccess` reports the access is set, but the data-plane
write is dropped. `compute-sanitizer` additionally flags each call
with a bogus "HOST/HOST_NUMA pools are always read-write accessible on
the HOST" warning even though the access descriptor is
`CU_MEM_LOCATION_TYPE_DEVICE` on a device pool. Reported upstream
as NVIDIA bug #6098762.
`cuMemcpyPeerAsync` is a driver-mediated copy that only requires
context-level peer access (`cuCtxEnablePeerAccess`, already enabled
above) — not pool-level access — so removing the call fixes `copyto!`
between CuArrays on different devices without needing the
driver-bug-triggering API call. Callers that genuinely need
cross-device kernel access (e.g., cuBLASXt) already configure pool
access themselves and are unaffected by this change (though they will
still hit the driver bug in the same way the pre-fix code did).
Fixes the flaky "issue 1136: copies between devices" testset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[only special]
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master    #3112      +/-   ##
==========================================
+ Coverage   10.19%   16.56%   +6.36%
==========================================
  Files         119      120       +1
  Lines        9198     9594     +396
==========================================
+ Hits          938     1589     +651
+ Misses       8260     8005     -255
```
CUDA.jl Benchmarks
| Benchmark suite | Current: 08bb31d | Previous: e0e295f | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 101551 ns | 100878 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 77097 ns | 75855 ns | 1.02 |
| array/accumulate/Float32/dims=1L | 1593830 ns | 1585504 ns | 1.01 |
| array/accumulate/Float32/dims=2 | 144382 ns | 143115.5 ns | 1.01 |
| array/accumulate/Float32/dims=2L | 660133 ns | 657101 ns | 1.00 |
| array/accumulate/Int64/1d | 118831 ns | 118250 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 80265 ns | 79820.5 ns | 1.01 |
| array/accumulate/Int64/dims=1L | 1704679 ns | 1694871 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 156762 ns | 155746 ns | 1.01 |
| array/accumulate/Int64/dims=2L | 961755 ns | 961802 ns | 1.00 |
| array/broadcast | 20353 ns | 20486 ns | 0.99 |
| array/construct | 1301.7 ns | 1263.9 ns | 1.03 |
| array/copy | 18210 ns | 17962 ns | 1.01 |
| array/copyto!/cpu_to_gpu | 215057 ns | 214197 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 283657 ns | 281343 ns | 1.01 |
| array/copyto!/gpu_to_gpu | 10929 ns | 10794 ns | 1.01 |
| array/iteration/findall/bool | 134807 ns | 134478 ns | 1.00 |
| array/iteration/findall/int | 149792 ns | 149314.5 ns | 1.00 |
| array/iteration/findfirst/bool | 81542.5 ns | 81113 ns | 1.01 |
| array/iteration/findfirst/int | 84112 ns | 83293 ns | 1.01 |
| array/iteration/findmin/1d | 85798.5 ns | 84555 ns | 1.01 |
| array/iteration/findmin/2d | 116649 ns | 116516 ns | 1.00 |
| array/iteration/logical | 199215.5 ns | 197262.5 ns | 1.01 |
| array/iteration/scalar | 67829 ns | 67092 ns | 1.01 |
| array/permutedims/2d | 52186 ns | 52211 ns | 1.00 |
| array/permutedims/3d | 52766 ns | 52764 ns | 1.00 |
| array/permutedims/4d | 51720 ns | 51452 ns | 1.01 |
| array/random/rand/Float32 | 12958 ns | 12943 ns | 1.00 |
| array/random/rand/Int64 | 25183 ns | 24996 ns | 1.01 |
| array/random/rand!/Float32 | 8472 ns | 8402.333333333334 ns | 1.01 |
| array/random/rand!/Int64 | 21843 ns | 21937 ns | 1.00 |
| array/random/randn/Float32 | 38056.5 ns | 36954 ns | 1.03 |
| array/random/randn!/Float32 | 31051 ns | 30982 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 35035 ns | 34678 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 40090 ns | 39206 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=1L | 51351 ns | 51259.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 56448.5 ns | 56274 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 69248 ns | 69346 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 42631 ns | 42412 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1 | 42885 ns | 42188 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=1L | 87135 ns | 87287 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 59453 ns | 59630 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 84732 ns | 84743 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 35232 ns | 34235 ns | 1.03 |
| array/reductions/reduce/Float32/dims=1 | 48638 ns | 39618.5 ns | 1.23 |
| array/reductions/reduce/Float32/dims=1L | 51287 ns | 51305 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 56586 ns | 56667 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 69400 ns | 69784 ns | 0.99 |
| array/reductions/reduce/Int64/1d | 42663 ns | 42369 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 47447.5 ns | 42478 ns | 1.12 |
| array/reductions/reduce/Int64/dims=1L | 87063 ns | 87248 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 59467 ns | 59729 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84381 ns | 84769 ns | 1.00 |
| array/reverse/1d | 17779 ns | 18015.5 ns | 0.99 |
| array/reverse/1dL | 68359 ns | 68638 ns | 1.00 |
| array/reverse/1dL_inplace | 65696 ns | 65779 ns | 1.00 |
| array/reverse/1d_inplace | 8475.333333333334 ns | 8649.666666666666 ns | 0.98 |
| array/reverse/2d | 20773 ns | 20711 ns | 1.00 |
| array/reverse/2dL | 72907 ns | 72634 ns | 1.00 |
| array/reverse/2dL_inplace | 65831 ns | 65985 ns | 1.00 |
| array/reverse/2d_inplace | 9983 ns | 10088 ns | 0.99 |
| array/sorting/1d | 2744620 ns | 2734295 ns | 1.00 |
| array/sorting/2d | 1072540 ns | 1068343 ns | 1.00 |
| array/sorting/by | 3314456 ns | 3304353 ns | 1.00 |
| cuda/synchronization/context/auto | 1120.9 ns | 1159.9 ns | 0.97 |
| cuda/synchronization/context/blocking | 921.0555555555555 ns | 896.4878048780488 ns | 1.03 |
| cuda/synchronization/context/nonblocking | 7122 ns | 7409.1 ns | 0.96 |
| cuda/synchronization/stream/auto | 1002.5454545454545 ns | 1027.578947368421 ns | 0.98 |
| cuda/synchronization/stream/blocking | 793.9795918367347 ns | 841.2941176470588 ns | 0.94 |
| cuda/synchronization/stream/nonblocking | 7377.4 ns | 7567.799999999999 ns | 0.97 |
| integration/byval/reference | 143725 ns | 143876 ns | 1.00 |
| integration/byval/slices=1 | 145722 ns | 145738.5 ns | 1.00 |
| integration/byval/slices=2 | 284373 ns | 284423 ns | 1.00 |
| integration/byval/slices=3 | 423145 ns | 423173 ns | 1.00 |
| integration/cudadevrt | 102373 ns | 102437 ns | 1.00 |
| integration/volumerhs | 23469620.5 ns | 23470585 ns | 1.00 |
| kernel/indexing | 13127 ns | 13311 ns | 0.99 |
| kernel/indexing_checked | 13950 ns | 14095 ns | 0.99 |
| kernel/launch | 2079.6666666666665 ns | 2235.1111111111113 ns | 0.93 |
| kernel/occupancy | 671.243670886076 ns | 693.6190476190476 ns | 0.97 |
| kernel/rand | 14274 ns | 18172.5 ns | 0.79 |
| latency/import | 3826822993 ns | 3820990542 ns | 1.00 |
| latency/precompile | 4595981474.5 ns | 4593009584 ns | 1.00 |
| latency/ttfp | 4416850028.5 ns | 4397252952 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Calling `cuMemPoolSetAccess` somehow seems to break `cuMemcpyPeerAsync`. MWE: Replacing the memcpy by a kernel-based one doesn't help. So temporarily disabling this until I hear back from NVIDIA.
Works around #2930