Fix Int32 overflow in !range metadata for large intrinsics#3119
Open
rainerrodrigues wants to merge 6 commits intoJuliaGPU:masterfrom
Open
Fix Int32 overflow in !range metadata for large intrinsics#3119rainerrodrigues wants to merge 6 commits intoJuliaGPU:masterfrom
rainerrodrigues wants to merge 6 commits intoJuliaGPU:masterfrom
Conversation
Contributor
There was a problem hiding this comment.
CUDA.jl Benchmarks
Details
| Benchmark suite | Current: 7e44cec | Previous: d627b3e | Ratio |
|---|---|---|---|
array/accumulate/Float32/1d |
101339 ns |
100853 ns |
1.00 |
array/accumulate/Float32/dims=1 |
77260 ns |
76362 ns |
1.01 |
array/accumulate/Float32/dims=1L |
1586550 ns |
1585357 ns |
1.00 |
array/accumulate/Float32/dims=2 |
144263 ns |
143506 ns |
1.01 |
array/accumulate/Float32/dims=2L |
658401 ns |
657416 ns |
1.00 |
array/accumulate/Int64/1d |
118883.5 ns |
118459.5 ns |
1.00 |
array/accumulate/Int64/dims=1 |
80473 ns |
79599 ns |
1.01 |
array/accumulate/Int64/dims=1L |
1695761.5 ns |
1694409 ns |
1.00 |
array/accumulate/Int64/dims=2 |
156095 ns |
155450.5 ns |
1.00 |
array/accumulate/Int64/dims=2L |
962295.5 ns |
960958 ns |
1.00 |
array/broadcast |
20701 ns |
20282 ns |
1.02 |
array/construct |
1307.7 ns |
1328.2 ns |
0.98 |
array/copy |
18275 ns |
17911 ns |
1.02 |
array/copyto!/cpu_to_gpu |
214231.5 ns |
212984 ns |
1.01 |
array/copyto!/gpu_to_cpu |
283233 ns |
280259 ns |
1.01 |
array/copyto!/gpu_to_gpu |
11021 ns |
10684 ns |
1.03 |
array/iteration/findall/bool |
134462 ns |
134327 ns |
1.00 |
array/iteration/findall/int |
149883 ns |
149222 ns |
1.00 |
array/iteration/findfirst/bool |
81431 ns |
81163 ns |
1.00 |
array/iteration/findfirst/int |
83343 ns |
82984 ns |
1.00 |
array/iteration/findmin/1d |
87141.5 ns |
85430.5 ns |
1.02 |
array/iteration/findmin/2d |
117107 ns |
116921 ns |
1.00 |
array/iteration/logical |
199980 ns |
196920 ns |
1.02 |
array/iteration/scalar |
68578 ns |
67453 ns |
1.02 |
array/permutedims/2d |
52164 ns |
52024.5 ns |
1.00 |
array/permutedims/3d |
52715 ns |
52690 ns |
1.00 |
array/permutedims/4d |
53830.5 ns |
51921.5 ns |
1.04 |
array/random/rand/Float32 |
12904 ns |
12625 ns |
1.02 |
array/random/rand/Int64 |
24944 ns |
24502 ns |
1.02 |
array/random/rand!/Float32 |
8426.333333333334 ns |
8842 ns |
0.95 |
array/random/rand!/Int64 |
21569 ns |
21520 ns |
1.00 |
array/random/randn/Float32 |
37511 ns |
38339 ns |
0.98 |
array/random/randn!/Float32 |
30819.5 ns |
30558 ns |
1.01 |
array/reductions/mapreduce/Float32/1d |
34615 ns |
34493 ns |
1.00 |
array/reductions/mapreduce/Float32/dims=1 |
49961 ns |
39238 ns |
1.27 |
array/reductions/mapreduce/Float32/dims=1L |
51391 ns |
51115 ns |
1.01 |
array/reductions/mapreduce/Float32/dims=2 |
58098 ns |
56362 ns |
1.03 |
array/reductions/mapreduce/Float32/dims=2L |
67940 ns |
68855 ns |
0.99 |
array/reductions/mapreduce/Int64/1d |
42975.5 ns |
41840 ns |
1.03 |
array/reductions/mapreduce/Int64/dims=1 |
42620 ns |
43041.5 ns |
0.99 |
array/reductions/mapreduce/Int64/dims=1L |
87221 ns |
87062 ns |
1.00 |
array/reductions/mapreduce/Int64/dims=2 |
60865 ns |
59400.5 ns |
1.02 |
array/reductions/mapreduce/Int64/dims=2L |
84250 ns |
84650 ns |
1.00 |
array/reductions/reduce/Float32/1d |
35253.5 ns |
34748 ns |
1.01 |
array/reductions/reduce/Float32/dims=1 |
48912 ns |
42567 ns |
1.15 |
array/reductions/reduce/Float32/dims=1L |
51580 ns |
51269 ns |
1.01 |
array/reductions/reduce/Float32/dims=2 |
58356 ns |
56421.5 ns |
1.03 |
array/reductions/reduce/Float32/dims=2L |
68854 ns |
69423 ns |
0.99 |
array/reductions/reduce/Int64/1d |
42724 ns |
41786 ns |
1.02 |
array/reductions/reduce/Int64/dims=1 |
42554 ns |
41701 ns |
1.02 |
array/reductions/reduce/Int64/dims=1L |
87190 ns |
87004 ns |
1.00 |
array/reductions/reduce/Int64/dims=2 |
60933 ns |
59173 ns |
1.03 |
array/reductions/reduce/Int64/dims=2L |
84180 ns |
84474 ns |
1.00 |
array/reverse/1d |
17974 ns |
17689 ns |
1.02 |
array/reverse/1dL |
68610 ns |
68239 ns |
1.01 |
array/reverse/1dL_inplace |
65750 ns |
65528 ns |
1.00 |
array/reverse/1d_inplace |
10291.666666666666 ns |
8294.666666666666 ns |
1.24 |
array/reverse/2d |
20772 ns |
20296 ns |
1.02 |
array/reverse/2dL |
72889 ns |
72502 ns |
1.01 |
array/reverse/2dL_inplace |
65846 ns |
65556 ns |
1.00 |
array/reverse/2d_inplace |
10049 ns |
9743 ns |
1.03 |
array/sorting/1d |
2735535 ns |
2734808 ns |
1.00 |
array/sorting/2d |
1069343 ns |
1071711 ns |
1.00 |
array/sorting/by |
3304052 ns |
3304331 ns |
1.00 |
cuda/synchronization/context/auto |
1171.7 ns |
1162.7 ns |
1.01 |
cuda/synchronization/context/blocking |
923.5 ns |
928.3666666666667 ns |
0.99 |
cuda/synchronization/context/nonblocking |
7085.6 ns |
6978.1 ns |
1.02 |
cuda/synchronization/stream/auto |
998.9642857142858 ns |
994.2105263157895 ns |
1.00 |
cuda/synchronization/stream/blocking |
804.0309278350516 ns |
840.6543209876543 ns |
0.96 |
cuda/synchronization/stream/nonblocking |
6989.2 ns |
7241.299999999999 ns |
0.97 |
integration/byval/reference |
143828 ns |
143552.5 ns |
1.00 |
integration/byval/slices=1 |
145614 ns |
145385 ns |
1.00 |
integration/byval/slices=2 |
284608 ns |
284202.5 ns |
1.00 |
integration/byval/slices=3 |
423154 ns |
422620 ns |
1.00 |
integration/cudadevrt |
102393 ns |
102213 ns |
1.00 |
integration/volumerhs |
23422519 ns |
23474593 ns |
1.00 |
kernel/indexing |
13226 ns |
12944 ns |
1.02 |
kernel/indexing_checked |
13872 ns |
13691.5 ns |
1.01 |
kernel/launch |
2106.8888888888887 ns |
2106.5555555555557 ns |
1.00 |
kernel/occupancy |
664.8993710691824 ns |
665.54375 ns |
1.00 |
kernel/rand |
14214 ns |
13998 ns |
1.02 |
latency/import |
3821530614.5 ns |
3829156614.5 ns |
1.00 |
latency/precompile |
4585635970.5 ns |
4603348569 ns |
1.00 |
latency/ttfp |
4391840056 ns |
4403053778.5 ns |
1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Member
|
CI failures look related. |
e8402c0 to
6028087
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3119 +/- ##
==========================================
+ Coverage 16.41% 16.43% +0.01%
==========================================
Files 123 123
Lines 9678 9670 -8
==========================================
Hits 1589 1589
+ Misses 8089 8081 -8 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #3024
Summary
This PR fixes a bug where
!rangemetadata on LLVM intrinsics was causing anInt32overflow, which resulted in negative upper bounds and prevented LLVM from optimizing away standard bounds checks.The Cause
LLVM
!rangemetadata expects a half-open interval[start, stop). InCUDACore/src/device/intrinsics/indexing.jl, the exclusive upper bound is calculated as(range.stop + 1) % Int32.For intrinsics with large maximums, like
gridDim_x(2^31 - 1), adding 1 overflows the signed 32-bit integer, resulting in-2147483648. LLVM interprets this as a wrap-around range, effectively destroying bounds-check optimizations.The Fix
Added a safeguard in
_indexto only attach the!rangemetadata ifrange.stop < typemax(Int32). Small-bounded intrinsics (likeblockDim) keep their metadata, while large-bounded ones (likegridDim) safely omit it to prevent the overflow.Testing
Added a unit test in
test/core/device/intrinsics.jlusingCUDA.code_llvm(..., dump_module=true, raw=true)to verify that:!rangeis correctly emitted forblockDim().x.!rangeis safely omitted forgridDim().x.