…loops

The Ziggurat rejection sampling for randn and randexp used recursive calls back to Random.randn/Random.randexp when retrying, creating call cycles that could exhaust the GPU's limited per-thread stack in complex kernels. Replace these with while loops, using a nothing return from the @noinline unlikely helpers to signal retry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report

✅ All modified and coverable lines are covered by tests.

Coverage Diff

| | master | #3086 | +/- |
|---|---|---|---|
| Coverage | 90.41% | 90.42% | |
| Files | 141 | 141 | |
| Lines | 11993 | 11993 | |
| Hits | 10844 | 10845 | +1 |
| Misses | 1149 | 1148 | -1 |
CUDA.jl Benchmarks
| Benchmark suite | Current: 9d845b7 | Previous: 5f45772 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 101572 ns | 101495 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 76429 ns | 76898 ns | 0.99 |
| array/accumulate/Float32/dims=1L | 1583518.5 ns | 1585143.5 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143396 ns | 143801 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 657681 ns | 657240 ns | 1.00 |
| array/accumulate/Int64/1d | 118629 ns | 118623 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79699 ns | 80572.5 ns | 0.99 |
| array/accumulate/Int64/dims=1L | 1694403.5 ns | 1693852 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 155719 ns | 156484 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 961838 ns | 961603 ns | 1.00 |
| array/broadcast | 20460 ns | 20294 ns | 1.01 |
| array/construct | 1294.5 ns | 1320.4 ns | 0.98 |
| array/copy | 18764 ns | 18780 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 215869 ns | 214684 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 287011 ns | 282072 ns | 1.02 |
| array/copyto!/gpu_to_gpu | 11520 ns | 11361 ns | 1.01 |
| array/iteration/findall/bool | 132048 ns | 131719.5 ns | 1.00 |
| array/iteration/findall/int | 149637 ns | 148883 ns | 1.01 |
| array/iteration/findfirst/bool | 81505 ns | 81470.5 ns | 1.00 |
| array/iteration/findfirst/int | 83783 ns | 83414 ns | 1.00 |
| array/iteration/findmin/1d | 89907 ns | 89419 ns | 1.01 |
| array/iteration/findmin/2d | 117557 ns | 117365 ns | 1.00 |
| array/iteration/logical | 204362.5 ns | 207612 ns | 0.98 |
| array/iteration/scalar | 67784 ns | 66780 ns | 1.02 |
| array/permutedims/2d | 52396 ns | 52471.5 ns | 1.00 |
| array/permutedims/3d | 52862 ns | 53137 ns | 0.99 |
| array/permutedims/4d | 52673 ns | 52429 ns | 1.00 |
| array/random/rand/Float32 | 13221 ns | 13089 ns | 1.01 |
| array/random/rand/Int64 | 34713 ns | 37236 ns | 0.93 |
| array/random/rand!/Float32 | 8426.333333333334 ns | 8527.666666666666 ns | 0.99 |
| array/random/rand!/Int64 | 26934 ns | 34109.5 ns | 0.79 |
| array/random/randn/Float32 | 39585 ns | 38147 ns | 1.04 |
| array/random/randn!/Float32 | 31408.5 ns | 31640 ns | 0.99 |
| array/reductions/mapreduce/Float32/1d | 35584 ns | 34735.5 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=1 | 40893.5 ns | 40760 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 51769.5 ns | 51917 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 56681 ns | 56503.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 69676.5 ns | 69496.5 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 43300 ns | 42820 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1 | 44563 ns | 44181 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1L | 87913.5 ns | 87798 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 59744 ns | 59808 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 85226 ns | 85232 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 35692 ns | 34883 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1 | 40448 ns | 39758 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1L | 52032 ns | 52166 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 57113 ns | 56925 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 70240 ns | 69909 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 43041.5 ns | 42673 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 46569.5 ns | 42123 ns | 1.11 |
| array/reductions/reduce/Int64/dims=1L | 87663 ns | 87782 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 59601 ns | 59551 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 85041 ns | 84796 ns | 1.00 |
| array/reverse/1d | 18568 ns | 18432.5 ns | 1.01 |
| array/reverse/1dL | 69097 ns | 69025 ns | 1.00 |
| array/reverse/1dL_inplace | 65979 ns | 65968 ns | 1.00 |
| array/reverse/1d_inplace | 8624.666666666666 ns | 10240.666666666666 ns | 0.84 |
| array/reverse/2d | 20893 ns | 20709 ns | 1.01 |
| array/reverse/2dL | 73045 ns | 72815 ns | 1.00 |
| array/reverse/2dL_inplace | 66059 ns | 65992 ns | 1.00 |
| array/reverse/2d_inplace | 10249 ns | 11117.5 ns | 0.92 |
| array/sorting/1d | 2735784 ns | 2754859 ns | 0.99 |
| array/sorting/2d | 1069187 ns | 1075967 ns | 0.99 |
| array/sorting/by | 3304331.5 ns | 3328240 ns | 0.99 |
| cuda/synchronization/context/auto | 1202.6 ns | 1192.4 ns | 1.01 |
| cuda/synchronization/context/blocking | 927.060606060606 ns | 947.7391304347826 ns | 0.98 |
| cuda/synchronization/context/nonblocking | 6917.2 ns | 7660.1 ns | 0.90 |
| cuda/synchronization/stream/auto | 1018.2 ns | 1032.5 ns | 0.99 |
| cuda/synchronization/stream/blocking | 831.9883720930233 ns | 841.5588235294117 ns | 0.99 |
| cuda/synchronization/stream/nonblocking | 8341.2 ns | 7189.6 ns | 1.16 |
| integration/byval/reference | 144056 ns | 143997 ns | 1.00 |
| integration/byval/slices=1 | 146082 ns | 145776 ns | 1.00 |
| integration/byval/slices=2 | 284924 ns | 284427 ns | 1.00 |
| integration/byval/slices=3 | 423388 ns | 423129 ns | 1.00 |
| integration/cudadevrt | 102628 ns | 102598 ns | 1.00 |
| integration/volumerhs | 9436436.5 ns | 9429742.5 ns | 1.00 |
| kernel/indexing | 13424 ns | 13331 ns | 1.01 |
| kernel/indexing_checked | 14112 ns | 14116 ns | 1.00 |
| kernel/launch | 2193 ns | 2147 ns | 1.02 |
| kernel/occupancy | 693.86875 ns | 660.5723270440252 ns | 1.05 |
| kernel/rand | 18175 ns | 15598 ns | 1.17 |
| latency/import | 3811113291 ns | 3807044359.5 ns | 1.00 |
| latency/precompile | 4590100515 ns | 4590923492 ns | 1.00 |
| latency/ttfp | 4399572655 ns | 4392969126 ns | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.
The Ziggurat rejection sampling for `randn` and `randexp` used recursive calls back to `Random.randn`/`Random.randexp` when retrying, creating call cycles that could exhaust the GPU's limited per-thread stack in complex kernels. Replace these with while loops, using a `nothing` return from the `@noinline` unlikely helpers to signal retry.

Fixes #3028
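
For readers unfamiliar with the pattern, here is a minimal CPU-only sketch of a retry loop driven by a `nothing` return. It is not the code from this PR and does not use Ziggurat tables: it assumes a much simpler rejection sampler (an exponential proposal for the half-normal), and the names `try_sample` and `sample_loop` are hypothetical, chosen only for illustration.

```julia
using Random

# Hypothetical stand-in for an "unlikely" slow-path helper.
# Returns a sample on acceptance, or `nothing` to ask the caller to retry.
# A recursive design would instead call the sampler again from here,
# which creates a call cycle between the sampler and its helper.
@noinline function try_sample(rng::AbstractRNG)
    x = randexp(rng)                        # proposal drawn from Exp(1)
    if rand(rng) <= exp(-(x - 1)^2 / 2)     # acceptance test for the half-normal
        return rand(rng, Bool) ? x : -x     # attach a random sign
    end
    return nothing                          # rejected: signal retry
end

# Loop-based driver: retries are iterations, so the call depth stays constant
# no matter how many proposals are rejected.
function sample_loop(rng::AbstractRNG)
    while true
        r = try_sample(rng)
        r === nothing || return r
    end
end
```

With this shape the helper never calls back into the sampler, so the call graph has no cycle and stack usage does not depend on the number of rejections. A call such as `sample_loop(MersenneTwister(1))` returns one approximately standard-normal `Float64`.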