Fix randn/randexp in complex GPU kernels by replacing recursion with loops #3086

Merged
maleadt merged 1 commit into master from tb/rand
Apr 9, 2026
Conversation

@maleadt (Member) commented Apr 9, 2026

The Ziggurat rejection sampling for `randn` and `randexp` used recursive calls back to `Random.randn`/`Random.randexp` when retrying, creating call cycles that could exhaust the GPU's limited per-thread stack in complex kernels. Replace these with `while` loops, using a `nothing` return from the `@noinline` unlikely helpers to signal retry.
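To illustrate the pattern described above, here is a minimal CPU-side sketch. It is not the code from this PR and it is not a Ziggurat sampler: `try_halfnormal` and `toy_randn` are hypothetical names, and the sampler is a plain rejection scheme (an Exp(1) proposal for a half-normal target) chosen only because it is short. What it shows is the control flow: the `@noinline` helper returns `nothing` to request a retry, and the caller loops instead of calling itself again.

```julia
using Random

# Hypothetical helper (illustration only, not the CUDA.jl implementation):
# one rejection-sampling attempt for a half-normal variate.
# Returns the sample on acceptance, or `nothing` to ask the caller to retry.
@noinline function try_halfnormal(rng::AbstractRNG)
    x = randexp(rng)   # proposal drawn from Exp(1)
    u = rand(rng)      # uniform in [0, 1)
    # Accept with probability exp(-(x - 1)^2 / 2)
    return u < exp(-(x - 1)^2 / 2) ? x : nothing
end

# Loop-based driver: a rejected attempt becomes another loop iteration,
# so there is no call cycle and the stack depth stays constant.
function toy_randn(rng::AbstractRNG)
    while true
        x = try_halfnormal(rng)
        x === nothing && continue
        # Attach a random sign to turn the half-normal into a normal variate
        return rand(rng, Bool) ? x : -x
    end
end
```

A recursive version would instead call `toy_randn(rng)` from the rejection branch. The `x === nothing` check keeps the retry decision in the caller, where it compiles to a simple branch back to the top of the loop.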

Fixes #3028

…loops

The Ziggurat rejection sampling for randn and randexp used recursive calls
back to Random.randn/Random.randexp when retrying, creating call cycles that
could exhaust the GPU's limited per-thread stack in complex kernels. Replace
these with while loops, using a nothing return from the @noinline unlikely
helpers to signal retry.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@codecov Bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.42%. Comparing base (a9a687c) to head (9d845b7).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3086   +/-   ##
=======================================
  Coverage   90.41%   90.42%           
=======================================
  Files         141      141           
  Lines       11993    11993           
=======================================
+ Hits        10844    10845    +1     
+ Misses       1149     1148    -1     


@github-actions Bot (Contributor) left a comment

CUDA.jl Benchmarks

| Benchmark suite | Current: 9d845b7 | Previous: 5f45772 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 101572 ns | 101495 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 76429 ns | 76898 ns | 0.99 |
| array/accumulate/Float32/dims=1L | 1583518.5 ns | 1585143.5 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143396 ns | 143801 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 657681 ns | 657240 ns | 1.00 |
| array/accumulate/Int64/1d | 118629 ns | 118623 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79699 ns | 80572.5 ns | 0.99 |
| array/accumulate/Int64/dims=1L | 1694403.5 ns | 1693852 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 155719 ns | 156484 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 961838 ns | 961603 ns | 1.00 |
| array/broadcast | 20460 ns | 20294 ns | 1.01 |
| array/construct | 1294.5 ns | 1320.4 ns | 0.98 |
| array/copy | 18764 ns | 18780 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 215869 ns | 214684 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 287011 ns | 282072 ns | 1.02 |
| array/copyto!/gpu_to_gpu | 11520 ns | 11361 ns | 1.01 |
| array/iteration/findall/bool | 132048 ns | 131719.5 ns | 1.00 |
| array/iteration/findall/int | 149637 ns | 148883 ns | 1.01 |
| array/iteration/findfirst/bool | 81505 ns | 81470.5 ns | 1.00 |
| array/iteration/findfirst/int | 83783 ns | 83414 ns | 1.00 |
| array/iteration/findmin/1d | 89907 ns | 89419 ns | 1.01 |
| array/iteration/findmin/2d | 117557 ns | 117365 ns | 1.00 |
| array/iteration/logical | 204362.5 ns | 207612 ns | 0.98 |
| array/iteration/scalar | 67784 ns | 66780 ns | 1.02 |
| array/permutedims/2d | 52396 ns | 52471.5 ns | 1.00 |
| array/permutedims/3d | 52862 ns | 53137 ns | 0.99 |
| array/permutedims/4d | 52673 ns | 52429 ns | 1.00 |
| array/random/rand/Float32 | 13221 ns | 13089 ns | 1.01 |
| array/random/rand/Int64 | 34713 ns | 37236 ns | 0.93 |
| array/random/rand!/Float32 | 8426.333333333334 ns | 8527.666666666666 ns | 0.99 |
| array/random/rand!/Int64 | 26934 ns | 34109.5 ns | 0.79 |
| array/random/randn/Float32 | 39585 ns | 38147 ns | 1.04 |
| array/random/randn!/Float32 | 31408.5 ns | 31640 ns | 0.99 |
| array/reductions/mapreduce/Float32/1d | 35584 ns | 34735.5 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=1 | 40893.5 ns | 40760 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 51769.5 ns | 51917 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 56681 ns | 56503.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 69676.5 ns | 69496.5 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 43300 ns | 42820 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1 | 44563 ns | 44181 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1L | 87913.5 ns | 87798 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 59744 ns | 59808 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 85226 ns | 85232 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 35692 ns | 34883 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1 | 40448 ns | 39758 ns | 1.02 |
| array/reductions/reduce/Float32/dims=1L | 52032 ns | 52166 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 57113 ns | 56925 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 70240 ns | 69909 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 43041.5 ns | 42673 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1 | 46569.5 ns | 42123 ns | 1.11 |
| array/reductions/reduce/Int64/dims=1L | 87663 ns | 87782 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 59601 ns | 59551 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 85041 ns | 84796 ns | 1.00 |
| array/reverse/1d | 18568 ns | 18432.5 ns | 1.01 |
| array/reverse/1dL | 69097 ns | 69025 ns | 1.00 |
| array/reverse/1dL_inplace | 65979 ns | 65968 ns | 1.00 |
| array/reverse/1d_inplace | 8624.666666666666 ns | 10240.666666666666 ns | 0.84 |
| array/reverse/2d | 20893 ns | 20709 ns | 1.01 |
| array/reverse/2dL | 73045 ns | 72815 ns | 1.00 |
| array/reverse/2dL_inplace | 66059 ns | 65992 ns | 1.00 |
| array/reverse/2d_inplace | 10249 ns | 11117.5 ns | 0.92 |
| array/sorting/1d | 2735784 ns | 2754859 ns | 0.99 |
| array/sorting/2d | 1069187 ns | 1075967 ns | 0.99 |
| array/sorting/by | 3304331.5 ns | 3328240 ns | 0.99 |
| cuda/synchronization/context/auto | 1202.6 ns | 1192.4 ns | 1.01 |
| cuda/synchronization/context/blocking | 927.060606060606 ns | 947.7391304347826 ns | 0.98 |
| cuda/synchronization/context/nonblocking | 6917.2 ns | 7660.1 ns | 0.90 |
| cuda/synchronization/stream/auto | 1018.2 ns | 1032.5 ns | 0.99 |
| cuda/synchronization/stream/blocking | 831.9883720930233 ns | 841.5588235294117 ns | 0.99 |
| cuda/synchronization/stream/nonblocking | 8341.2 ns | 7189.6 ns | 1.16 |
| integration/byval/reference | 144056 ns | 143997 ns | 1.00 |
| integration/byval/slices=1 | 146082 ns | 145776 ns | 1.00 |
| integration/byval/slices=2 | 284924 ns | 284427 ns | 1.00 |
| integration/byval/slices=3 | 423388 ns | 423129 ns | 1.00 |
| integration/cudadevrt | 102628 ns | 102598 ns | 1.00 |
| integration/volumerhs | 9436436.5 ns | 9429742.5 ns | 1.00 |
| kernel/indexing | 13424 ns | 13331 ns | 1.01 |
| kernel/indexing_checked | 14112 ns | 14116 ns | 1.00 |
| kernel/launch | 2193 ns | 2147 ns | 1.02 |
| kernel/occupancy | 693.86875 ns | 660.5723270440252 ns | 1.05 |
| kernel/rand | 18175 ns | 15598 ns | 1.17 |
| latency/import | 3811113291 ns | 3807044359.5 ns | 1.00 |
| latency/precompile | 4590100515 ns | 4590923492 ns | 1.00 |
| latency/ttfp | 4399572655 ns | 4392969126 ns | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@maleadt merged commit fc09951 into master Apr 9, 2026
2 checks passed
@maleadt deleted the tb/rand branch April 9, 2026 16:09
Development

Successfully merging this pull request may close these issues.

randn gives code 700, ERROR_ILLEGAL_ADDRESS, but not rand