…3065) `Base.FastMath.pow_fast(::IEEEFloat, ::Int32)` emits the `llvm.powi` intrinsic, which the NVPTX backend cannot lower (it has no runtime libcalls). Override with inline small-exponent specializations and fast libdevice fallbacks (`__nv_fast_powf` for Float32, `__nv_powi` for Float64). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vchuravy approved these changes on Apr 14, 2026
CUDA.jl Benchmarks
| Benchmark suite | Current: ceaa278 | Previous: 7a46bf3 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 100053 ns | 100894.5 ns | 0.99 |
| array/accumulate/Float32/dims=1 | 75944.5 ns | 76463 ns | 0.99 |
| array/accumulate/Float32/dims=1L | 1583630 ns | 1585055 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 142636 ns | 143523 ns | 0.99 |
| array/accumulate/Float32/dims=2L | 656828 ns | 656818 ns | 1.00 |
| array/accumulate/Int64/1d | 118052 ns | 118288.5 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79154 ns | 79382 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1694167 ns | 1693416 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 155297 ns | 155618 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 961289 ns | 961230.5 ns | 1.00 |
| array/broadcast | 20163 ns | 20246 ns | 1.00 |
| array/construct | 1251.8 ns | 1266.7 ns | 0.99 |
| array/copy | 18051 ns | 17864 ns | 1.01 |
| array/copyto!/cpu_to_gpu | 212833.5 ns | 213682 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 281079 ns | 280934 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 10758 ns | 10761 ns | 1.00 |
| array/iteration/findall/bool | 134443.5 ns | 134853 ns | 1.00 |
| array/iteration/findall/int | 149111 ns | 149561.5 ns | 1.00 |
| array/iteration/findfirst/bool | 81062.5 ns | 81079 ns | 1.00 |
| array/iteration/findfirst/int | 83079 ns | 83035 ns | 1.00 |
| array/iteration/findmin/1d | 83387 ns | 85177.5 ns | 0.98 |
| array/iteration/findmin/2d | 116504 ns | 116610 ns | 1.00 |
| array/iteration/logical | 197089.5 ns | 198335.5 ns | 0.99 |
| array/iteration/scalar | 68052.5 ns | 66209 ns | 1.03 |
| array/permutedims/2d | 51991 ns | 51867 ns | 1.00 |
| array/permutedims/3d | 52575.5 ns | 52676 ns | 1.00 |
| array/permutedims/4d | 51231.5 ns | 51384 ns | 1.00 |
| array/random/rand/Float32 | 12426 ns | 12428 ns | 1.00 |
| array/random/rand/Int64 | 36410 ns | 36390 ns | 1.00 |
| array/random/rand!/Float32 | 8395 ns | 8508.333333333334 ns | 0.99 |
| array/random/rand!/Int64 | 33906 ns | 34064.5 ns | 1.00 |
| array/random/randn/Float32 | 36883.5 ns | 41681.5 ns | 0.88 |
| array/random/randn!/Float32 | 30714 ns | 30603.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 34143 ns | 34414 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1 | 45465 ns | 40677.5 ns | 1.12 |
| array/reductions/mapreduce/Float32/dims=1L | 51053.5 ns | 50902 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 55876 ns | 56099.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 68830 ns | 69125 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 42576 ns | 41725 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=1 | 50585 ns | 42515 ns | 1.19 |
| array/reductions/mapreduce/Int64/dims=1L | 86969 ns | 86920 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 59351.5 ns | 59341 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 84579 ns | 84456 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 34383 ns | 34580 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1 | 45725 ns | 48846.5 ns | 0.94 |
| array/reductions/reduce/Float32/dims=1L | 51063 ns | 51087 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 56301 ns | 56436 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 69369 ns | 69633 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 42563 ns | 41868 ns | 1.02 |
| array/reductions/reduce/Int64/dims=1 | 50473.5 ns | 49828.5 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1L | 86809 ns | 86965 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 59227 ns | 59126 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84097 ns | 84186 ns | 1.00 |
| array/reverse/1d | 17715 ns | 17569 ns | 1.01 |
| array/reverse/1dL | 68284 ns | 68149 ns | 1.00 |
| array/reverse/1dL_inplace | 65572 ns | 65619 ns | 1.00 |
| array/reverse/1d_inplace | 8362.333333333334 ns | 10156.5 ns | 0.82 |
| array/reverse/2d | 20217 ns | 20487 ns | 0.99 |
| array/reverse/2dL | 72232 ns | 72580.5 ns | 1.00 |
| array/reverse/2dL_inplace | 65611 ns | 65669 ns | 1.00 |
| array/reverse/2d_inplace | 9911 ns | 10363 ns | 0.96 |
| array/sorting/1d | 2734237 ns | 2732994 ns | 1.00 |
| array/sorting/2d | 1068460 ns | 1074852 ns | 0.99 |
| array/sorting/by | 3303186 ns | 3327030 ns | 0.99 |
| cuda/synchronization/context/auto | 1205.3 ns | 1115.1 ns | 1.08 |
| cuda/synchronization/context/blocking | 998.0526315789474 ns | 926.9285714285714 ns | 1.08 |
| cuda/synchronization/context/nonblocking | 6937.8 ns | 7166.9 ns | 0.97 |
| cuda/synchronization/stream/auto | 1028.9 ns | 972.7368421052631 ns | 1.06 |
| cuda/synchronization/stream/blocking | 865.6724137931035 ns | 783.65 ns | 1.10 |
| cuda/synchronization/stream/nonblocking | 8062.6 ns | 7202.5 ns | 1.12 |
| integration/byval/reference | 143582 ns | 143622 ns | 1.00 |
| integration/byval/slices=1 | 145650 ns | 145542 ns | 1.00 |
| integration/byval/slices=2 | 284361 ns | 284225 ns | 1.00 |
| integration/byval/slices=3 | 423033.5 ns | 422875 ns | 1.00 |
| integration/cudadevrt | 102246 ns | 102233 ns | 1.00 |
| integration/volumerhs | 23420457.5 ns | 23473989 ns | 1.00 |
| kernel/indexing | 13358 ns | 13079 ns | 1.02 |
| kernel/indexing_checked | 13847 ns | 13831 ns | 1.00 |
| kernel/launch | 2022.9 ns | 2051.222222222222 ns | 0.99 |
| kernel/occupancy | 667.7861635220125 ns | 709.1025641025641 ns | 0.94 |
| kernel/rand | 14268 ns | 17700 ns | 0.81 |
| latency/import | 3800820720 ns | 3836580868 ns | 0.99 |
| latency/precompile | 4587283028.5 ns | 4602357164.5 ns | 1.00 |
| latency/ttfp | 4387699614.5 ns | 4418505339 ns | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.
Base defines `pow_fast(::FloatType, ::Integer)` in 1.10/1.11 and `pow_fast(::FloatType, ::Int32)` in 1.12+. Using `Integer` matches the former directly and still wins via the overlay table for the latter, covering all supported Julia versions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same reasoning as 9611a3c: without foldable effects, constant expressions like `@fastmath Float32(2)^(-32)` compile to runtime polynomial pow approximation (~60 extra FMAs) instead of being constant-folded at compile time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The device overrides for `^(::Float, ::Int64)` lacked effect annotations, preventing the compiler from constant-folding expressions like `Float32(2)^(-32)`. This compiled to ~60 extra FMAs from runtime polynomial pow approximation. Adding `@assume_effects :foldable` via `@device_override` enables full compile-time constant folding (129 PTX lines / 19 FMA → 48 lines / 0 FMA). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
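The effect of a `:foldable` annotation can be tried on the CPU without a GPU. The following is a minimal sketch, not the code from this PR: `pow_by_squaring` and `folded` are hypothetical names, and the body is a plain square-and-multiply loop standing in for a device override. The loop is something the compiler generally cannot prove terminating on its own, so the annotation is what permits evaluating the call at compile time when the arguments are constants.

```julia
# Hypothetical stand-in for a device pow override; NOT the CUDA.jl implementation.
# `:foldable` promises the compiler that the function is safe to evaluate at
# compile time when all arguments are constants.
Base.@assume_effects :foldable function pow_by_squaring(x::Float32, n::Int)
    m = unsigned(abs(n))      # exponent magnitude; also correct for typemin(Int)
    result = 1.0f0
    base = x
    while m > 0               # square-and-multiply over the bits of m
        isodd(m) && (result *= base)
        base *= base
        m >>= 1
    end
    return n < 0 ? inv(result) : result
end

# Constant arguments: a candidate for compile-time evaluation.
folded() = pow_by_squaring(2.0f0, -32)
```

In a REPL, `@code_typed folded()` is one way to check whether the call was reduced to a constant on a given Julia version.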
Codecov Report

✅ All modified and coverable lines are covered by tests.

| | master | #3098 | +/- |
|---|---|---|---|
| Coverage | 90.43% | 90.42% | -0.01% |
| Files | 141 | 141 | |
| Lines | 12025 | 12025 | |
| Hits | 10875 | 10874 | -1 |
| Misses | 1150 | 1151 | +1 |

`Base.FastMath.pow_fast(::IEEEFloat, ::Int32)` emits the `llvm.powi` intrinsic, which the NVPTX backend cannot lower (it has no runtime libcalls). Override with inline small-exponent specializations and fast libdevice fallbacks (`__nv_fast_powf` for Float32, `__nv_powi` for Float64).

Also annotates both the fast and regular versions with effects (`@assume_effects :foldable`) that allow the compiler to constant-fold them.

Fixes #3065
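
The "inline small-exponent specializations with a fallback" pattern described above can be sketched on the CPU as follows. This is an illustration under assumptions, not the PR's code: `pow_small` is a hypothetical name, the set of special-cased exponents is chosen arbitrarily, and Base's `^` stands in for the libdevice fallback that the real override would call on the GPU.

```julia
# Hypothetical helper; the name and the special-cased exponent range are illustrative.
@inline function pow_small(x::T, n::Integer) where {T<:Union{Float32,Float64}}
    # Small exponents become plain multiplications, so no pow call is emitted.
    n == 0 && return one(T)
    n == 1 && return x
    n == 2 && return x * x
    n == 3 && return x * x * x
    n == -1 && return inv(x)
    n == -2 && return inv(x * x)
    # On the GPU, the PR describes falling back to libdevice here
    # (`__nv_fast_powf` for Float32, `__nv_powi` for Float64).
    # This CPU sketch uses Base's `^` as a stand-in.
    return x^n
end
```

When `n` is a compile-time constant, the comparisons can be resolved statically and only the matching branch remains after inlining.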