
Add device overrides for FastMath.pow_fast with integer exponents #3098

Merged
maleadt merged 4 commits into master from tb/powi on Apr 15, 2026

maleadt (Member) commented Apr 14, 2026

Base.FastMath.pow_fast(::IEEEFloat, ::Int32) emits the llvm.powi intrinsic, which the NVPTX backend cannot lower (it has no runtime libcalls). Override with inline small-exponent specializations and fast libdevice fallbacks (__nv_fast_powf for Float32, __nv_powi for Float64).

Also marks both the fast and regular `^` overrides as foldable, so the compiler can constant-fold calls whose arguments are constants.

Fixes #3065
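
The small-exponent fast path described above can be illustrated with a CPU-only sketch. The name `powi_sketch` is hypothetical, the set of exponents handled inline is an assumption, and the fallback here uses plain Julia `^` where the PR calls libdevice (`__nv_fast_powf` / `__nv_powi`) on the device. It is not the code from this PR.

```julia
# CPU-only sketch of an integer-exponent power with inline small-exponent cases.
# `powi_sketch` is a hypothetical name; the exponents handled inline are assumed.
@inline function powi_sketch(x::T, y::Integer) where {T<:Union{Float32,Float64}}
    y == -1 && return inv(x)
    y == 0 && return one(x)
    y == 1 && return x
    y == 2 && return x * x
    y == 3 && return x * x * x
    # Fallback: on the GPU the PR dispatches to libdevice at this point.
    # On the CPU we convert the exponent and use the floating-point power.
    return x ^ T(y)
end

powi_sketch(3.0f0, Int32(2))  # 9.0f0
powi_sketch(2.0, -1)          # 0.5
```

The inline cases avoid a call into a general-purpose power routine when the exponent is a small constant, which is the common case in user kernels.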

…3065)

`Base.FastMath.pow_fast(::IEEEFloat, ::Int32)` emits the `llvm.powi`
intrinsic, which the NVPTX backend cannot lower (it has no runtime
libcalls). Override with inline small-exponent specializations and
fast libdevice fallbacks (`__nv_fast_powf` for Float32, `__nv_powi`
for Float64).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions (bot, Contributor) left a comment

CUDA.jl Benchmarks

Details
| Benchmark suite | Current: ceaa278 | Previous: 7a46bf3 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 100053 ns | 100894.5 ns | 0.99 |
| array/accumulate/Float32/dims=1 | 75944.5 ns | 76463 ns | 0.99 |
| array/accumulate/Float32/dims=1L | 1583630 ns | 1585055 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 142636 ns | 143523 ns | 0.99 |
| array/accumulate/Float32/dims=2L | 656828 ns | 656818 ns | 1.00 |
| array/accumulate/Int64/1d | 118052 ns | 118288.5 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79154 ns | 79382 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1694167 ns | 1693416 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 155297 ns | 155618 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 961289 ns | 961230.5 ns | 1.00 |
| array/broadcast | 20163 ns | 20246 ns | 1.00 |
| array/construct | 1251.8 ns | 1266.7 ns | 0.99 |
| array/copy | 18051 ns | 17864 ns | 1.01 |
| array/copyto!/cpu_to_gpu | 212833.5 ns | 213682 ns | 1.00 |
| array/copyto!/gpu_to_cpu | 281079 ns | 280934 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 10758 ns | 10761 ns | 1.00 |
| array/iteration/findall/bool | 134443.5 ns | 134853 ns | 1.00 |
| array/iteration/findall/int | 149111 ns | 149561.5 ns | 1.00 |
| array/iteration/findfirst/bool | 81062.5 ns | 81079 ns | 1.00 |
| array/iteration/findfirst/int | 83079 ns | 83035 ns | 1.00 |
| array/iteration/findmin/1d | 83387 ns | 85177.5 ns | 0.98 |
| array/iteration/findmin/2d | 116504 ns | 116610 ns | 1.00 |
| array/iteration/logical | 197089.5 ns | 198335.5 ns | 0.99 |
| array/iteration/scalar | 68052.5 ns | 66209 ns | 1.03 |
| array/permutedims/2d | 51991 ns | 51867 ns | 1.00 |
| array/permutedims/3d | 52575.5 ns | 52676 ns | 1.00 |
| array/permutedims/4d | 51231.5 ns | 51384 ns | 1.00 |
| array/random/rand/Float32 | 12426 ns | 12428 ns | 1.00 |
| array/random/rand/Int64 | 36410 ns | 36390 ns | 1.00 |
| array/random/rand!/Float32 | 8395 ns | 8508.333333333334 ns | 0.99 |
| array/random/rand!/Int64 | 33906 ns | 34064.5 ns | 1.00 |
| array/random/randn/Float32 | 36883.5 ns | 41681.5 ns | 0.88 |
| array/random/randn!/Float32 | 30714 ns | 30603.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 34143 ns | 34414 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1 | 45465 ns | 40677.5 ns | 1.12 |
| array/reductions/mapreduce/Float32/dims=1L | 51053.5 ns | 50902 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 55876 ns | 56099.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2L | 68830 ns | 69125 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 42576 ns | 41725 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=1 | 50585 ns | 42515 ns | 1.19 |
| array/reductions/mapreduce/Int64/dims=1L | 86969 ns | 86920 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 59351.5 ns | 59341 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 84579 ns | 84456 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 34383 ns | 34580 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1 | 45725 ns | 48846.5 ns | 0.94 |
| array/reductions/reduce/Float32/dims=1L | 51063 ns | 51087 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 56301 ns | 56436 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 69369 ns | 69633 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 42563 ns | 41868 ns | 1.02 |
| array/reductions/reduce/Int64/dims=1 | 50473.5 ns | 49828.5 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1L | 86809 ns | 86965 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 59227 ns | 59126 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84097 ns | 84186 ns | 1.00 |
| array/reverse/1d | 17715 ns | 17569 ns | 1.01 |
| array/reverse/1dL | 68284 ns | 68149 ns | 1.00 |
| array/reverse/1dL_inplace | 65572 ns | 65619 ns | 1.00 |
| array/reverse/1d_inplace | 8362.333333333334 ns | 10156.5 ns | 0.82 |
| array/reverse/2d | 20217 ns | 20487 ns | 0.99 |
| array/reverse/2dL | 72232 ns | 72580.5 ns | 1.00 |
| array/reverse/2dL_inplace | 65611 ns | 65669 ns | 1.00 |
| array/reverse/2d_inplace | 9911 ns | 10363 ns | 0.96 |
| array/sorting/1d | 2734237 ns | 2732994 ns | 1.00 |
| array/sorting/2d | 1068460 ns | 1074852 ns | 0.99 |
| array/sorting/by | 3303186 ns | 3327030 ns | 0.99 |
| cuda/synchronization/context/auto | 1205.3 ns | 1115.1 ns | 1.08 |
| cuda/synchronization/context/blocking | 998.0526315789474 ns | 926.9285714285714 ns | 1.08 |
| cuda/synchronization/context/nonblocking | 6937.8 ns | 7166.9 ns | 0.97 |
| cuda/synchronization/stream/auto | 1028.9 ns | 972.7368421052631 ns | 1.06 |
| cuda/synchronization/stream/blocking | 865.6724137931035 ns | 783.65 ns | 1.10 |
| cuda/synchronization/stream/nonblocking | 8062.6 ns | 7202.5 ns | 1.12 |
| integration/byval/reference | 143582 ns | 143622 ns | 1.00 |
| integration/byval/slices=1 | 145650 ns | 145542 ns | 1.00 |
| integration/byval/slices=2 | 284361 ns | 284225 ns | 1.00 |
| integration/byval/slices=3 | 423033.5 ns | 422875 ns | 1.00 |
| integration/cudadevrt | 102246 ns | 102233 ns | 1.00 |
| integration/volumerhs | 23420457.5 ns | 23473989 ns | 1.00 |
| kernel/indexing | 13358 ns | 13079 ns | 1.02 |
| kernel/indexing_checked | 13847 ns | 13831 ns | 1.00 |
| kernel/launch | 2022.9 ns | 2051.222222222222 ns | 0.99 |
| kernel/occupancy | 667.7861635220125 ns | 709.1025641025641 ns | 0.94 |
| kernel/rand | 14268 ns | 17700 ns | 0.81 |
| latency/import | 3800820720 ns | 3836580868 ns | 0.99 |
| latency/precompile | 4587283028.5 ns | 4602357164.5 ns | 1.00 |
| latency/ttfp | 4387699614.5 ns | 4418505339 ns | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.

maleadt and others added 3 commits April 15, 2026 11:13
Base defines `pow_fast(::FloatType, ::Integer)` in 1.10/1.11 and
`pow_fast(::FloatType, ::Int32)` in 1.12+. Using `Integer` matches
the former directly and still wins via the overlay table for the
latter, covering all supported Julia versions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
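
To make the signature choice concrete, here is a minimal sketch using ordinary dispatch on the CPU. `pow_override_sketch` is a hypothetical stand-in, not a CUDA.jl function. It only shows that one `Integer` signature matches both `Int32` and `Int64` exponents. In ordinary dispatch a more specific `Int32` method would take precedence, which is why the commit message relies on the overlay table to make the broader override win on 1.12+.

```julia
# Hypothetical stand-in for the device override; not a CUDA.jl name.
pow_override_sketch(x::Union{Float32,Float64}, y::Integer) = x ^ y

# A single `Integer` signature matches every concrete integer exponent type.
hasmethod(pow_override_sketch, Tuple{Float32, Int32})  # true
hasmethod(pow_override_sketch, Tuple{Float64, Int64})  # true

# To see which signatures Base defines on the running Julia version:
methods(Base.FastMath.pow_fast)
```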
Same reasoning as 9611a3c: without foldable effects, constant
expressions like `@fastmath Float32(2)^(-32)` compile to runtime
polynomial pow approximation (~60 extra FMAs) instead of being
constant-folded at compile time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The device overrides for `^(::Float, ::Int64)` lacked effect annotations,
preventing the compiler from constant-folding expressions like
`Float32(2)^(-32)`. This compiled to ~60 extra FMAs from runtime
polynomial pow approximation.

Adding `@assume_effects :foldable` via `@device_override` enables full
compile-time constant folding (129 PTX lines / 19 FMA → 48 lines / 0 FMA).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
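
As a rough CPU-side illustration of the effect annotation, the sketch below marks a function as foldable with `Base.@assume_effects` (available since Julia 1.8). `pow_foldable_sketch` and `folded_constant` are hypothetical names, and the function body is an assumption. In CUDA.jl the annotation is applied through `@device_override`, as the commit message says.

```julia
# Hypothetical helper; the body is an assumed stand-in for the device override.
Base.@assume_effects :foldable function pow_foldable_sketch(x::Float32, y::Int64)
    y == 0 && return one(x)
    y == 1 && return x
    return x ^ Float32(y)
end

# Both arguments are constants, so the compiler is allowed to evaluate the
# call at compile time instead of emitting a runtime pow.
folded_constant() = pow_foldable_sketch(Float32(2), -32)

# Inspect the optimized IR; if folding happened, the body reduces to
# returning a constant.
code_typed(folded_constant, Tuple{})
```

`:foldable` is a promise made to the compiler rather than something it verifies, so it is only safe on functions that really are free of side effects and always terminate for constant inputs.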

codecov Bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.42%. Comparing base (01a0795) to head (ceaa278).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3098      +/-   ##
==========================================
- Coverage   90.43%   90.42%   -0.01%     
==========================================
  Files         141      141              
  Lines       12025    12025              
==========================================
- Hits        10875    10874       -1     
- Misses       1150     1151       +1     

maleadt merged commit 5df05e4 into master on Apr 15, 2026
2 checks passed
maleadt deleted the tb/powi branch on April 15, 2026 13:02

Linked issue: Base.FastMath.pow_fast fails to compile with integer exponent