Add device overrides for `FastMath.pow_fast` with integer exponents by maleadt · Pull Request #3098 · JuliaGPU/CUDA.jl

maleadt · 2026-04-14T14:19:53Z

Base.FastMath.pow_fast(::IEEEFloat, ::Int32) emits the llvm.powi intrinsic, which the NVPTX backend cannot lower (it has no runtime libcalls). Override with inline small-exponent specializations and fast libdevice fallbacks (__nv_fast_powf for Float32, __nv_powi for Float64).

Also annotates both the fast-math and the regular integer-power overrides with effects (`@assume_effects :foldable`) so that the compiler can constant-fold them.

Fixes #3065
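The "inline small-exponent specializations plus fallback" shape described above can be sketched on the CPU roughly as follows. This is only an illustration under stated assumptions: `pow_sketch` is a hypothetical name, the set of special-cased exponents is a guess, and the generic `x ^ y` fallback merely stands in for the libdevice calls the description mentions (`__nv_fast_powf`, `__nv_powi`). It is not the code added by this PR.

```julia
# Hypothetical CPU-side illustration; not the override added in this PR.
@inline function pow_sketch(x::T, y::Integer) where {T<:Union{Float32,Float64}}
    # Small exponents are expanded inline so that no call is emitted at all.
    y == -1 && return inv(x)
    y == 0 && return one(T)
    y == 1 && return x
    y == 2 && return x * x
    y == 3 && return x * x * x
    # Generic fallback. On the device this role would be played by a
    # libdevice function instead of the `llvm.powi` intrinsic.
    return x ^ y
end
```

For example, `pow_sketch(2.0f0, 3)` takes the inline `x * x * x` branch, while `pow_sketch(2.0f0, Int32(10))` goes through the fallback.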

…3065) `Base.FastMath.pow_fast(::IEEEFloat, ::Int32)` emits the `llvm.powi` intrinsic, which the NVPTX backend cannot lower (it has no runtime libcalls). Override with inline small-exponent specializations and fast libdevice fallbacks (`__nv_fast_powf` for Float32, `__nv_powi` for Float64). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions

CUDA.jl Benchmarks


Benchmark suite	Current: `ceaa278`	Previous: `7a46bf3`	Ratio
`array/accumulate/Float32/1d`	`100053` ns	`100894.5` ns	`0.99`
`array/accumulate/Float32/dims=1`	`75944.5` ns	`76463` ns	`0.99`
`array/accumulate/Float32/dims=1L`	`1583630` ns	`1585055` ns	`1.00`
`array/accumulate/Float32/dims=2`	`142636` ns	`143523` ns	`0.99`
`array/accumulate/Float32/dims=2L`	`656828` ns	`656818` ns	`1.00`
`array/accumulate/Int64/1d`	`118052` ns	`118288.5` ns	`1.00`
`array/accumulate/Int64/dims=1`	`79154` ns	`79382` ns	`1.00`
`array/accumulate/Int64/dims=1L`	`1694167` ns	`1693416` ns	`1.00`
`array/accumulate/Int64/dims=2`	`155297` ns	`155618` ns	`1.00`
`array/accumulate/Int64/dims=2L`	`961289` ns	`961230.5` ns	`1.00`
`array/broadcast`	`20163` ns	`20246` ns	`1.00`
`array/construct`	`1251.8` ns	`1266.7` ns	`0.99`
`array/copy`	`18051` ns	`17864` ns	`1.01`
`array/copyto!/cpu_to_gpu`	`212833.5` ns	`213682` ns	`1.00`
`array/copyto!/gpu_to_cpu`	`281079` ns	`280934` ns	`1.00`
`array/copyto!/gpu_to_gpu`	`10758` ns	`10761` ns	`1.00`
`array/iteration/findall/bool`	`134443.5` ns	`134853` ns	`1.00`
`array/iteration/findall/int`	`149111` ns	`149561.5` ns	`1.00`
`array/iteration/findfirst/bool`	`81062.5` ns	`81079` ns	`1.00`
`array/iteration/findfirst/int`	`83079` ns	`83035` ns	`1.00`
`array/iteration/findmin/1d`	`83387` ns	`85177.5` ns	`0.98`
`array/iteration/findmin/2d`	`116504` ns	`116610` ns	`1.00`
`array/iteration/logical`	`197089.5` ns	`198335.5` ns	`0.99`
`array/iteration/scalar`	`68052.5` ns	`66209` ns	`1.03`
`array/permutedims/2d`	`51991` ns	`51867` ns	`1.00`
`array/permutedims/3d`	`52575.5` ns	`52676` ns	`1.00`
`array/permutedims/4d`	`51231.5` ns	`51384` ns	`1.00`
`array/random/rand/Float32`	`12426` ns	`12428` ns	`1.00`
`array/random/rand/Int64`	`36410` ns	`36390` ns	`1.00`
`array/random/rand!/Float32`	`8395` ns	`8508.333333333334` ns	`0.99`
`array/random/rand!/Int64`	`33906` ns	`34064.5` ns	`1.00`
`array/random/randn/Float32`	`36883.5` ns	`41681.5` ns	`0.88`
`array/random/randn!/Float32`	`30714` ns	`30603.5` ns	`1.00`
`array/reductions/mapreduce/Float32/1d`	`34143` ns	`34414` ns	`0.99`
`array/reductions/mapreduce/Float32/dims=1`	`45465` ns	`40677.5` ns	`1.12`
`array/reductions/mapreduce/Float32/dims=1L`	`51053.5` ns	`50902` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=2`	`55876` ns	`56099.5` ns	`1.00`
`array/reductions/mapreduce/Float32/dims=2L`	`68830` ns	`69125` ns	`1.00`
`array/reductions/mapreduce/Int64/1d`	`42576` ns	`41725` ns	`1.02`
`array/reductions/mapreduce/Int64/dims=1`	`50585` ns	`42515` ns	`1.19`
`array/reductions/mapreduce/Int64/dims=1L`	`86969` ns	`86920` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2`	`59351.5` ns	`59341` ns	`1.00`
`array/reductions/mapreduce/Int64/dims=2L`	`84579` ns	`84456` ns	`1.00`
`array/reductions/reduce/Float32/1d`	`34383` ns	`34580` ns	`0.99`
`array/reductions/reduce/Float32/dims=1`	`45725` ns	`48846.5` ns	`0.94`
`array/reductions/reduce/Float32/dims=1L`	`51063` ns	`51087` ns	`1.00`
`array/reductions/reduce/Float32/dims=2`	`56301` ns	`56436` ns	`1.00`
`array/reductions/reduce/Float32/dims=2L`	`69369` ns	`69633` ns	`1.00`
`array/reductions/reduce/Int64/1d`	`42563` ns	`41868` ns	`1.02`
`array/reductions/reduce/Int64/dims=1`	`50473.5` ns	`49828.5` ns	`1.01`
`array/reductions/reduce/Int64/dims=1L`	`86809` ns	`86965` ns	`1.00`
`array/reductions/reduce/Int64/dims=2`	`59227` ns	`59126` ns	`1.00`
`array/reductions/reduce/Int64/dims=2L`	`84097` ns	`84186` ns	`1.00`
`array/reverse/1d`	`17715` ns	`17569` ns	`1.01`
`array/reverse/1dL`	`68284` ns	`68149` ns	`1.00`
`array/reverse/1dL_inplace`	`65572` ns	`65619` ns	`1.00`
`array/reverse/1d_inplace`	`8362.333333333334` ns	`10156.5` ns	`0.82`
`array/reverse/2d`	`20217` ns	`20487` ns	`0.99`
`array/reverse/2dL`	`72232` ns	`72580.5` ns	`1.00`
`array/reverse/2dL_inplace`	`65611` ns	`65669` ns	`1.00`
`array/reverse/2d_inplace`	`9911` ns	`10363` ns	`0.96`
`array/sorting/1d`	`2734237` ns	`2732994` ns	`1.00`
`array/sorting/2d`	`1068460` ns	`1074852` ns	`0.99`
`array/sorting/by`	`3303186` ns	`3327030` ns	`0.99`
`cuda/synchronization/context/auto`	`1205.3` ns	`1115.1` ns	`1.08`
`cuda/synchronization/context/blocking`	`998.0526315789474` ns	`926.9285714285714` ns	`1.08`
`cuda/synchronization/context/nonblocking`	`6937.8` ns	`7166.9` ns	`0.97`
`cuda/synchronization/stream/auto`	`1028.9` ns	`972.7368421052631` ns	`1.06`
`cuda/synchronization/stream/blocking`	`865.6724137931035` ns	`783.65` ns	`1.10`
`cuda/synchronization/stream/nonblocking`	`8062.6` ns	`7202.5` ns	`1.12`
`integration/byval/reference`	`143582` ns	`143622` ns	`1.00`
`integration/byval/slices=1`	`145650` ns	`145542` ns	`1.00`
`integration/byval/slices=2`	`284361` ns	`284225` ns	`1.00`
`integration/byval/slices=3`	`423033.5` ns	`422875` ns	`1.00`
`integration/cudadevrt`	`102246` ns	`102233` ns	`1.00`
`integration/volumerhs`	`23420457.5` ns	`23473989` ns	`1.00`
`kernel/indexing`	`13358` ns	`13079` ns	`1.02`
`kernel/indexing_checked`	`13847` ns	`13831` ns	`1.00`
`kernel/launch`	`2022.9` ns	`2051.222222222222` ns	`0.99`
`kernel/occupancy`	`667.7861635220125` ns	`709.1025641025641` ns	`0.94`
`kernel/rand`	`14268` ns	`17700` ns	`0.81`
`latency/import`	`3800820720` ns	`3836580868` ns	`0.99`
`latency/precompile`	`4587283028.5` ns	`4602357164.5` ns	`1.00`
`latency/ttfp`	`4387699614.5` ns	`4418505339` ns	`0.99`

This comment was automatically generated by workflow using github-action-benchmark.

Base defines `pow_fast(::FloatType, ::Integer)` in 1.10/1.11 and `pow_fast(::FloatType, ::Int32)` in 1.12+. Using `Integer` matches the former directly and still wins via the overlay table for the latter, covering all supported Julia versions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
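The dispatch half of that argument can be checked with plain Julia: a method declared on `Integer` applies to `Int32` and `Int64` exponents alike. The sketch below uses a hypothetical function name (`pow_override`) and does not demonstrate the overlay-table precedence the commit message relies on, since that needs the GPU compiler.

```julia
# Hypothetical stand-in for an override; `pow_override` is not a CUDA.jl name.
pow_override(x::Union{Float32,Float64}, y::Integer) = :integer_method

# A signature declared on `Integer` accepts any concrete integer exponent type,
# including the `Int32` used by the newer Base definition.
@assert Int32 <: Integer && Int64 <: Integer
@assert pow_override(2.0f0, Int32(3)) === :integer_method
@assert pow_override(2.0, 3) === :integer_method
@assert hasmethod(pow_override, Tuple{Float32,Int32})
```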

Same reasoning as 9611a3c: without foldable effects, constant expressions like `@fastmath Float32(2)^(-32)` compile to runtime polynomial pow approximation (~60 extra FMAs) instead of being constant-folded at compile time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The device overrides for `^(::Float, ::Int64)` lacked effect annotations, preventing the compiler from constant-folding expressions like `Float32(2)^(-32)`. This compiled to ~60 extra FMAs from runtime polynomial pow approximation. Adding `@assume_effects :foldable` via `@device_override` enables full compile-time constant folding (129 PTX lines / 19 FMA → 48 lines / 0 FMA). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
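As a rough CPU-side illustration of what the effect annotation buys, the sketch below marks a hand-written power loop as `:foldable` and calls it with literal arguments. `pow_loop` and `two_pow_minus_32` are hypothetical names, and the annotation is applied directly with `Base.@assume_effects` rather than through CUDA.jl's `@device_override`. Whether the optimized IR actually shows a literal constant depends on the Julia version, so no output is claimed here.

```julia
# Hypothetical helper, not from CUDA.jl: integer power by repeated multiplication.
# `:foldable` promises the compiler that the call is consistent, effect-free and
# terminating, which makes it eligible for compile-time evaluation.
Base.@assume_effects :foldable function pow_loop(x::Float32, n::Int)
    base = n < 0 ? inv(x) : x
    acc = 1.0f0
    for _ in 1:abs(n)
        acc *= base
    end
    return acc
end

# Both arguments are literals, so this call is a candidate for constant folding.
two_pow_minus_32() = pow_loop(2.0f0, -32)

# To inspect the optimized IR, one could run:
# display(code_typed(two_pow_minus_32, Tuple{}))
```

The annotation is a promise rather than a check, so it should only be placed on functions whose result really depends on nothing but their arguments.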

codecov · 2026-04-15T12:22:41Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.42%. Comparing base (01a0795) to head (ceaa278).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3098      +/-   ##
==========================================
- Coverage   90.43%   90.42%   -0.01%     
==========================================
  Files         141      141              
  Lines       12025    12025              
==========================================
- Hits        10875    10874       -1     
- Misses       1150     1151       +1


vchuravy approved these changes Apr 14, 2026

github-actions Bot reviewed Apr 14, 2026

maleadt and others added 3 commits April 15, 2026 11:13

maleadt merged commit 5df05e4 into master Apr 15, 2026
2 checks passed

maleadt deleted the tb/powi branch April 15, 2026 13:02

maleadt mentioned this pull request Apr 21, 2026

Use an extra bit of entropy in the Float64 uniform 0-1 distribution. JuliaGPU/GPUArrays.jl#712

Closed

maleadt merged 4 commits into master from tb/powi
