Add feature_set kwarg for selecting PTX target suffix #3124

Open

AntonOresten wants to merge 5 commits into JuliaGPU:master from AntonOresten:feature-set

Conversation

AntonOresten (Contributor) commented Apr 30, 2026

Adds support for family-specific (sm_NNNf) and architecture-specific (sm_NNNa) PTX targets, enabling access to a wider set of low-level instructions. Builds on #3120; PTXCompilerTarget has no field for the suffix, so we rewrite .target and pass the matching --gpu-name to ptxas, sidestepping gaps in LLVM's NVPTX coverage.

NVIDIA defines three feature sets:

  • baseline (no suffix, e.g. sm_90): forward-compatible.
  • family (f suffix, e.g. sm_100f): portable within the same major family. Requires CC ≥ 10.0 and PTX ISA ≥ 8.8.
  • architecture (a suffix, e.g. sm_90a): locked to one exact CC. Requires CC ≥ 9.0 and PTX ISA ≥ 8.0.

with hierarchy baseline ⊆ family ⊆ architecture.
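
For concreteness, a minimal sketch of how a feature-set choice could map to the PTX target name handed to ptxas via --gpu-name; the function and its structure are illustrative, not the PR's actual code:

```julia
# Illustrative sketch only: map a compute capability plus a feature_set
# Symbol to the PTX target name passed to ptxas via --gpu-name.
function ptx_target_name(cap::VersionNumber, feature_set::Symbol)
    base = "sm_$(cap.major)$(cap.minor)"
    feature_set === :baseline && return base        # forward-compatible
    feature_set === :family   && return base * "f"  # needs CC ≥ 10.0, PTX ISA ≥ 8.8
    feature_set === :arch     && return base * "a"  # needs CC ≥ 9.0, PTX ISA ≥ 8.0
    throw(ArgumentError("unknown feature_set: $feature_set"))
end

ptx_target_name(v"9.0", :arch)     # "sm_90a"
ptx_target_name(v"10.0", :family)  # "sm_100f"
```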

The previous behavior remains the default through :baseline. To unblock wgmma, tcgen05, and friends, explicit opt-in is required through a new feature_set kwarg on cufunction, @cuda, etc., which takes a Symbol.
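
As a usage sketch (the kernel and launch configuration are made up; feature_set is the kwarg this PR adds, with :architecture later shortened to :arch, per the comment below):

```julia
using CUDA

function vadd!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    @inbounds c[i] = a[i] + b[i]
    return
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = similar(a)

# Default: baseline feature set (e.g. sm_90), forward-compatible.
@cuda threads=256 blocks=4 vadd!(c, a, b)

# Explicit opt-in to the architecture-specific target (e.g. sm_90a),
# unlocking instructions like wgmma on supporting hardware.
@cuda threads=256 blocks=4 feature_set=:arch vadd!(c, a, b)
```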

Relevant docs:

Supersedes #3122 (renamed branch; seems beyond rescue).

github-actions (bot) left a comment


CUDA.jl Benchmarks

Benchmark suite Current: a5bbd52 Previous: d627b3e Ratio
array/accumulate/Float32/1d 100948 ns 100853 ns 1.00
array/accumulate/Float32/dims=1 77293 ns 76362 ns 1.01
array/accumulate/Float32/dims=1L 1585549 ns 1585357 ns 1.00
array/accumulate/Float32/dims=2 143961 ns 143506 ns 1.00
array/accumulate/Float32/dims=2L 657709 ns 657416 ns 1.00
array/accumulate/Int64/1d 119018.5 ns 118459.5 ns 1.00
array/accumulate/Int64/dims=1 80021 ns 79599 ns 1.01
array/accumulate/Int64/dims=1L 1694727 ns 1694409 ns 1.00
array/accumulate/Int64/dims=2 156115 ns 155450.5 ns 1.00
array/accumulate/Int64/dims=2L 961658 ns 960958 ns 1.00
array/broadcast 20478 ns 20282 ns 1.01
array/construct 1257.4 ns 1328.2 ns 0.95
array/copy 17904 ns 17911 ns 1.00
array/copyto!/cpu_to_gpu 214035 ns 212984 ns 1.00
array/copyto!/gpu_to_cpu 283301 ns 280259 ns 1.01
array/copyto!/gpu_to_gpu 10796 ns 10684 ns 1.01
array/iteration/findall/bool 134546 ns 134327 ns 1.00
array/iteration/findall/int 149759 ns 149222 ns 1.00
array/iteration/findfirst/bool 81579.5 ns 81163 ns 1.01
array/iteration/findfirst/int 83649.5 ns 82984 ns 1.01
array/iteration/findmin/1d 84753 ns 85430.5 ns 0.99
array/iteration/findmin/2d 116861 ns 116921 ns 1.00
array/iteration/logical 200246.5 ns 196920 ns 1.02
array/iteration/scalar 67181 ns 67453 ns 1.00
array/permutedims/2d 52272 ns 52024.5 ns 1.00
array/permutedims/3d 53031 ns 52690 ns 1.01
array/permutedims/4d 51658 ns 51921.5 ns 0.99
array/random/rand/Float32 12992 ns 12625 ns 1.03
array/random/rand/Int64 25064 ns 24502 ns 1.02
array/random/rand!/Float32 9158.666666666666 ns 8842 ns 1.04
array/random/rand!/Int64 21818 ns 21520 ns 1.01
array/random/randn/Float32 43441.5 ns 38339 ns 1.13
array/random/randn!/Float32 30666 ns 30558 ns 1.00
array/reductions/mapreduce/Float32/1d 34244 ns 34493 ns 0.99
array/reductions/mapreduce/Float32/dims=1 39467.5 ns 39238 ns 1.01
array/reductions/mapreduce/Float32/dims=1L 51284.5 ns 51115 ns 1.00
array/reductions/mapreduce/Float32/dims=2 56416 ns 56362 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 69587.5 ns 68855 ns 1.01
array/reductions/mapreduce/Int64/1d 42688 ns 41840 ns 1.02
array/reductions/mapreduce/Int64/dims=1 42730 ns 43041.5 ns 0.99
array/reductions/mapreduce/Int64/dims=1L 87079 ns 87062 ns 1.00
array/reductions/mapreduce/Int64/dims=2 59466 ns 59400.5 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 84461 ns 84650 ns 1.00
array/reductions/reduce/Float32/1d 34678 ns 34748 ns 1.00
array/reductions/reduce/Float32/dims=1 39520.5 ns 42567 ns 0.93
array/reductions/reduce/Float32/dims=1L 51380 ns 51269 ns 1.00
array/reductions/reduce/Float32/dims=2 56874 ns 56421.5 ns 1.01
array/reductions/reduce/Float32/dims=2L 69919 ns 69423 ns 1.01
array/reductions/reduce/Int64/1d 42499 ns 41786 ns 1.02
array/reductions/reduce/Int64/dims=1 42508.5 ns 41701 ns 1.02
array/reductions/reduce/Int64/dims=1L 86990 ns 87004 ns 1.00
array/reductions/reduce/Int64/dims=2 59418 ns 59173 ns 1.00
array/reductions/reduce/Int64/dims=2L 84416.5 ns 84474 ns 1.00
array/reverse/1d 17777 ns 17689 ns 1.00
array/reverse/1dL 68390 ns 68239 ns 1.00
array/reverse/1dL_inplace 65667 ns 65528 ns 1.00
array/reverse/1d_inplace 10290.666666666666 ns 8294.666666666666 ns 1.24
array/reverse/2d 20830 ns 20296 ns 1.03
array/reverse/2dL 72840 ns 72502 ns 1.00
array/reverse/2dL_inplace 65730 ns 65556 ns 1.00
array/reverse/2d_inplace 10515 ns 9743 ns 1.08
array/sorting/1d 2735412 ns 2734808 ns 1.00
array/sorting/2d 1068472 ns 1071711 ns 1.00
array/sorting/by 3305356 ns 3304331 ns 1.00
cuda/synchronization/context/auto 1111 ns 1162.7 ns 0.96
cuda/synchronization/context/blocking 922.3571428571429 ns 928.3666666666667 ns 0.99
cuda/synchronization/context/nonblocking 7181.1 ns 6978.1 ns 1.03
cuda/synchronization/stream/auto 981 ns 994.2105263157895 ns 0.99
cuda/synchronization/stream/blocking 838.9 ns 840.6543209876543 ns 1.00
cuda/synchronization/stream/nonblocking 7158.2 ns 7241.299999999999 ns 0.99
integration/byval/reference 143668 ns 143552.5 ns 1.00
integration/byval/slices=1 145721 ns 145385 ns 1.00
integration/byval/slices=2 284356 ns 284202.5 ns 1.00
integration/byval/slices=3 422887 ns 422620 ns 1.00
integration/cudadevrt 102278 ns 102213 ns 1.00
integration/volumerhs 23411536.5 ns 23474593 ns 1.00
kernel/indexing 12933 ns 12944 ns 1.00
kernel/indexing_checked 13774 ns 13691.5 ns 1.01
kernel/launch 2065.8888888888887 ns 2106.5555555555557 ns 0.98
kernel/occupancy 677.3290322580646 ns 665.54375 ns 1.02
kernel/rand 14197 ns 13998 ns 1.01
latency/import 3890103566 ns 3829156614.5 ns 1.02
latency/precompile 4592777161 ns 4603348569 ns 1.00
latency/ttfp 4431890889.5 ns 4403053778.5 ns 1.01

This comment was automatically generated by a workflow using github-action-benchmark.

AntonOresten (Contributor, Author) commented May 1, 2026

While I am convinced :baseline is the correct default for backward compatibility (in CUDA.jl) and forward compatibility (in PTX?), @cuda feature_set=:architecture is quite verbose, especially considering @cuda only takes arguments on a single line. Perhaps it could be made to take blocks (see JuliaGPU/cuTile.jl#40 (comment)), or :architecture could be shortened to :arch.

EDIT: Shortened :architecture to :arch.

codecov (bot) commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.42%. Comparing base (d627b3e) to head (a5bbd52).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3124      +/-   ##
==========================================
+ Coverage   16.41%   16.42%   +0.01%     
==========================================
  Files         123      123              
  Lines        9678     9678              
==========================================
+ Hits         1589     1590       +1     
+ Misses       8089     8088       -1     
