Add feature_set kwarg for selecting PTX target suffix #3124

Open

AntonOresten wants to merge 5 commits into JuliaGPU:master from AntonOresten:feature-set

Conversation

AntonOresten (Contributor) commented Apr 30, 2026

Adds support for family-specific (sm_NNNf) and architecture-specific (sm_NNNa) PTX targets, enabling access to a wider set of low-level instructions. Builds on #3120; PTXCompilerTarget has no field for the suffix, so we rewrite .target and pass the matching --gpu-name to ptxas, sidestepping gaps in LLVM's NVPTX coverage.

NVIDIA defines three feature sets:

  • baseline (no suffix, e.g. sm_90): forward-compatible.
  • family (f suffix, e.g. sm_100f): portable within the same major family. Requires CC ≥ 10.0 and PTX ISA ≥ 8.8.
  • architecture (a suffix, e.g. sm_90a): locked to one exact CC. Requires CC ≥ 9.0 and PTX ISA ≥ 8.0.

with hierarchy baseline ⊆ family ⊆ architecture.
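
For concreteness, a minimal sketch of how a feature-set choice could map to the PTX target name handed to ptxas via --gpu-name; the function and its structure are illustrative, not the PR's actual code:

```julia
# Illustrative sketch only: map a compute capability plus a feature_set
# Symbol to the PTX target name passed to ptxas via --gpu-name.
function ptx_target_name(cap::VersionNumber, feature_set::Symbol)
    base = "sm_$(cap.major)$(cap.minor)"
    feature_set === :baseline && return base        # forward-compatible
    feature_set === :family   && return base * "f"  # needs CC ≥ 10.0, PTX ISA ≥ 8.8
    feature_set === :arch     && return base * "a"  # needs CC ≥ 9.0, PTX ISA ≥ 8.0
    throw(ArgumentError("unknown feature_set: $feature_set"))
end

ptx_target_name(v"9.0", :arch)     # "sm_90a"
ptx_target_name(v"10.0", :family)  # "sm_100f"
```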

The previous behavior remains the default through :baseline. To unblock wgmma, tcgen05, and friends, explicit opt-in is required through a new feature_set kwarg on cufunction, @cuda, etc., which takes a Symbol.
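
As a usage sketch (the kernel and launch configuration are made up; feature_set is the kwarg this PR adds, with :architecture later shortened to :arch, per the comment below):

```julia
using CUDA

function vadd!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    @inbounds c[i] = a[i] + b[i]
    return
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = similar(a)

# Default: baseline feature set (e.g. sm_90), forward-compatible.
@cuda threads=256 blocks=4 vadd!(c, a, b)

# Explicit opt-in to the architecture-specific target (e.g. sm_90a),
# unlocking instructions like wgmma on supporting hardware.
@cuda threads=256 blocks=4 feature_set=:arch vadd!(c, a, b)
```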

Relevant docs:

Supersedes #3122 (renamed branch; seems beyond rescue).

github-actions (bot) left a comment


CUDA.jl Benchmarks

Benchmark suite Current: a5bbd52 Previous: d627b3e Ratio
array/accumulate/Float32/1d 100948 ns 100853 ns 1.00
array/accumulate/Float32/dims=1 77293 ns 76362 ns 1.01
array/accumulate/Float32/dims=1L 1585549 ns 1585357 ns 1.00
array/accumulate/Float32/dims=2 143961 ns 143506 ns 1.00
array/accumulate/Float32/dims=2L 657709 ns 657416 ns 1.00
array/accumulate/Int64/1d 119018.5 ns 118459.5 ns 1.00
array/accumulate/Int64/dims=1 80021 ns 79599 ns 1.01
array/accumulate/Int64/dims=1L 1694727 ns 1694409 ns 1.00
array/accumulate/Int64/dims=2 156115 ns 155450.5 ns 1.00
array/accumulate/Int64/dims=2L 961658 ns 960958 ns 1.00
array/broadcast 20478 ns 20282 ns 1.01
array/construct 1257.4 ns 1328.2 ns 0.95
array/copy 17904 ns 17911 ns 1.00
array/copyto!/cpu_to_gpu 214035 ns 212984 ns 1.00
array/copyto!/gpu_to_cpu 283301 ns 280259 ns 1.01
array/copyto!/gpu_to_gpu 10796 ns 10684 ns 1.01
array/iteration/findall/bool 134546 ns 134327 ns 1.00
array/iteration/findall/int 149759 ns 149222 ns 1.00
array/iteration/findfirst/bool 81579.5 ns 81163 ns 1.01
array/iteration/findfirst/int 83649.5 ns 82984 ns 1.01
array/iteration/findmin/1d 84753 ns 85430.5 ns 0.99
array/iteration/findmin/2d 116861 ns 116921 ns 1.00
array/iteration/logical 200246.5 ns 196920 ns 1.02
array/iteration/scalar 67181 ns 67453 ns 1.00
array/permutedims/2d 52272 ns 52024.5 ns 1.00
array/permutedims/3d 53031 ns 52690 ns 1.01
array/permutedims/4d 51658 ns 51921.5 ns 0.99
array/random/rand/Float32 12992 ns 12625 ns 1.03
array/random/rand/Int64 25064 ns 24502 ns 1.02
array/random/rand!/Float32 9158.666666666666 ns 8842 ns 1.04
array/random/rand!/Int64 21818 ns 21520 ns 1.01
array/random/randn/Float32 43441.5 ns 38339 ns 1.13
array/random/randn!/Float32 30666 ns 30558 ns 1.00
array/reductions/mapreduce/Float32/1d 34244 ns 34493 ns 0.99
array/reductions/mapreduce/Float32/dims=1 39467.5 ns 39238 ns 1.01
array/reductions/mapreduce/Float32/dims=1L 51284.5 ns 51115 ns 1.00
array/reductions/mapreduce/Float32/dims=2 56416 ns 56362 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 69587.5 ns 68855 ns 1.01
array/reductions/mapreduce/Int64/1d 42688 ns 41840 ns 1.02
array/reductions/mapreduce/Int64/dims=1 42730 ns 43041.5 ns 0.99
array/reductions/mapreduce/Int64/dims=1L 87079 ns 87062 ns 1.00
array/reductions/mapreduce/Int64/dims=2 59466 ns 59400.5 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 84461 ns 84650 ns 1.00
array/reductions/reduce/Float32/1d 34678 ns 34748 ns 1.00
array/reductions/reduce/Float32/dims=1 39520.5 ns 42567 ns 0.93
array/reductions/reduce/Float32/dims=1L 51380 ns 51269 ns 1.00
array/reductions/reduce/Float32/dims=2 56874 ns 56421.5 ns 1.01
array/reductions/reduce/Float32/dims=2L 69919 ns 69423 ns 1.01
array/reductions/reduce/Int64/1d 42499 ns 41786 ns 1.02
array/reductions/reduce/Int64/dims=1 42508.5 ns 41701 ns 1.02
array/reductions/reduce/Int64/dims=1L 86990 ns 87004 ns 1.00
array/reductions/reduce/Int64/dims=2 59418 ns 59173 ns 1.00
array/reductions/reduce/Int64/dims=2L 84416.5 ns 84474 ns 1.00
array/reverse/1d 17777 ns 17689 ns 1.00
array/reverse/1dL 68390 ns 68239 ns 1.00
array/reverse/1dL_inplace 65667 ns 65528 ns 1.00
array/reverse/1d_inplace 10290.666666666666 ns 8294.666666666666 ns 1.24
array/reverse/2d 20830 ns 20296 ns 1.03
array/reverse/2dL 72840 ns 72502 ns 1.00
array/reverse/2dL_inplace 65730 ns 65556 ns 1.00
array/reverse/2d_inplace 10515 ns 9743 ns 1.08
array/sorting/1d 2735412 ns 2734808 ns 1.00
array/sorting/2d 1068472 ns 1071711 ns 1.00
array/sorting/by 3305356 ns 3304331 ns 1.00
cuda/synchronization/context/auto 1111 ns 1162.7 ns 0.96
cuda/synchronization/context/blocking 922.3571428571429 ns 928.3666666666667 ns 0.99
cuda/synchronization/context/nonblocking 7181.1 ns 6978.1 ns 1.03
cuda/synchronization/stream/auto 981 ns 994.2105263157895 ns 0.99
cuda/synchronization/stream/blocking 838.9 ns 840.6543209876543 ns 1.00
cuda/synchronization/stream/nonblocking 7158.2 ns 7241.299999999999 ns 0.99
integration/byval/reference 143668 ns 143552.5 ns 1.00
integration/byval/slices=1 145721 ns 145385 ns 1.00
integration/byval/slices=2 284356 ns 284202.5 ns 1.00
integration/byval/slices=3 422887 ns 422620 ns 1.00
integration/cudadevrt 102278 ns 102213 ns 1.00
integration/volumerhs 23411536.5 ns 23474593 ns 1.00
kernel/indexing 12933 ns 12944 ns 1.00
kernel/indexing_checked 13774 ns 13691.5 ns 1.01
kernel/launch 2065.8888888888887 ns 2106.5555555555557 ns 0.98
kernel/occupancy 677.3290322580646 ns 665.54375 ns 1.02
kernel/rand 14197 ns 13998 ns 1.01
latency/import 3890103566 ns 3829156614.5 ns 1.02
latency/precompile 4592777161 ns 4603348569 ns 1.00
latency/ttfp 4431890889.5 ns 4403053778.5 ns 1.01

This comment was automatically generated by a workflow using github-action-benchmark.

AntonOresten (Contributor, Author) commented May 1, 2026

While I am convinced :baseline is the correct default for backward compatibility (in CUDA.jl) and forward compatibility (in PTX?), @cuda feature_set=:architecture is quite verbose, especially considering @cuda only takes arguments on a single line. Perhaps it could be made to take blocks (see JuliaGPU/cuTile.jl#40 (comment)), or :architecture could be shortened to :arch.

EDIT: Shortened :architecture to :arch.

codecov (bot) commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.42%. Comparing base (d627b3e) to head (a5bbd52).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3124      +/-   ##
==========================================
+ Coverage   16.41%   16.42%   +0.01%     
==========================================
  Files         123      123              
  Lines        9678     9678              
==========================================
+ Hits         1589     1590       +1     
+ Misses       8089     8088       -1     
