Add feature_set kwarg for selecting PTX target suffix#3124
Open
AntonOresten wants to merge 5 commits intoJuliaGPU:masterfrom
Open
Add feature_set kwarg for selecting PTX target suffix#3124AntonOresten wants to merge 5 commits intoJuliaGPU:masterfrom
feature_set kwarg for selecting PTX target suffix#3124AntonOresten wants to merge 5 commits intoJuliaGPU:masterfrom
Conversation
Contributor
There was a problem hiding this comment.
CUDA.jl Benchmarks
Details
| Benchmark suite | Current: a5bbd52 | Previous: d627b3e | Ratio |
|---|---|---|---|
array/accumulate/Float32/1d |
100948 ns |
100853 ns |
1.00 |
array/accumulate/Float32/dims=1 |
77293 ns |
76362 ns |
1.01 |
array/accumulate/Float32/dims=1L |
1585549 ns |
1585357 ns |
1.00 |
array/accumulate/Float32/dims=2 |
143961 ns |
143506 ns |
1.00 |
array/accumulate/Float32/dims=2L |
657709 ns |
657416 ns |
1.00 |
array/accumulate/Int64/1d |
119018.5 ns |
118459.5 ns |
1.00 |
array/accumulate/Int64/dims=1 |
80021 ns |
79599 ns |
1.01 |
array/accumulate/Int64/dims=1L |
1694727 ns |
1694409 ns |
1.00 |
array/accumulate/Int64/dims=2 |
156115 ns |
155450.5 ns |
1.00 |
array/accumulate/Int64/dims=2L |
961658 ns |
960958 ns |
1.00 |
array/broadcast |
20478 ns |
20282 ns |
1.01 |
array/construct |
1257.4 ns |
1328.2 ns |
0.95 |
array/copy |
17904 ns |
17911 ns |
1.00 |
array/copyto!/cpu_to_gpu |
214035 ns |
212984 ns |
1.00 |
array/copyto!/gpu_to_cpu |
283301 ns |
280259 ns |
1.01 |
array/copyto!/gpu_to_gpu |
10796 ns |
10684 ns |
1.01 |
array/iteration/findall/bool |
134546 ns |
134327 ns |
1.00 |
array/iteration/findall/int |
149759 ns |
149222 ns |
1.00 |
array/iteration/findfirst/bool |
81579.5 ns |
81163 ns |
1.01 |
array/iteration/findfirst/int |
83649.5 ns |
82984 ns |
1.01 |
array/iteration/findmin/1d |
84753 ns |
85430.5 ns |
0.99 |
array/iteration/findmin/2d |
116861 ns |
116921 ns |
1.00 |
array/iteration/logical |
200246.5 ns |
196920 ns |
1.02 |
array/iteration/scalar |
67181 ns |
67453 ns |
1.00 |
array/permutedims/2d |
52272 ns |
52024.5 ns |
1.00 |
array/permutedims/3d |
53031 ns |
52690 ns |
1.01 |
array/permutedims/4d |
51658 ns |
51921.5 ns |
0.99 |
array/random/rand/Float32 |
12992 ns |
12625 ns |
1.03 |
array/random/rand/Int64 |
25064 ns |
24502 ns |
1.02 |
array/random/rand!/Float32 |
9158.666666666666 ns |
8842 ns |
1.04 |
array/random/rand!/Int64 |
21818 ns |
21520 ns |
1.01 |
array/random/randn/Float32 |
43441.5 ns |
38339 ns |
1.13 |
array/random/randn!/Float32 |
30666 ns |
30558 ns |
1.00 |
array/reductions/mapreduce/Float32/1d |
34244 ns |
34493 ns |
0.99 |
array/reductions/mapreduce/Float32/dims=1 |
39467.5 ns |
39238 ns |
1.01 |
array/reductions/mapreduce/Float32/dims=1L |
51284.5 ns |
51115 ns |
1.00 |
array/reductions/mapreduce/Float32/dims=2 |
56416 ns |
56362 ns |
1.00 |
array/reductions/mapreduce/Float32/dims=2L |
69587.5 ns |
68855 ns |
1.01 |
array/reductions/mapreduce/Int64/1d |
42688 ns |
41840 ns |
1.02 |
array/reductions/mapreduce/Int64/dims=1 |
42730 ns |
43041.5 ns |
0.99 |
array/reductions/mapreduce/Int64/dims=1L |
87079 ns |
87062 ns |
1.00 |
array/reductions/mapreduce/Int64/dims=2 |
59466 ns |
59400.5 ns |
1.00 |
array/reductions/mapreduce/Int64/dims=2L |
84461 ns |
84650 ns |
1.00 |
array/reductions/reduce/Float32/1d |
34678 ns |
34748 ns |
1.00 |
array/reductions/reduce/Float32/dims=1 |
39520.5 ns |
42567 ns |
0.93 |
array/reductions/reduce/Float32/dims=1L |
51380 ns |
51269 ns |
1.00 |
array/reductions/reduce/Float32/dims=2 |
56874 ns |
56421.5 ns |
1.01 |
array/reductions/reduce/Float32/dims=2L |
69919 ns |
69423 ns |
1.01 |
array/reductions/reduce/Int64/1d |
42499 ns |
41786 ns |
1.02 |
array/reductions/reduce/Int64/dims=1 |
42508.5 ns |
41701 ns |
1.02 |
array/reductions/reduce/Int64/dims=1L |
86990 ns |
87004 ns |
1.00 |
array/reductions/reduce/Int64/dims=2 |
59418 ns |
59173 ns |
1.00 |
array/reductions/reduce/Int64/dims=2L |
84416.5 ns |
84474 ns |
1.00 |
array/reverse/1d |
17777 ns |
17689 ns |
1.00 |
array/reverse/1dL |
68390 ns |
68239 ns |
1.00 |
array/reverse/1dL_inplace |
65667 ns |
65528 ns |
1.00 |
array/reverse/1d_inplace |
10290.666666666666 ns |
8294.666666666666 ns |
1.24 |
array/reverse/2d |
20830 ns |
20296 ns |
1.03 |
array/reverse/2dL |
72840 ns |
72502 ns |
1.00 |
array/reverse/2dL_inplace |
65730 ns |
65556 ns |
1.00 |
array/reverse/2d_inplace |
10515 ns |
9743 ns |
1.08 |
array/sorting/1d |
2735412 ns |
2734808 ns |
1.00 |
array/sorting/2d |
1068472 ns |
1071711 ns |
1.00 |
array/sorting/by |
3305356 ns |
3304331 ns |
1.00 |
cuda/synchronization/context/auto |
1111 ns |
1162.7 ns |
0.96 |
cuda/synchronization/context/blocking |
922.3571428571429 ns |
928.3666666666667 ns |
0.99 |
cuda/synchronization/context/nonblocking |
7181.1 ns |
6978.1 ns |
1.03 |
cuda/synchronization/stream/auto |
981 ns |
994.2105263157895 ns |
0.99 |
cuda/synchronization/stream/blocking |
838.9 ns |
840.6543209876543 ns |
1.00 |
cuda/synchronization/stream/nonblocking |
7158.2 ns |
7241.299999999999 ns |
0.99 |
integration/byval/reference |
143668 ns |
143552.5 ns |
1.00 |
integration/byval/slices=1 |
145721 ns |
145385 ns |
1.00 |
integration/byval/slices=2 |
284356 ns |
284202.5 ns |
1.00 |
integration/byval/slices=3 |
422887 ns |
422620 ns |
1.00 |
integration/cudadevrt |
102278 ns |
102213 ns |
1.00 |
integration/volumerhs |
23411536.5 ns |
23474593 ns |
1.00 |
kernel/indexing |
12933 ns |
12944 ns |
1.00 |
kernel/indexing_checked |
13774 ns |
13691.5 ns |
1.01 |
kernel/launch |
2065.8888888888887 ns |
2106.5555555555557 ns |
0.98 |
kernel/occupancy |
677.3290322580646 ns |
665.54375 ns |
1.02 |
kernel/rand |
14197 ns |
13998 ns |
1.01 |
latency/import |
3890103566 ns |
3829156614.5 ns |
1.02 |
latency/precompile |
4592777161 ns |
4603348569 ns |
1.00 |
latency/ttfp |
4431890889.5 ns |
4403053778.5 ns |
1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
Contributor
Author
|
While I am convinced EDIT: Shortened |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #3124 +/- ##
==========================================
+ Coverage 16.41% 16.42% +0.01%
==========================================
Files 123 123
Lines 9678 9678
==========================================
+ Hits 1589 1590 +1
+ Misses 8089 8088 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds support for family-specific (
sm_NNNf) and architecture-specific (sm_NNNa) PTX targets, enabling access to a wider set of low-level instructions. Builds on #3120;PTXCompilerTargethas no field for the suffix, so we rewrite.targetand pass the matching--gpu-nametoptxas, sidestepping LLVM's NVPTX coverage.NVIDIA defines three feature sets:
sm_90): forward-compatible.fsuffix, e.g.sm_100f): same-major-family-portable. Requires CC ≥ 10.0 and PTX ≥ 8.8.asuffix, e.g.sm_90a): locked to one exact CC. Requires CC ≥ 9.0 and PTX ≥ 8.0.with hierarchy baseline ⊆ family ⊆ architecture.
The previous behavior remains as the default through
:baseline. To unblockwgmma,tcgen05, and friends, explicit opt-in is required through a newfeature_setkwarg oncufunction,@cuda, etc. that takes aSymbol.Relevant docs:
Supercedes #3122 (renamed branch; seems beyond rescue)