
Adapt to CUDA.jl v6 #2749

Merged
avik-pal merged 18 commits into main from mg/cuda-6 on Apr 19, 2026

Conversation

@giordano
Member

@giordano giordano commented Mar 30, 2026

Not quite ready, especially because it depends on upstream packages (ArrayInterface, Flux, Lux, LuxLib, NNlib, NonuniformFFTs, OneHotArrays) first adapting to the upcoming CUDA v6, but I'm saving my progress so far; at least with these changes I can barely precompile the CUDA extension.

@giordano giordano marked this pull request as draft March 30, 2026 16:57
@wsmoses
Member

wsmoses commented Mar 30, 2026

is it possible to wait on this for 2 weeks?

Comment thread ext/ReactantCUDAExt.jl
-    CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo),
-    CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx);
+    GPUCompiler.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo),
+    CUDACore.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx);
Member

this presumably breaks on cuda 5 right?

Member Author

As is right now yes, but I'm pretty sure we can simply define const CUDACore = CUDA when CUDACore isn't defined, thus making all changes compatible with v5 as well. And I pushed some changes in CUDA.jl itself to break fewer things (like making all the @device_* macros available in the CUDA scope)
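
The alias trick mentioned above can be sketched as follows. This is a minimal, self-contained illustration with stand-in modules and a hypothetical `cudacore` helper, not the real extension code; the real change would be a top-level `const CUDACore = CUDA` fallback when `CUDA.CUDACore` is not defined.

```julia
# Minimal sketch of the compatibility alias, assuming the only relevant
# difference is whether the compiler internals live in a `CUDACore`
# submodule (CUDA.jl v6) or in `CUDA` itself (v5). The modules below are
# hypothetical stand-ins, not the real packages.
module FakeCUDAv5 end            # v5 layout: everything lives in CUDA directly

module FakeCUDAv6
    module CUDACore end          # v6 layout: internals moved to CUDACore
end

# Resolve the module that holds the compiler internals for either layout.
cudacore(cuda::Module) = isdefined(cuda, :CUDACore) ? cuda.CUDACore : cuda

@assert cudacore(FakeCUDAv5) === FakeCUDAv5
@assert cudacore(FakeCUDAv6) === FakeCUDAv6.CUDACore
```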

Member Author

@giordano giordano Mar 31, 2026

With cff4b0a the extension should be fully compatible with both CUDA v5 and v6.

@giordano
Member Author

is it possible to wait on this for 2 weeks?

As I mentioned above and elsewhere, this requires a lot of other packages to update to CUDA.jl v6 (which isn't even released)

@giordano
Member Author

Side note, half of the changes to the extension are actually bug fixes independent of the upgrade to v6 (which only exposed the bugs), like trying to import symbols from the wrong modules.

Comment thread ext/ReactantCUDAExt.jl Outdated
@giordano giordano marked this pull request as ready for review April 17, 2026 17:27
@giordano
Member Author

This is ready for review. I'd like to tag the new version after this is merged; some users have already reported issues with Reactant being downgraded to very old versions when installing CUDA in the same environment.

@giordano giordano requested a review from wsmoses April 17, 2026 17:30
@avik-pal
Collaborator

KA tests are broken

@giordano
Member Author

giordano commented Apr 18, 2026

It'd make my life easier to get a direct link, instead of having to scavenge the test logs.

@giordano
Member Author

https://github.com/EnzymeAD/Reactant.jl/actions/runs/24578105753/job/71868640907?pr=2749#step:23:1470

ERROR: The following 1 direct dependency failed to precompile:

CUDAExt --code-coverage=@/home/runner/work/Reactant.jl/Reactant.jl --color=yes --check-bounds=yes --warn-overwrite=yes --depwarn=yes --inline=yes --startup-file=no --track-allocation=none --check-bounds=yes --compiled-modules=yes --depwarn=yes 

Failed to precompile CUDAExt [8d20f71a-eaa5-5402-8cb1-1e6062ff668e] to "/home/runner/.julia/compiled/v1.11/CUDAExt/jl_RwIJ4b".
WARNING: importing deprecated binding CUDA.CUBLAS into CUDAExt, use cuBLAS instead.
ERROR: LoadError: UndefVarError: `APIUtils` not defined in `CUDA`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] getproperty(x::Module, f::Symbol)
   @ Base ./Base.jl:42
 [2] top-level scope
   @ ~/.julia/packages/LuxLib/lngmK/ext/CUDAExt/cublaslt.jl:72
 [3] include(mod::Module, _path::String)
   @ Base ./Base.jl:562
 [4] include(x::String)
   @ CUDAExt ~/.julia/packages/LuxLib/lngmK/ext/CUDAExt/CUDAExt.jl:1
 [5] top-level scope
   @ ~/.julia/packages/LuxLib/lngmK/ext/CUDAExt/CUDAExt.jl:11
 [6] include
   @ ./Base.jl:562 [inlined]
 [7] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::Nothing)
   @ Base ./loading.jl:2924
 [8] top-level scope
   @ stdin:6
in expression starting at /home/runner/.julia/packages/LuxLib/lngmK/ext/CUDAExt/cublaslt.jl:72
in expression starting at /home/runner/.julia/packages/LuxLib/lngmK/ext/CUDAExt/CUDAExt.jl:1
in expression starting at stdin:

Looks to me like a bug in LuxLib.

@giordano
Member Author

giordano commented Apr 18, 2026

I'm going to assume the "broken KA tests" are https://buildkite.com/julialang/reactant-dot-jl/builds/17726#019d9c7c-885e-47ad-94da-9f8c2b1de61b/L2389. A standalone reproducer is (requires an Nvidia GPU)

julia> using CUDA, KernelAbstractions, Reactant

julia> @kernel function square_kernel!(y, @Const(x))
           i = @index(Global)
           @inbounds y[i] = x[i] * x[i]
       end
square_kernel! (generic function with 4 methods)

julia> function square(x)
           y = similar(x)
           backend = KernelAbstractions.get_backend(x)
           kernel! = square_kernel!(backend)
           kernel!(y, x; ndrange=length(x))
           return y
       end
square (generic function with 1 method)

julia> x = Reactant.to_rarray(collect(1:1:64) ./ 64);

julia> @jit(raise = false, square(x));
Warning: detected a stack overflow; program state may be corrupted, so further execution might be unreliable.
Warning: detected a stack overflow; program state may be corrupted, so further execution might be unreliable.
ERROR: StackOverflowError:
Stacktrace:
     [1] rethrow()
       @ Base ./error.jl:71
     [2] macro expansion
       @ ./lock.jl:378 [inlined]
     [3] cufunction(f::typeof(gpu_square_kernel!), tt::Type{Tuple{…}}; kwargs::@Kwargs{})
       @ ReactantCUDAExt /mnt/giordano/.julia/dev/Reactant/ext/ReactantCUDAExt.jl:1599
     [4] call_with_reactant
       @ ./none:-1 [inlined]
     [5] call_with_reactant(::Reactant.EnsureReturnType{Any}, ::typeof(cufunction), ::typeof(gpu_square_kernel!), ::Type{Tuple{…}})
       @ Reactant /mnt/giordano/.julia/dev/Reactant/src/utils.jl:0
     [6] #launch_configuration#9
       @ /mnt/giordano/.julia/dev/Reactant/ext/ReactantCUDAExt.jl:614
     [7] call_with_reactant
       @ ./none:-1 [inlined]
     [8] call_with_reactant(::Reactant.EnsureReturnType{…}, ::ReactantCUDAExt.var"##launch_configuration#9", ::Int64, ::Int64, ::typeof(launch_configuration), ::Reactant.Compiler.LLVMFunc{…})
       @ Reactant /mnt/giordano/.julia/dev/Reactant/src/utils.jl:0
--- the above 2 lines are repeated 1 more time ---
--- the above 5 lines are repeated 6868 more times ---
 [34351] ka_with_reactant
       @ /mnt/giordano/.julia/dev/Reactant/ext/ReactantCUDAExt.jl:535
 [34352] call_with_reactant
       @ ./none:-1 [inlined]
 [34353] call_with_reactant(::typeof(Reactant.ka_with_reactant), ::Int64, ::Nothing, ::KernelAbstractions.Kernel{…}, ::Reactant.TracedRArray{…}, ::Reactant.TracedRArray{…})
       @ Reactant /mnt/giordano/.julia/dev/Reactant/src/utils.jl:0
 [34354] (::KernelAbstractions.Kernel{…})(::Reactant.TracedRArray{…}, ::Vararg{…}; ndrange::Int64, workgroupsize::Nothing)
       @ ReactantKernelAbstractionsExt /mnt/giordano/.julia/dev/Reactant/ext/ReactantKernelAbstractionsExt.jl:128
 [34355] square
       @ ./REPL[5]:5
 [34356] call_with_reactant
       @ ./none:-1 [inlined]
 [34357] call_with_reactant(::typeof(square), ::Reactant.TracedRArray{Float64, 1})
       @ Reactant /mnt/giordano/.julia/dev/Reactant/src/utils.jl:0
 [34358] make_mlir_fn(f::typeof(square), args::Tuple{…}, kwargs::@NamedTuple{}, name::String, concretein::Bool; toscalar::Bool, return_dialect::Symbol, args_in_result::Symbol, construct_function_without_args::Bool, do_transpose::Bool, within_autodiff::Bool, input_shardings::Nothing, output_shardings::Nothing, runtime::Val{…}, verify_arg_names::Nothing, argprefix::Symbol, resprefix::Symbol, resargprefix::Symbol, num_replicas::Int64, optimize_then_pad::Bool)
       @ Reactant.TracedUtils /mnt/giordano/.julia/dev/Reactant/src/TracedUtils.jl:370

@giordano
Member Author

giordano commented Apr 18, 2026

I'll need some help digging into this. I followed

function Reactant.ka_with_reactant(ndrange, workgroupsize, obj, args...)
    backend = KA.backend(obj)
    ndrange, workgroupsize, iterspace, dynamic = KA.launch_config(
        obj, ndrange, workgroupsize
    )
    # this might not be the final context, since we may tune the workgroupsize
    ctx = KA.mkcontext(obj, ndrange, iterspace)
    # If the kernel is statically sized we can tell the compiler about that
    if KA.workgroupsize(obj) <: KA.StaticSize
        maxthreads = prod(KA.get(KA.workgroupsize(obj)))
    else
        maxthreads = nothing
    end
    kernel = CUDA.@cuda launch = false always_inline = backend.always_inline maxthreads =
        maxthreads obj.f(ctx, args...)
and tried

using CUDA, KernelAbstractions, Reactant

const KA = KernelAbstractions

@kernel function square_kernel!(y, @Const(x))
    i = @index(Global)
    @inbounds y[i] = x[i] * x[i]
end

x = Reactant.to_rarray(collect(1:1:64) ./ 64);
y = similar(x);
backend = KernelAbstractions.get_backend(x)
kernel! = square_kernel!(backend)

ndrange, workgroupsize = length(x), nothing
obj = kernel!
args = (y, x);

ndrange, workgroupsize, iterspace, dynamic = KA.launch_config(
    obj, ndrange, workgroupsize
)
ctx = KA.mkcontext(obj, ndrange, iterspace)
maxthreads = nothing

kernel = CUDA.@cuda launch = false always_inline = backend.always_inline maxthreads =
    maxthreads obj.f(ctx, args...)

but I get various errors at the CUDA.@cuda call, on both main and this branch, so I'm likely doing something wrong. What's the flow for compiling a KernelAbstractions kernel?

@wsmoses
Member

wsmoses commented Apr 18, 2026

@maleadt was there any change to cufunction/friends?

@giordano
Member Author

The stacktrace suggests the stack overflow happens in CUDA.launch_configuration at

config = CUDA.launch_configuration(kernel.fun; max_threads=prod(ndrange))
but in my attempt to reduce the error I'm stuck before getting to that line.

@giordano
Member Author

How does

@noinline function CUDA.launch_configuration(
    f::LLVMFunc{F,tt}; shmem::Union{Integer,Base.Callable}=0, max_threads::Integer=0
) where {F,tt}
    return CUDA.launch_configuration(
        Base.inferencebarrier(CUDA.cufunction)(f.f, Tuple{tt.parameters[2:end]...}).fun;
        shmem,
        max_threads,
    )
end

work? cufunction is overlaid at

Reactant.@reactant_overlay @noinline function CUDA.cufunction(
    f::F, tt::TT=Tuple{}; kwargs...
) where {F,TT}
    res = Base.@lock CUDA.cufunction_lock begin
        # compile the function
        cache = llvm_compiler_cache(MLIR.IR.current_module())
        effective_tt = _substitute_bfloat16_tt(
            tt, Reactant.Compiler.BFLOAT16_COMPILE_TYPE[]
        )
        source = CUDA.methodinstance(F, effective_tt)
        # cuda = CUDA.active_state()
        device = nothing # cuda.device
        # config = CUDA.compiler_config(device; kwargs...)::CUDA.CUDACompilerConfig
        cuda_cap = v"5.0"
        cuda_ptx = v"6.3"
        llvm_cap = v"5.0"
        llvm_ptx = v"6.3"
        kernel = true
        always_inline = false
        name = nothing
        debuginfo = false
        config = GPUCompiler.CompilerConfig(
            CUDA.PTXCompilerTarget(; cap=llvm_cap, ptx=llvm_ptx, debuginfo),
            CUDA.CUDACompilerParams(; cap=cuda_cap, ptx=cuda_ptx);
            kernel,
            name,
            always_inline,
            optimize=false,
            cleanup=false,
            validate=false,
            libraries=false,
        )
        GPUCompiler.cached_compilation(cache, source, config, compile, link)
    end
    return Core.Typeof(res)(f, res.entry)
end
but as far as I can tell it returns Reactant.Compiler.LLVMFunc on main as well, so CUDA.launch_configuration is effectively an infinitely recursive function. Did this only work by chance so far? Or am I missing something?
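
To make the recursion concrete, here is a toy, self-contained model of the pattern; `MockLLVMFunc` and the `mock_*` functions are simplified stand-ins, not the real Reactant/CUDA definitions, and a depth counter stands in for the real StackOverflowError.

```julia
# Toy model of the recursion: under the overlay, cufunction returns an
# LLVMFunc wrapper instead of a plain CUDA function, so launch_configuration
# dispatches back to the LLVMFunc method and never reaches the native one.
struct MockLLVMFunc
    f::Any
end

# Stands in for the overlayed CUDA.cufunction: always wraps in MockLLVMFunc.
mock_cufunction(f) = MockLLVMFunc(f)

# Stands in for CUDA.launch_configuration(::LLVMFunc); a depth limit
# replaces the real StackOverflowError.
function mock_launch_configuration(f::MockLLVMFunc; depth=0, maxdepth=10)
    depth > maxdepth && return :would_stack_overflow
    inner = mock_cufunction(f.f)   # still a MockLLVMFunc, so we recurse
    return mock_launch_configuration(inner; depth=depth + 1, maxdepth)
end

@assert mock_launch_configuration(MockLLVMFunc(identity)) === :would_stack_overflow
```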

@wsmoses
Member

wsmoses commented Apr 18, 2026

we should really do:

Base.inferencebarrier(CUDA.cufunction)(f.f, Tuple{tt.parameters[2:end]...}).fun;

->

call_with_native(CUDA.cufunction, f.f, Tuple{tt.parameters[2:end]...}).fun;

since essentially the gist there is that within the Reactant interp we can call into the native interp result

@giordano
Member Author

diff --git a/ext/ReactantCUDAExt.jl b/ext/ReactantCUDAExt.jl
index f1ce6e1dd..6f3c419ef 100644
--- a/ext/ReactantCUDAExt.jl
+++ b/ext/ReactantCUDAExt.jl
@@ -7,7 +7,8 @@ using Reactant:
     AnyConcretePJRTArray,
     MLIR,
     TracedRNumber,
-    ReactantPrecompilationException
+    ReactantPrecompilationException,
+    call_with_native
 using Reactant.Compiler: raising, LLVMFunc, llvm_compiler_cache
 using Reactant.Ops: @opcall
 
@@ -612,7 +613,7 @@ end
     f::LLVMFunc{F,tt}; shmem::Union{Integer,Base.Callable}=0, max_threads::Integer=0
 ) where {F,tt}
     return CUDA.launch_configuration(
-        Base.inferencebarrier(CUDA.cufunction)(f.f, Tuple{tt.parameters[2:end]...}).fun;
+        call_with_native(CUDA.cufunction, f.f, Tuple{tt.parameters[2:end]...}).fun;
         shmem,
         max_threads,
     )

does fix the issue!
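
For intuition, why routing through the native compile breaks the cycle can be modeled with a self-contained toy (all names here — `Wrapper`, `NativeKernel`, `overlay_cufunction`, `native_cufunction`, `launch_config` — are hypothetical stand-ins, not the real API):

```julia
# Toy model of the fix: dispatching on the wrapper re-enters the same
# method, while the "native" call returns a plain handle, so
# launch_config reaches a base case instead of recursing.
struct Wrapper          # stands in for Reactant.Compiler.LLVMFunc
    f::Any
end
struct NativeKernel     # stands in for a plain compiled kernel handle
    fun::Symbol
end

# Overlayed compile: always hands back another Wrapper (what caused the loop).
overlay_cufunction(f) = Wrapper(f)
# Native compile: produces a plain kernel handle (what call_with_native reaches).
native_cufunction(f) = NativeKernel(:cufun)

launch_config(k::NativeKernel) = (blocks = 1, threads = 256)  # base case
# The fix in the diff above, in miniature: route through the native compile
# instead of the overlayed one, so dispatch reaches the base case.
launch_config(w::Wrapper) = launch_config(native_cufunction(w.f))

@assert launch_config(Wrapper(identity)) == (blocks = 1, threads = 256)
```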

Member

@wsmoses wsmoses left a comment

If it passes LGTM

@giordano
Member Author

I believe we need to wait for LuxDL/Lux.jl#1696 for a cleaner run; the Lux integration tests are going to fail without it.

@avik-pal
Collaborator

JuliaRegistries/General#153282

@avik-pal avik-pal merged commit 52def24 into main Apr 19, 2026
117 of 126 checks passed
@avik-pal avik-pal deleted the mg/cuda-6 branch April 19, 2026 15:16