gpucompiler.jl's People

Contributors

ali-ramadhan, aviatesk, bors[bot], chriselrod, collinwarner, dependabot[bot], dilumaluthge, dpsanders, eschnett, femtomc, gbaraldi, giordano, github-actions[bot], jonas-schulze, jpsamaroo, lcw, maleadt, pchintalapudi, qin-yu, ranocha, seelengrab, simonbyrne, sjkelly, tgymnich, tkf, vchuravy, wangwillian0, wsmoses, zentrik

gpucompiler.jl's Issues

Method overrides do not re-trigger compilation

julia> kernel() = child()
kernel (generic function with 1 method)

julia> child() = 0
child (generic function with 1 method)

julia> native_code_llvm(kernel, Tuple{})
;  @ REPL[1]:1 within `kernel`
; Function Attrs: alwaysinline
define i64 @julia_kernel_755() local_unnamed_addr #0 {
top:
  ret i64 0
}

julia> GPUCompiler.@override method_table child() = 1

julia> native_code_llvm(kernel, Tuple{})
;  @ REPL[1]:1 within `kernel`
; Function Attrs: alwaysinline
define i64 @julia_kernel_1124() local_unnamed_addr #0 {
top:
  ret i64 0
}

Rewrite calls to e.g. `Base.sin`

We should figure out a way to support users calling Base.sin instead of CUDAnative.sin. One possible option would be to hook inference, overriding method calls when operating under @cuda. The same mechanism could be used to dispatch on the hardware generation.
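One way to picture the idea of overriding method calls only for device code is an overlay method table: a separate table of replacement methods that only a custom compiler consults. The sketch below assumes Julia 1.7 or later, where `Base.Experimental.@MethodTable` and `Base.Experimental.@overlay` are available. The names `device_table` and `device_sin_stub` are hypothetical, and the stub merely stands in for a real device intrinsic. It is not GPUCompiler's implementation; it only shows that registering an overlay leaves host dispatch untouched.

```julia
# Assumes Julia >= 1.7. `device_table` is a hypothetical name for illustration.
Base.Experimental.@MethodTable(device_table)

# Hypothetical stand-in for a device intrinsic such as CUDAnative.sin.
device_sin_stub(x::Float32) = -1f0

# Register a replacement for Base.sin that lives only in the overlay table.
Base.Experimental.@overlay device_table Base.sin(x::Float32) = device_sin_stub(x)

# Host code does not consult the overlay, so Base.sin behaves as usual here.
host_result = sin(0.5f0)
println("host sin(0.5f0) = ", host_result)
```

A compiler that wants the replacement would have to look methods up through `device_table` before falling back to the global table; how that lookup is wired into inference is outside the scope of this sketch.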

Some compilation failures result in nasty errors

For example, with LLVM.jl on the breaking-changes PR but GPUCompiler.jl and CUDA.jl not updated to match, CUDA.jl fails with:

ERROR: LoadError: LoadError: AssertionError: Core.Compiler.haskey(wvc, mi)
Stacktrace:
  [1] ci_cache_populate(interp::GPUCompiler.GPUInterpreter, cache::GPUCompiler.CodeCache, mt::Nothing, mi::Core.MethodInstance, min_world::UInt64, max_world::Int32)
    @ GPUCompiler ~/Julia/pkg/GPUCompiler/src/jlgen.jl:284
  [2] compile_method_instance(job::GPUCompiler.CompilerJob, method_instance::Core.MethodInstance)
    @ GPUCompiler ~/Julia/pkg/GPUCompiler/src/jlgen.jl:327
  [3] macro expansion
    @ ~/Julia/depot/packages/TimerOutputs/PZq45/src/TimerOutput.jl:226 [inlined]
  [4] irgen(job::GPUCompiler.CompilerJob, method_instance::Core.MethodInstance)
    @ GPUCompiler ~/Julia/pkg/GPUCompiler/src/irgen.jl:4
  [5] macro expansion
    @ ~/Julia/pkg/GPUCompiler/src/driver.jl:149 [inlined]
  [6] macro expansion
    @ ~/Julia/depot/packages/TimerOutputs/PZq45/src/TimerOutput.jl:226 [inlined]
  [7] macro expansion
    @ ~/Julia/pkg/GPUCompiler/src/driver.jl:148 [inlined]
  [8] emit_llvm(job::GPUCompiler.CompilerJob, method_instance::Any; libraries::Bool, deferred_codegen::Bool, optimize::Bool, only_entry::Bool)
    @ GPUCompiler ~/Julia/pkg/GPUCompiler/src/utils.jl:62
  [9] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ GPUCompiler ~/Julia/pkg/GPUCompiler/src/driver.jl:58
 [10] emit_function!(mod::LLVM.Module, job::GPUCompiler.CompilerJob, f::Function, method::GPUCompiler.Runtime.RuntimeMethodInstance)
    @ GPUCompiler ~/Julia/pkg/GPUCompiler/src/rtlib.jl:68
 [11] build_runtime(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ GPUCompiler ~/Julia/pkg/GPUCompiler/src/rtlib.jl:114

parameters from ExprTools and LLVM

Both ExprTools.jl and LLVM.jl export parameters, so I got the following warning:

WARNING: both ExprTools and LLVM export "parameters"; uses of it in module GPUCompiler must be qualified
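The warning can be reproduced, and avoided, with plain modules. The following is a minimal sketch in which `PkgA`, `PkgB` and `Consumer` are hypothetical stand-ins for ExprTools, LLVM and GPUCompiler: when two imported modules export the same name, each use has to say which module it means.

```julia
# Two hypothetical modules that both export a function named `parameters`.
module PkgA
export parameters
parameters(x) = "PkgA parameters of $x"
end

module PkgB
export parameters
parameters(x) = "PkgB parameters of $x"
end

module Consumer
using ..PkgA, ..PkgB

# An unqualified call to `parameters` would be ambiguous here, so each
# use names the module it means.
from_a(x) = PkgA.parameters(x)
from_b(x) = PkgB.parameters(x)
end

println(Consumer.from_a(1))
println(Consumer.from_b(1))
```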

Revising recursive function results in `InvalidIRError`

When running this CUDA.jl code with Revise.jl

using CUDA

function recursive_sub(b::Int64)
    if b > 0
        recursive_sub(b-1)
    end
    return
end

function reproduce_main()

    function recursive_kernel()
        recursive_sub(5)
        return
    end

    @cuda threads=5 recursive_kernel()
end

everything works as expected. But once you change recursive_sub, for example by adding a print statement, the kernel no longer compiles: it fails with an InvalidIRError, preceded by what looks like a stack overflow in GPUCompiler.

Stacktrace
invalidate at /home/ajuvercr/.julia/packages/GPUCompiler/XwWPj/src/jlgen.jl:72 (repeats 41 times)
invalidate at /home/ajuvercr/.julia/packages/GPUCompiler/XwWPj/src/jlgen.jl:53 [inlined]
callback at /home/ajuvercr/.julia/packages/GPUCompiler/XwWPj/src/jlgen.jl:38
unknown function (ip: 0x7f7d47e39bd8)
_jl_invoke at /home/ajuvercr/julia-1.6/src/gf.c:2237 [inlined]
jl_apply_generic at /home/ajuvercr/julia-1.6/src/gf.c:2419
jl_apply at /home/ajuvercr/julia-1.6/src/julia.h:1703 [inlined]
invalidate_external at /home/ajuvercr/julia-1.6/src/gf.c:1339
jl_method_table_invalidate at /home/ajuvercr/julia-1.6/src/gf.c:1561
jl_method_table_disable at /home/ajuvercr/julia-1.6/src/gf.c:1582
delete_method at ./reflection.jl:1535 [inlined]
#79 at /home/ajuvercr/.julia/packages/Revise/VxkZO/src/packagedef.jl:276
unknown function (ip: 0x7f7d47e16118)
_jl_invoke at /home/ajuvercr/julia-1.6/src/gf.c:2237 [inlined]
jl_apply_generic at /home/ajuvercr/julia-1.6/src/gf.c:2419
with_logstate at ./logging.jl:491
with_logger at ./logging.jl:603 [inlined]
delete_missing! at /home/ajuvercr/.julia/packages/Revise/VxkZO/src/packagedef.jl:242 [inlined]
delete_missing! at /home/ajuvercr/.julia/packages/Revise/VxkZO/src/packagedef.jl:304
handle_deletions at /home/ajuvercr/.julia/packages/Revise/VxkZO/src/packagedef.jl:647
#revise#98 at /home/ajuvercr/.julia/packages/Revise/VxkZO/src/packagedef.jl:746
revise at /home/ajuvercr/.julia/packages/Revise/VxkZO/src/packagedef.jl:734
unknown function (ip: 0x7f7d47df008d)
_jl_invoke at /home/ajuvercr/julia-1.6/src/gf.c:2237 [inlined]
jl_apply_generic at /home/ajuvercr/julia-1.6/src/gf.c:2419
jl_apply at /home/ajuvercr/julia-1.6/src/julia.h:1703 [inlined]
jl_f__call_latest at /home/ajuvercr/julia-1.6/src/builtins.c:714
#invokelatest#2 at ./essentials.jl:708 [inlined]
invokelatest at ./essentials.jl:706
_jl_invoke at /home/ajuvercr/julia-1.6/src/gf.c:2237 [inlined]
jl_apply_generic at /home/ajuvercr/julia-1.6/src/gf.c:2419
jl_apply at /home/ajuvercr/julia-1.6/src/julia.h:1703 [inlined]
do_call at /home/ajuvercr/julia-1.6/src/interpreter.c:115
eval_value at /home/ajuvercr/julia-1.6/src/interpreter.c:204
eval_stmt_value at /home/ajuvercr/julia-1.6/src/interpreter.c:155 [inlined]
eval_body at /home/ajuvercr/julia-1.6/src/interpreter.c:557
jl_interpret_toplevel_thunk at /home/ajuvercr/julia-1.6/src/interpreter.c:669
top-level scope at none:1
jl_toplevel_eval_flex at /home/ajuvercr/julia-1.6/src/toplevel.c:877
jl_toplevel_eval_flex at /home/ajuvercr/julia-1.6/src/toplevel.c:825
jl_toplevel_eval_in at /home/ajuvercr/julia-1.6/src/toplevel.c:929
eval at ./boot.jl:360 [inlined]
eval_user_input at /home/ajuvercr/julia-1.6/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:139
repl_backend_loop at /home/ajuvercr/julia-1.6/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:200
start_repl_backend at /home/ajuvercr/julia-1.6/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:185
#run_repl#42 at /home/ajuvercr/julia-1.6/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:317
run_repl at /home/ajuvercr/julia-1.6/usr/share/julia/stdlib/v1.6/REPL/src/REPL.jl:305
_jl_invoke at /home/ajuvercr/julia-1.6/src/gf.c:2237 [inlined]
jl_apply_generic at /home/ajuvercr/julia-1.6/src/gf.c:2419
#876 at ./client.jl:387
jfptr_YY.876_44055 at /home/ajuvercr/julia-1.6/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /home/ajuvercr/julia-1.6/src/gf.c:2237 [inlined]
jl_apply_generic at /home/ajuvercr/julia-1.6/src/gf.c:2419
jl_apply at /home/ajuvercr/julia-1.6/src/julia.h:1703 [inlined]
jl_f__call_latest at /home/ajuvercr/julia-1.6/src/builtins.c:714
#invokelatest#2 at ./essentials.jl:708 [inlined]
invokelatest at ./essentials.jl:706 [inlined]
run_main_repl at ./client.jl:372
exec_options at ./client.jl:302
_start at ./client.jl:485
jfptr__start_49096 at /home/ajuvercr/julia-1.6/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /home/ajuvercr/julia-1.6/src/gf.c:2237 [inlined]
jl_apply_generic at /home/ajuvercr/julia-1.6/src/gf.c:2419
jl_apply at /home/ajuvercr/julia-1.6/src/julia.h:1703 [inlined]
true_main at /home/ajuvercr/julia-1.6/src/jlapi.c:560
repl_entrypoint at /home/ajuvercr/julia-1.6/src/jlapi.c:702
main at julia-1.6 (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at julia-1.6 (unknown line)
ERROR: InvalidIRError: compiling kernel recursive_kernel() resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to recursive_sub(b::Int64) in Main at /home/ajuvercr/test.jl:12)
Stacktrace:
 [1] recursive_sub
   @ ~/test.jl:15
 [2] recursive_kernel
   @ ~/test.jl:23
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{var"#recursive_kernel#1", Tuple{}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XwWPj/src/validation.jl:123
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/XwWPj/src/driver.jl:288 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/4QAIk/src/TimerOutput.jl:206 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/XwWPj/src/driver.jl:286 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module, kernel::LLVM.Function; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XwWPj/src/utils.jl:62
  [6] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/qEV3Y/src/compiler/execution.jl:306
  [7] check_cache
    @ ~/.julia/packages/GPUCompiler/XwWPj/src/cache.jl:44 [inlined]
  [8] cached_compilation
    @ ~/test.jl:23 [inlined]
  [9] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{var"#recursive_kernel#1", Tuple{}}}, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XwWPj/src/cache.jl:0
 [10] cufunction(f::var"#recursive_kernel#1", tt::Type{Tuple{}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/qEV3Y/src/compiler/execution.jl:294
 [11] cufunction
    @ ~/.julia/packages/CUDA/qEV3Y/src/compiler/execution.jl:288 [inlined]
 [12] macro expansion
    @ ~/.julia/packages/CUDA/qEV3Y/src/compiler/execution.jl:102 [inlined]
 [13] reproduce_main()
    @ Main ~/test.jl:26
 [14] top-level scope
    @ REPL[2]:1

Revisit kernel wrapping / byval lowering

#37 has reinstated the old pass that replaces pointer arguments with their value, which is bad: it is coupled to Julia semantics (what is and isn't passed), and is pretty fragile in general. I don't remember why we need to deal with Julia's ABI though; why can't we just detect the byval (pointer) arguments and replace them with a value just like NVPTX does? https://github.com/llvm-mirror/llvm/blob/2c4ca6832fa6b306ee6a7010bfb80a3f2596f824/lib/Target/NVPTX/NVPTXLowerArgs.cpp#L154-L176

irgen nondeterminism on Julia 1.5

using CUDA

try
    CUDA.CuDim3(typemax(Int64))
catch
    # InexactError(:trunc, UInt32, 9223372036854775807)
end

function kernel(a)
    i = threadIdx().x
    a[] = Int(i / 2)
    return
end

a = CuArray([0])
@cuda kernel(a)

Without the throwing call to CuDim3:

L26:                                              ; preds = %top
  call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception2 to i64))
  call fastcc void @gpu_signal_exception()
  call void asm sideeffect "exit;", ""() #4
  br label %L24

With it:

L26:                                              ; preds = %top
  %14 = call %jl_value_t* @gpu_gc_pool_alloc(i64 8)
  %15 = bitcast %jl_value_t* %14 to double*
  store double %4, double* %15, align 8
  call fastcc void @julia_InexactError_1371([3 x %jl_value_t*]* noalias nocapture nonnull sret %1, %jl_value_t* inttoptr (i64 140248073768816 to %jl_value_t*), %jl_value_t* readonly inttoptr (i64 140248168186208 to %jl_value_t*), %jl_value_t* readonly %14)
  call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception2 to i64))
  call fastcc void @gpu_signal_exception()
  call void asm sideeffect "exit;", ""() #4
  br label %L24

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

Compilation of Optimisers.jl pipeline

The setup for this is a bit involved, but I've tried to reduce the MWE somewhat. Hopefully the core issue does not depend on most of it; it is likely not linked to the specifics of the packages involved (mostly Flux, Zygote, Functors and Optimisers).

julia> using Flux, Zygote, Optimisers

julia> m = Chain(Conv((3,3), 3 => 4),
                 BatchNorm(4),
                 Flux.flatten);

julia> gm = gpu(m);

julia> gx = rand(Float32, 3, 3, 3, 2) |> gpu;

julia> gm̄, gx̄ = gradient(gm, gx) do m, x
         sum(m(x))
       end;

julia> opt = Optimisers.ADAM();

julia> gst = Optimisers.state(opt, gm);

julia> typeof(gst)
Tuple{NamedTuple{(:σ, :weight, :bias, :stride, :pad, :dilation), Tuple{Nothing, Tuple{CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 4}, Tuple{Float32, Float32}}, Tuple{CUDA.CuArray{Float32, 1}, CUDA.CuArray{Float32, 1}, Tuple{Float32, Float32}}, Tuple{Nothing, Nothing}, NTuple{4, Nothing}, Tuple{Nothing, Nothing}}}, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Tuple{CUDA.CuArray{Float32, 1}, CUDA.CuArray{Float32, 1}, Tuple{Float32, Float32}}, Tuple{CUDA.CuArray{Float32, 1}, CUDA.CuArray{Float32, 1}, Tuple{Float32, Float32}}, Tuple{CUDA.CuArray{Float32, 1}, CUDA.CuArray{Float32, 1}, Tuple{Float32, Float32}}, Tuple{CUDA.CuArray{Float32, 1}, CUDA.CuArray{Float32, 1}, Tuple{Float32, Float32}}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing}

julia> Optimisers.update(opt, gm, gm̄, gst)
ERROR: GPU compilation of kernel broadcast_kernel(CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 4, 1}, Base.Broadcast.Broadcasted{Nothing, NTuple{4, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(*), Tuple{Float32, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(*), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(one), Tuple{CUDA.CuRefValue{DataType}}}, Float32}}, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}}}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{Nothing, NTuple{4, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(*), Tuple{Float32, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(*), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(one), Tuple{CUDA.CuRefValue{DataType}}}, Float32}}, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}}}}, which is not isbits:
  .args is of type Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(*), Tuple{Float32, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(*), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(one), Tuple{CUDA.CuRefValue{DataType}}}, Float32}}, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}}} which is not isbits.
    .2 is of type Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(*), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(one), Tuple{CUDA.CuRefValue{DataType}}}, Float32}}, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}} which is not isbits.
      .args is of type Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(one), Tuple{CUDA.CuRefValue{DataType}}}, Float32}}, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}} which is not isbits.
         .1 is of type Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(-), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(one), Tuple{CUDA.CuRefValue{DataType}}}, Float32}} which is not isbits.
          .args is of type Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(one), Tuple{CUDA.CuRefValue{DataType}}}, Float32} which is not isbits.
            .1 is of type Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(one), Tuple{CUDA.CuRefValue{DataType}}} which is not isbits.
              .args is of type Tuple{CUDA.CuRefValue{DataType}} which is not isbits.
                .1 is of type CUDA.CuRefValue{DataType} which is not isbits.
                  .x is of type DataType which is not isbits.
                    .name is of type Core.TypeName which is not isbits.
                    .super is of type DataType which is not isbits.
                    .parameters is of type Core.SimpleVector which is not isbits.
                    .types is of type Core.SimpleVector which is not isbits.
                    .names is of type Core.SimpleVector which is not isbits.
                    .instance is of type Any which is not isbits.

This is slightly different from the error I used to get, which complained about Refs being passed. This one looks simpler: it seems that CuArrayStyle was not applied in certain places, or that it got mixed with CPU memory somehow. It is worth noting that some other optimisers fare better (Momentum works).

Of note is

julia> typeof(gm̄)
NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:σ, :weight, :bias, :stride, :pad, :dilation), Tuple{Nothing, CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}, Nothing, Nothing, Nothing}}, Base.RefValue{Any}, Nothing}}}

which contains the Ref in the gradient of the BatchNorm.
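The "which is not isbits" report above can be reproduced on the host with Base reflection alone. The sketch below is a simplified, hypothetical helper (`nonisbits_fields`), not GPUCompiler's own code; `GradWithRef` is an invented struct that mimics a gradient carrying a `Base.RefValue{Any}`.

```julia
# Hypothetical helper: list the fields that keep a type from being isbits.
# `maxdepth` guards against self-referential types.
function nonisbits_fields(T::Type; depth::Int=0, maxdepth::Int=5)
    lines = String[]
    # Abstract and union types have no definite field list to walk.
    (isconcretetype(T) && depth < maxdepth) || return lines
    for (name, FT) in zip(fieldnames(T), fieldtypes(T))
        isbitstype(FT) && continue
        push!(lines, "  "^depth * ".$name is of type $FT which is not isbits.")
        append!(lines, nonisbits_fields(FT; depth=depth + 1, maxdepth=maxdepth))
    end
    return lines
end

# Hypothetical gradient-like struct holding a Ref, as in the BatchNorm gradient.
struct GradWithRef
    scale::Float32
    state::Base.RefValue{Any}
end

report = nonisbits_fields(GradWithRef)
foreach(println, report)
```

Running a check like this on the argument types before launching a kernel can point at the offending field without going through GPU compilation.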

This currently depends on a modified Optimisers.jl#dg/state, which has a couple of updates to show the issue. I should push the latest changes there so it's easier to discuss.

cc @maleadt

Opening this anyway so there is an MWE to refer back to.

Method overrides cannot override kernels

```
julia> kernel() = 0
kernel (generic function with 1 method)

julia> GPUCompiler.@override method_table kernel() = 1

julia> native_code_llvm(kernel, Tuple{})
;  @ REPL[1]:1 within `kernel`
; Function Attrs: alwaysinline
define i64 @julia_kernel_683() local_unnamed_addr #0 {
top:
  ret i64 0
}
```

The reason for this is that we implement `Core.Compiler.method_table(interp::GPUInterpreter, sv::InferenceState)`, whereas for the outermost function `Core.Compiler.method_table(interp::GPUInterpreter)` seems to be called. But we can't implement that method, because doing so breaks InferenceState construction:

```
ERROR: LoadError: MethodError: no method matching convert(::Type{Core.Compiler.CachedMethodTable{Core.Compiler.InternalMethodTable}}, ::Core.Compiler.CachedMethodTable{Core.Compiler.OverlayMethodTable})
Closest candidates are:
  convert(::Type{Union{}}, ::Any) at essentials.jl:216
  convert(::Type{Any}, ::Any) at essentials.jl:217
  convert(::Type{T}, ::T) where T at essentials.jl:218
Stacktrace:
  [1] Core.Compiler.InferenceState(result::Core.Compiler.InferenceResult, src::Core.CodeInfo, cached::Bool, interp::GPUCompiler.GPUInterpreter)
    @ Core.Compiler ./compiler/inferencestate.jl:106
```

Better world age handling

There are currently two related problems with our world age handling:

  • we index the compilation cache directly with the world age (by taking the hash of the compiler job); instead we should intersect, and handle invalidation
  • when creating the FunctionSpec, we should have a way to pass the world age of the calling code, as opposed to the current global world age, which makes e.g. cufunction always look up the latest version of a potentially shadowed kernel (whereas it should be consistent with the world age of the caller); this probably requires a new intrinsic as both the current world age getter and the PTLS entry always point to the latest world:
     julia> worlds() = (ptls=Int(unsafe_load(convert(Ptr{Csize_t}, Core.getptls() + 8))), runtime=Int(Base.get_world_counter()))
     worlds (generic function with 1 method)
     julia> worlds()
     (ptls = 29611, runtime = 29611)
     julia> bar() = 42
     bar (generic function with 1 method)
     julia> worlds()
     (ptls = 29612, runtime = 29612)
     julia> Int(first(methods(worlds, Tuple{})).primary_world)
     29611
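The difference between "latest world" and "caller's world" in the second point can be seen on the host. This is a minimal sketch that assumes `Base.invoke_in_world`, an unexported Base function present in recent Julia versions; `answer` and `caller_world` are hypothetical names. It has to run at top level (a script or the REPL), because each top-level statement is evaluated in the then-latest world.

```julia
# `answer` is a hypothetical function used only for this illustration.
answer() = 1

# Remember the world in which only the first definition exists.
caller_world = Base.get_world_counter()

# Redefining the method bumps the world counter and shadows the old body.
answer() = 2

latest = answer()                                    # runs in the latest world
pinned = Base.invoke_in_world(caller_world, answer)  # runs in the recorded world
println((latest = latest, pinned = pinned))
```

A lookup that is pinned to the recorded world still sees the original method body, which is the kind of consistency with the caller that the second point asks for.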

Validator does not catch global host loads

julia> using CUDA

julia> const foo = [1, 2]
2-element Vector{Int64}:
 1
 2

julia> a = CUDA.zeros(Int, 1)
1-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
 0

julia> function kernel(a, i)
           @inbounds a[i] = foo[i]
           return
       end
kernel (generic function with 1 method)

julia> @device_code_llvm @cuda kernel(a, 1)
; PTX CompilerJob of kernel kernel(CuDeviceVector{Int64, 1}, Int64) for sm_75
;  @ REPL[5]:1 within `kernel`
define ptx_kernel void @_Z17julia_kernel_295113CuDeviceArrayI5Int64Li1ELi1EES0_({ i8 addrspace(1)*, i64, [1 x i64] } %0, i64 signext %1) local_unnamed_addr #0 {
entry:
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64] } %0, 0
;  @ REPL[5]:2 within `kernel`
; ┌ @ array.jl:835 within `getindex`
   %2 = add i64 %1, -1
   %3 = load i64*, i64** inttoptr (i64 139929053628784 to i64**), align 16
   %4 = getelementptr inbounds i64, i64* %3, i64 %2
   %5 = load i64, i64* %4, align 8
; └
; ┌ @ /home/tim/Julia/pkg/CUDA/src/device/array.jl:192 within `setindex!`
; │┌ @ /home/tim/Julia/pkg/CUDA/src/device/array.jl:153 within `arrayset`
; ││┌ @ /home/tim/Julia/pkg/CUDA/src/device/array.jl:162 within `arrayset_bits`
; │││┌ @ /home/tim/Julia/pkg/LLVM/src/interop/pointer.jl:84 within `unsafe_store!`
; ││││┌ @ /home/tim/Julia/pkg/LLVM/src/interop/pointer.jl:44 within `pointerset`
; │││││┌ @ /home/tim/Julia/pkg/LLVM/src/interop/pointer.jl:44 within `macro expansion` @ /home/tim/Julia/pkg/LLVM/src/interop/base.jl:39
        %6 = bitcast i8 addrspace(1)* %.fca.0.extract to i64 addrspace(1)*
        %7 = getelementptr inbounds i64, i64 addrspace(1)* %6, i64 %2
        store i64 %5, i64 addrspace(1)* %7, align 8
        ret void
; └└└└└└
}
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/Julia/pkg/CUDA/lib/cudadrv/error.jl:105
 [2] query
   @ ~/Julia/pkg/CUDA/lib/cudadrv/stream.jl:102 [inlined]
 [3] synchronize(stream::CuStream; blocking::Bool)
   @ CUDA ~/Julia/pkg/CUDA/lib/cudadrv/stream.jl:130
 [4] synchronize (repeats 2 times)
   @ ~/Julia/pkg/CUDA/lib/cudadrv/stream.jl:117 [inlined]
 [5] top-level scope
   @ ~/Julia/pkg/CUDA/src/initialization.jl:67

Ref JuliaGPU/CUDA.jl#1085

Validation uses wrong interpreter

Describe the bug

Some math operations in CUDA/src/device/intrinsics/math.jl give compile errors.

It seems that operations that replace ones in Base and SpecialFunctions work fine.
Those that do not have a definition elsewhere give a "KernelError: kernel returns a value of type Union{}."
Those that redefine operations in FastMath give "invalid LLVM IR Reason: unsupported dynamic function invocation."

To reproduce

The Minimal Working Example (MWE) for this bug:

using CUDA
a = CUDA.ones(10)
function kernel(a)
    i = threadIdx().x
    a[i] = CUDA.rsqrt(a[i])
    return nothing
end
@cuda threads=length(a) kernel(a)
--> ERROR: LoadError: GPU compilation of kernel kernel(CuDeviceVector{Float32, 1}) failed
KernelError: kernel returns a value of type `Union{}`
Manifest.toml

CUDA v3.0.0
GPUArrays v6.2.2
GPUCompiler v0.11.2
LLVM v3.6.0

Version info

Details on Julia:
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
JULIA_EDITOR = "C:\Program Files\Microsoft VS Code\Code.exe"
JULIA_NUM_THREADS = 6

Details on CUDA:
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.1.0
NVIDIA driver 456.71.0

Libraries:

  • CUBLAS: 11.3.0
  • CURAND: 10.2.2
  • CUFFT: 10.3.0
  • CUSOLVER: 11.0.1
  • CUSPARSE: 11.3.0
  • CUPTI: 14.0.0
  • NVML: 11.0.0+456.71
  • CUDNN: 8.10.0 (for CUDA 11.2.0)
  • CUTENSOR: 1.2.2 (for CUDA 11.1.0)

Toolchain:

  • Julia: 1.6.0
  • LLVM: 11.0.1
  • PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
  • Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
0: GeForce RTX 2070 (sm_75, 5.286 GiB / 8.000 GiB available)

Allow passing multiple signatures to codegen

This would allow codegen of multiple functions and the globals they reference into a single LLVM module. It would be useful for eBPF support (where globals can be meaningfully shared across kernels) and for static compilation of native Julia code (for generating native executables or shared libraries).

Both ExprTools and LLVM export "parameters"

With version 0.1.4 of ExprTools (see this commit) both ExprTools and LLVM export "parameters". This causes errors like the following when using CUDA.jl.

WARNING: both ExprTools and LLVM export "parameters"; uses of it in module CUDA must be qualified
cells: Error During Test at /home/lwilcox/builds/ay14VxBm/0/lwilcox/Bennu.jl/test/cells.jl:1
  Got exception outside of a @test
  UndefVarError: parameters not defined
  Stacktrace:
    [1] classify_arguments(job::GPUCompiler.CompilerJob, codegen_f::LLVM.Function)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/OyOEp/src/irgen.jl:298
    [2] lower_byval(job::GPUCompiler.CompilerJob, mod::LLVM.Module, entry_f::LLVM.Function)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/OyOEp/src/irgen.jl:368
    [3] process_entry!(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, P, F} where {P, F}, mod::LLVM.Module, entry::LLVM.Function)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/OyOEp/src/ptx.jl:103
    [4] irgen(job::GPUCompiler.CompilerJob, method_instance::Core.MethodInstance)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/OyOEp/src/irgen.jl:52
    [5] macro expansion
      @ ~/.julia/packages/GPUCompiler/OyOEp/src/driver.jl:149 [inlined]
    [6] macro expansion
      @ ~/.julia/packages/TimerOutputs/ULC3Q/src/TimerOutput.jl:237 [inlined]
    [7] macro expansion
      @ ~/.julia/packages/GPUCompiler/OyOEp/src/driver.jl:148 [inlined]
    [8] emit_llvm(job::GPUCompiler.CompilerJob, method_instance::Any; libraries::Bool, deferred_codegen::Bool, optimize::Bool, only_entry::Bool)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/OyOEp/src/utils.jl:62
    [9] emit_llvm
      @ ~/.julia/packages/GPUCompiler/OyOEp/src/utils.jl:60 [inlined]
   [10] macro expansion
      @ ~/.julia/packages/TimerOutputs/ULC3Q/src/TimerOutput.jl:221 [inlined]
   [11] macro expansion
      @ ~/.julia/packages/CUDA/Ozu5O/src/compiler/execution.jl:316 [inlined]
   [12] cufunction_compile(job::GPUCompiler.CompilerJob)
      @ CUDA ~/.julia/packages/TimerOutputs/ULC3Q/src/TimerOutput.jl:221
   [13] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
      @ GPUCompiler ~/.julia/packages/GPUCompiler/OyOEp/src/cache.jl:89
   [14] macro expansion
      @ ~/.julia/packages/CUDA/Ozu5O/src/compiler/execution.jl:288 [inlined]
   [15] cufunction(f::GPUArrays.var"#broadcast_kernel#16", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceMatrix{SVector{2, Float32}, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, CUDA.var"#282#283"{SVector{S, T} where {S, T}}, Tuple{Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ CUDA ~/.julia/packages/TimerOutputs/ULC3Q/src/TimerOutput.jl:221
   [16] cufunction
      @ ~/.julia/packages/TimerOutputs/ULC3Q/src/TimerOutput.jl:214 [inlined]
   [17] macro expansion
      @ ~/.julia/packages/CUDA/Ozu5O/src/compiler/execution.jl:102 [inlined]
   [18] #launch_heuristic#238
      @ ~/.julia/packages/CUDA/Ozu5O/src/gpuarrays.jl:17 [inlined]
   [19] launch_heuristic
      @ ~/.julia/packages/CUDA/Ozu5O/src/gpuarrays.jl:17 [inlined]
   [20] copyto!
      @ ~/.julia/packages/GPUArrays/8dzSJ/src/host/broadcast.jl:63 [inlined]
   [21] copyto!
      @ ./broadcast.jl:936 [inlined]
   [22] copy
      @ ~/.julia/packages/GPUArrays/8dzSJ/src/host/broadcast.jl:47 [inlined]
   [23] materialize(bc::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, CUDA.var"#282#283"{SVector{S, T} where {S, T}}, Tuple{CuArray{Float32, 2}, CuArray{Float32, 2}}})
      @ Base.Broadcast ./broadcast.jl:883
   [24] (LobattoCell{Float32, CuArray, S, N, O, P, D, M, FM, E, C} where {S, N, O, P, D, M, FM, E, C})(::Int64, ::Vararg{Int64, N} where N)
      @ Bennu ~/builds/ay14VxBm/0/lwilcox/Bennu.jl/src/cells.jl:52
   [25] macro expansion
      @ ~/builds/ay14VxBm/0/lwilcox/Bennu.jl/test/cells.jl:10 [inlined]
   [26] macro expansion
      @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1151 [inlined]
   [27] top-level scope
      @ ~/builds/ay14VxBm/0/lwilcox/Bennu.jl/test/cells.jl:2
   [28] include(fname::String)
      @ Base.MainInclude ./client.jl:444
   [29] top-level scope
      @ ~/builds/ay14VxBm/0/lwilcox/Bennu.jl/test/runtests.jl:25
   [30] include(fname::String)
      @ Base.MainInclude ./client.jl:444
   [31] top-level scope
      @ none:6
   [32] eval
      @ ./boot.jl:360 [inlined]
   [33] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:261
   [34] _start()
      @ Base ./client.jl:485
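
The warning at the top of the log is Julia's usual behaviour when two modules loaded with `using` export the same name: the unqualified name can no longer be resolved and has to be qualified. A minimal sketch with two toy modules (`PkgA`, `PkgB` and `Consumer` are hypothetical stand-ins, not the real ExprTools, LLVM or CUDA packages) reproduces the failure mode and shows a qualified call that avoids it:

```julia
# Two toy modules exporting the same name (stand-ins for the real packages).
module PkgA
export parameters
parameters(x) = "PkgA.parameters($x)"
end

module PkgB
export parameters
parameters(x) = "PkgB.parameters($x)"
end

module Consumer
using ..PkgA, ..PkgB

# Unqualified use is ambiguous; the lookup fails when the function is called.
ambiguous() = parameters(1)

# Qualifying the name selects one binding explicitly.
qualified() = PkgB.parameters(1)
end

# Record whether the ambiguous call fails with an UndefVarError.
failed = try
    Consumer.ambiguous()
    false
catch err
    err isa UndefVarError
end

println("unqualified call failed: ", failed)
println(Consumer.qualified())
```

This only illustrates the mechanism; where the qualification belongs in the affected packages is not something the log alone settles.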

PTX: byval results in local memory usage

I am trying to migrate from CUDAdrv.jl (on Julia 1.3.1) to CUDA.jl, but performance issues are still blocking me. Here is an (artificial) example that I managed to boil the problem down to. The test code in add_kernel.jl takes an array of 2D arrays and adds it to another array of 2D arrays:

const threads = 256

#simple add matrixes kernel
function kernel_add_mat(n, x1, x2, y)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds y[i] = x1[i] + x2[i]
    end
    return
end

@inline get_inputs3(indx_y, a, b, c)                                        = (a, b, c)
@inline get_inputs3(indx_y, a1, a2, b1, b2, c1, c2)                         = indx_y == 1 ? (a1, b1, c1) : (a2, b2, c2)
@inline get_inputs3(indx_y, a1, a2, a3, b1, b2, b3, c1, c2, c3)             = indx_y == 1 ? (a1, b1, c1) : indx_y == 2 ? (a2, b2, c2) : (a3, b3, c3)

#add arrays of matrixes kernel
function kernel_add_mat_z_slices(n, vararg...)
    x1, x2, y = get_inputs3(blockIdx().y, vararg...)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds y[i] = x1[i] + x2[i]
    end
    return
end

function add_z_slices!(y, x1, x2)
    m1, n1 = size(x1[1]) #get size of first slice
    blocks = (m1 * n1 + threads - 1) ÷ threads
    #get length(x1) more blocks than needed to process 1 slice
    @cuda blocks = blocks, length(x1) threads = threads kernel_add_mat_z_slices(m1 * n1, x1..., x2..., y...)
end

function add!(y, x1, x2)
    m1, n1 = size(x1)
    blocks = (m1 * n1 + threads - 1) ÷ threads
    @cuda blocks = blocks, 1          threads = threads kernel_add_mat(m1 * n1, x1, x2, y)
end

num_z_slices = 3
Random.seed!(1)

#m, n = 7, 5          # tiny to measure overhead
#m, n = 521, 111
#m, n = 1521, 1111
#m, n = 3001, 1511    # prime numbers to test memory access correctness
m, n = 3072, 1536    # 256 multiplier
#m, n = 6007, 3001    # prime numbers to test memory access correctness
    
x1 = [cu(randn(Float32, (m, n)) .+ Float32(0.5)) for i = 1:num_z_slices]
x2 = [cu(randn(Float32, (m, n)) .+ Float32(0.5)) for i = 1:num_z_slices]
y1 = [similar(x1[1]) for i = 1:num_z_slices]

#reference down to bones add on GPU
add!(y1[1], x1[1], x2[1])
print("add!                       "); 
@btime begin add!($y1[1], $x1[1], $x2[1]); synchronize() end

#adding arrays in an array
for slices = 1:num_z_slices
    add_z_slices!(y1[1:slices], x1[1:slices], x2[1:slices])
    print("add_z_slices!, slices = $slices  "); 
    @btime begin add_z_slices!($y1[1:$slices], $x1[1:$slices], $x2[1:$slices]); synchronize() end
end
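
Both launch helpers above size the grid with the integer ceiling-division idiom `(len + threads - 1) ÷ threads`. As a small check that does not need a GPU, the sketch below compares that idiom with Base's `cld` for the matrix sizes listed in the script (`nblocks` and `sizes` are names introduced here for illustration only):

```julia
const threads = 256

# Number of blocks needed to cover `len` elements, computed as in the issue.
nblocks(len) = (len + threads - 1) ÷ threads

# Matrix sizes taken from the (partly commented-out) options in the script.
sizes = [(7, 5), (521, 111), (1521, 1111), (3001, 1511), (3072, 1536), (6007, 3001)]

for (m, n) in sizes
    len = m * n
    # `cld` is Base's ceiling division; both forms should agree.
    @assert nblocks(len) == cld(len, threads)
    println("m=$m n=$n -> $(nblocks(len)) blocks of $threads threads")
end
```

For the active setting `m, n = 3072, 1536` the element count is an exact multiple of 256, so no partially filled block is launched in that configuration.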

In Julia 1.3.1 with CUDAdrv (Windows 10, GTX 2070, a single card), I get the following numbers:

add!                         167.600 μs (35 allocations: 1.11 KiB)
add_z_slices!, slices = 1    171.101 μs (75 allocations: 2.97 KiB)
add_z_slices!, slices = 2    313.199 μs (93 allocations: 4.09 KiB)
add_z_slices!, slices = 3    456.600 μs (111 allocations: 5.41 KiB) 

There is a slight overhead from the z-slicing code in the single-slice case, and then the time grows roughly linearly with the number of z slices. However, with Julia 1.5.2 and CUDA.jl I get the following numbers:

add!                         167.100 μs (14 allocations: 400 bytes)
add_z_slices!, slices = 1    169.800 μs (56 allocations: 2.33 KiB)
add_z_slices!, slices = 2    4.536 ms (68 allocations: 3.02 KiB)
add_z_slices!, slices = 3    2.435 ms (80 allocations: 3.83 KiB)

That is, the time grows unreasonably for 2 and 3 slices, while the simple add kernel and the single-slice case perform the same as before. This looks like an issue in CUDA.jl.

Here is how I set up the Julia 1.3.1 environment:

using Pkg
Pkg.activate("CUDAdrv")
pkg_ref = Dict("BenchmarkTools"  => v"0.5.0",
               "CUDAdrv"         => v"5.0.1",
               "CUDAnative"      => v"2.7.0",
               "CuArrays"        => v"1.6.0",
               )

for pkg in keys(pkg_ref)
    println("install $pkg => ", pkg_ref[pkg])
    Pkg.add(PackageSpec(name=pkg, version=pkg_ref[pkg]))
end

And this is how I set up the Julia 1.5.2 environment:

using Pkg
Pkg.activate("Cuda")

pkg_ref = Dict("BenchmarkTools"  => v"0.5.0",
               "CUDA"         => v"2.0.0"
               )

for pkg in keys(pkg_ref)
    println("install $pkg => ", pkg_ref[pkg])
    Pkg.add(PackageSpec(name=pkg, version=pkg_ref[pkg]))
end

This is the code that calls the kernels on 1.3.1, add_kernel_1.3.1.jl:

using Pkg
Pkg.activate("CUDAdrv")

using BenchmarkTools, Printf, Random
using CUDAdrv: CuDevice, CuContext, DeviceSet, CuDefaultStream, synchronize, unsafe_destroy!, CUctx_flags, devices

use_dev = 0
dev = CuDevice(use_dev)
ctx = CuContext(dev)
using CUDAnative
CUDAnative.device!(use_dev)
using CuArrays
include("add_kernel.jl")

This is the code that calls the kernels on 1.5.2, add_kernel_1.5.2.jl:

using Pkg
Pkg.activate("Cuda")
using BenchmarkTools, Printf, Random, CUDA
include("add_kernel.jl")

And I call them like this:
C:\Bin\Julia-1.3.1\bin\julia.exe add_kernel_1.3.1.jl
C:\Bin\Julia 1.5.2\bin\julia.exe add_kernel_1.5.2.jl

Make exception verbosity configurable?

Currently, exception reporting is based on Julia's debug level. On the host, however, stack frames are shown regardless of the debug level (it would be silly not to get stack traces just because you aren't using julia-debug). Could we make verbose exceptions (with stack traces) configurable through some job parameter?

Duplicate metadata

!13 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!14 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!15 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !12, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!16 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!17 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!18 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !12, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!19 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!20 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !12, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!21 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!22 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!23 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!24 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!25 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!26 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!27 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!28 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!29 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!30 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!31 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!32 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!33 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!34 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!35 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !8, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)
!36 = distinct !DICompileUnit(language: DW_LANG_Julia, file: !12, producer: "julia", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly, enums: !4, nameTableKind: None)

Make Julia's optimization passes configurable

The AMDGPU target's default datalayout causes issues with the LLVM verifier on debug builds because the verifier is run so early in the optimization pipeline. We can fix this by allowing backends to override Julia's optimization pass pipeline and reorder the individual passes to better suit the backend's needs.

Non-integral address-spaces in datalayout

@wsmoses just found a miscompilation in Enzyme.jl caused by the fact that GPUCompiler (unless opted out) sets the datalayout of the module to the basic machine one:

GPUCompiler.jl/src/jlgen.jl

Lines 247 to 249 in 13d08b2

if llvm_datalayout(job.target) !== nothing
    datalayout!(llvm_mod, llvm_datalayout(job.target))
end

IIUC the GCN/SPIR-V/PTX backends don't add -ni:10:11:12:13 to the datalayout and might be similarly prone to miscompilation (less likely, since we don't have code that allocates as much). IIRC SCEV introduced an inttoptr, and the GC verifier was immensely unhappy about that.
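
To make the missing component concrete, here is a small string-level sketch of adding the non-integral address-space spec to a datalayout. It is only an illustration under assumptions: `with_nonintegral` and `JULIA_NI_SPEC` are hypothetical names, the example layout string is made up for demonstration, and whether appending the spec is the right fix for each backend is not settled in this issue.

```julia
# Non-integral address spaces as quoted in the issue (`-ni:10:11:12:13`).
const JULIA_NI_SPEC = "ni:10:11:12:13"

# Hypothetical helper: append the non-integral spec unless the layout
# string already carries an `ni:` component.
function with_nonintegral(layout::AbstractString)
    components = split(layout, '-')
    any(c -> startswith(c, "ni:"), components) && return String(layout)
    return string(layout, "-", JULIA_NI_SPEC)
end

# Illustrative layout string only; not taken from any particular backend.
example = "e-i64:64-v16:16-v32:32-n16:32:64"
println(with_nonintegral(example))

# Applying the helper twice does not duplicate the component.
println(with_nonintegral(with_nonintegral(example)))
```

The helper works purely on strings so that it can be read without LLVM.jl; how the result would be handed back to the compiler job is left open here.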

Int128 codegen

We currently sometimes emit i128:

julia> function kernel(A)
       @inbounds A[1, 2] += 1
       return
       end
kernel (generic function with 1 method)

julia> @device_code_llvm @oneapi kernel(reshape(oneArray{Float32}(undef, (1,1))', 1, 1))
; CompilerJob of kernel kernel(Base.ReshapedArray{Float32,2,LinearAlgebra.Adjoint{Float32,oneDeviceArray{Float32,2,1}},Tuple{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64}}}) for GPUCompiler.SPIRVCompilerTarget
define spir_kernel void @_Z18julia_kernel_1195413ReshapedArrayI7Float32Li2E7AdjointIS0_14oneDeviceArrayIS0_Li2ELi1EEE5TupleI27SignedMultiplicativeInverseI5Int64EEE({ { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } }* byval) local_unnamed_addr {
entry:
;  @ REPL[82]:2 within `kernel'
; ┌ @ reshapedarray.jl:234 within `getindex'
; │┌ @ reshapedarray.jl:244 within `_unsafe_getindex'
; ││┌ @ abstractarray.jl:1894 within `_sub2ind'
; │││┌ @ abstractarray.jl:1910 within `_sub2ind_recurse'
; ││││┌ @ tuple.jl:24 within `getindex'
       %1 = getelementptr inbounds { { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } }, { { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } }* %0, i64 0, i32 0, i32 1, i64 0
; ││││└
; ││││ @ abstractarray.jl:1911 within `_sub2ind_recurse'
; ││││┌ @ abstractarray.jl:1914 within `nextL'
; │││││┌ @ int.jl:87 within `*'
        %2 = load i64, i64* %1, align 8
; ││└└└└
; ││ @ reshapedarray.jl:245 within `_unsafe_getindex'
; ││┌ @ reshapedarray.jl:218 within `ind2sub_rs'
; │││┌ @ reshapedarray.jl:221 within `_ind2sub_rs'
; ││││┌ @ multinverses.jl:152 within `divrem'
; │││││┌ @ multinverses.jl:138 within `div'
; ││││││┌ @ operators.jl:818 within `widen'
; │││││││┌ @ number.jl:7 within `convert'
; ││││││││┌ @ boot.jl:709 within `Int128'
; │││││││││┌ @ boot.jl:639 within `toInt128'
            %3 = sext i64 %2 to i128
; ││││││└└└└
; ││││││┌ @ Base.jl:33 within `getproperty'
         %4 = getelementptr inbounds { { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } }, { { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } }* %0, i64 0, i32 0, i32 2, i64 0, i32 1
; ││││││└
; ││││││┌ @ int.jl:920 within `*'
; │││││││┌ @ int.jl:476 within `rem'
          %5 = load i64, i64* %4, align 8
          %6 = sext i64 %5 to i128
; │││││││└
; │││││││ @ int.jl:922 within `*' @ int.jl:907
         %7 = mul nsw i128 %6, %3
; ││││││└
; ││││││┌ @ int.jl:465 within `>>>' @ int.jl:457
         %8 = lshr i128 %7, 64
; ││││││└
; ││││││┌ @ int.jl:471 within `rem'
         %9 = trunc i128 %8 to i64
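
According to the inline debug annotations, the `i128` values come from the `widen` call inside `div` for `SignedMultiplicativeInverse` (multinverses.jl), which computes the upper 64 bits of a 64x64-bit product. The following host-side sketch reproduces that pattern in isolation; `mulhi` is a name introduced here for illustration, and the printed IR will depend on the Julia version:

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# High 64 bits of the signed 128-bit product, mirroring the
# sext / mul i128 / lshr 64 / trunc sequence in the IR above.
mulhi(a::Int64, b::Int64) = ((widen(a) * widen(b)) >>> 64) % Int64

# 2^62 * 4 == 2^64, so the upper half of the product is 1.
@assert mulhi(Int64(2)^62, Int64(4)) == 1
# A product that fits in 64 bits (and is non-negative) has an upper half of 0.
@assert mulhi(Int64(3), Int64(5)) == 0
# A small negative product has an upper half of all ones, i.e. -1.
@assert mulhi(Int64(-1), Int64(1)) == -1

# Print the host LLVM IR for inspection.
code_llvm(stdout, mulhi, Tuple{Int64, Int64})
```

Any kernel that indexes through a reshaped array with a multiplicative inverse, as in the example above, can reach this code path.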

This is OK for platforms that know how to handle this, e.g., PTX decomposes the i128 into two i64's:

*** IR Dump After NVPTX specific alloca hoisting ***
define dso_local ptx_kernel void @_Z17julia_kernel_387613ReshapedArrayI7Float32Li2E7AdjointIS0_13CuDeviceArrayIS0_Li2ELi1EEE5TupleI27SignedMultiplicativeInverseI5Int64EEE({ [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] }* nocapture nonnull readonly byval dereferenceable(64) %0) local_unnamed_addr {
top:
  %1 = addrspacecast { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] }* %0 to { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } addrspace(101)*
  %2 = load { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] }, { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } addrspace(101)* %1
  %.fca.0.0.0.0.extract = extractvalue { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } %2, 0, 0, 0, 0
  %.fca.0.0.1.extract = extractvalue { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } %2, 0, 0, 1
  %.fca.1.0.extract = extractvalue { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } %2, 1, 0
  %.fca.2.0.0.extract = extractvalue { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } %2, 2, 0, 0
  %.fca.2.0.1.extract = extractvalue { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } %2, 2, 0, 1
  %.fca.2.0.2.extract = extractvalue { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } %2, 2, 0, 2
  %.fca.2.0.3.extract = extractvalue { [1 x { [2 x i64], i8 addrspace(1)* }], [2 x i64], [1 x { i64, i64, i8, i8 }] } %2, 2, 0, 3
  %3 = sext i64 %.fca.1.0.extract to i128
  %4 = sext i64 %.fca.2.0.1.extract to i128
  %5 = mul nsw i128 %4, %3
  %6 = lshr i128 %5, 64
  %7 = trunc i128 %6 to i64
  %8 = sext i8 %.fca.2.0.2.extract to i64
  %9 = mul i64 %.fca.1.0.extract, %8
  %10 = add i64 %9, %7
  %11 = icmp slt i64 %.fca.2.0.0.extract, 0
  %12 = sub i64 0, %.fca.2.0.0.extract
  %13 = select i1 %11, i64 %12, i64 %.fca.2.0.0.extract
  %14 = icmp eq i64 %13, 1
  %15 = mul i64 %.fca.2.0.0.extract, %.fca.1.0.extract
  %16 = icmp ult i8 %.fca.2.0.3.extract, 63
  %narrow = select i1 %16, i8 %.fca.2.0.3.extract, i8 63
  %.v = zext i8 %narrow to i64
  %17 = ashr i64 %10, %.v
  %.lobit = lshr i64 %10, 63
  %18 = add i64 %17, %.lobit
  %19 = select i1 %14, i64 %15, i64 %18
  %20 = mul i64 %19, %.fca.2.0.0.extract
  %21 = sub i64 %.fca.1.0.extract, %20
  %22 = icmp sgt i64 %.fca.0.0.0.0.extract, 0
  %23 = select i1 %22, i64 %.fca.0.0.0.0.extract, i64 0
  %24 = mul i64 %21, %23
  %25 = add i64 %24, %19
  %26 = bitcast i8 addrspace(1)* %.fca.0.0.1.extract to float addrspace(1)*
  %27 = getelementptr inbounds float, float addrspace(1)* %26, i64 %25
  %28 = load float, float addrspace(1)* %27, align 4
  %29 = fadd float %28, 1.000000e+00
  store float %29, float addrspace(1)* %27, align 4
  ret void
}
# *** IR Dump After NVPTX DAG->DAG Pattern Instruction Selection ***:
# Machine code for function _Z17julia_kernel_387613ReshapedArrayI7Float32Li2E7AdjointIS0_13CuDeviceArrayIS0_Li2ELi1EEE5TupleI27SignedMultiplicativeInverseI5Int64EEE: IsSSA, TracksLiveness

bb.0.top:
  %0:int64regs = LD_i64_asi 0, 4, 1, 0, 64, &_Z17julia_kernel_387613ReshapedArrayI7Float32Li2E7AdjointIS0_13CuDeviceArrayIS0_Li2ELi1EEE5TupleI27SignedMultiplicativeInverseI5Int64EEE_param_0, 16 :: (dereferenceable load 8 from %ir.1 + 16, addrspace 101)
  %1:int64regs = LD_i64_avar 0, 4, 1, 0, 64, &_Z17julia_kernel_387613ReshapedArrayI7Float32Li2E7AdjointIS0_13CuDeviceArrayIS0_Li2ELi1EEE5TupleI27SignedMultiplicativeInverseI5Int64EEE_param_0 :: (dereferenceable load 8 from %ir.1, addrspace 101)
  %2:int16regs = LD_i16_asi 0, 4, 1, 0, 8, &_Z17julia_kernel_387613ReshapedArrayI7Float32Li2E7AdjointIS0_13CuDeviceArrayIS0_Li2ELi1EEE5TupleI27SignedMultiplicativeInverseI5Int64EEE_param_0, 57 :: (dereferenceable load 1 from %ir.1 + 57, addrspace 101)
  %3:int64regs = LD_i64_asi 0, 4, 1, 0, 64, &_Z17julia_kernel_387613ReshapedArrayI7Float32Li2E7AdjointIS0_13CuDeviceArrayIS0_Li2ELi1EEE5TupleI27SignedMultiplicativeInverseI5Int64EEE_param_0, 40 :: (dereferenceable load 8 from %ir.1 + 40, addrspace 101)
  %4:int64regs = LD_i64_asi 0, 4, 1, 1, 8, &_Z17julia_kernel_387613ReshapedArrayI7Float32Li2E7AdjointIS0_13CuDeviceArrayIS0_Li2ELi1EEE5TupleI27SignedMultiplicativeInverseI5Int64EEE_param_0, 56 :: (dereferenceable load 1 from %ir.1 + 56, addrspace 101)
  %5:int64regs = LD_i64_asi 0, 4, 1, 0, 64, &_Z17julia_kernel_387613ReshapedArrayI7Float32Li2E7AdjointIS0_13CuDeviceArrayIS0_Li2ELi1EEE5TupleI27SignedMultiplicativeInverseI5Int64EEE_param_0, 24 :: (dereferenceable load 8 from %ir.1 + 24, addrspace 101)
  %6:int64regs = LD_i64_asi 0, 4, 1, 0, 64, &_Z17julia_kernel_387613ReshapedArrayI7Float32Li2E7AdjointIS0_13CuDeviceArrayIS0_Li2ELi1EEE5TupleI27SignedMultiplicativeInverseI5Int64EEE_param_0, 48 :: (dereferenceable load 8 from %ir.1 + 48, addrspace 101)
  %7:int64regs = MULTHSi64rr killed %6:int64regs, %5:int64regs
  %8:int64regs = MULTi64rr %5:int64regs, killed %4:int64regs
  %9:int64regs = ADDi64rr killed %8:int64regs, killed %7:int64regs
  %10:int64regs = anonymous_2233 %3:int64regs
  %11:int1regs = SETP_s64ri killed %10:int64regs, 1, 0
  %12:int64regs = MULTi64rr %3:int64regs, %5:int64regs
  %13:int16regs = UMINi16ri killed %2:int16regs, 63
  %14:int32regs = CVT_u32_u16 killed %13:int16regs, 0
  %15:int64regs = SRAi64rr %9:int64regs, killed %14:int32regs
  %16:int64regs = SRLi64ri %9:int64regs, 63
  %17:int64regs = ADDi64rr killed %15:int64regs, killed %16:int64regs
  %18:int64regs = SELP_b64rr killed %12:int64regs, killed %17:int64regs, killed %11:int1regs
  %19:int64regs = MULTi64rr %18:int64regs, %3:int64regs
  %20:int64regs = SUBi64rr %5:int64regs, killed %19:int64regs
  %21:int64regs = SMAXi64ri killed %1:int64regs, 0
  %22:int64regs = MULTi64rr killed %20:int64regs, killed %21:int64regs
  %23:int64regs = ADDi64rr killed %22:int64regs, %18:int64regs
  %24:int64regs = SHLi64ri killed %23:int64regs, 2
  %25:int64regs = ADDi64rr killed %0:int64regs, killed %24:int64regs
  %26:float32regs = LD_f32_areg_64 0, 1, 1, 2, 32, %25:int64regs :: (load 4 from %ir.27, addrspace 1)
  %27:float32regs = FADD_rnf32ri killed %26:float32regs, float 1.000000e+00
  ST_f32_areg_64 killed %27:float32regs, 0, 1, 1, 2, 32, %25:int64regs :: (store 4 into %ir.27, addrspace 1)
  Return

On SPIR-V, not so much:

       %u128 = OpTypeInt 128 0
...
         %31 = OpLoad %ulong %30 Aligned 8
         %32 = OpSConvert %u128 %31
         %33 = OpIMul %u128 %32 %28
         %35 = OpShiftRightLogical %u128 %33 %u128_

This is not a legal type:

┌ Error: Module compilation failed:
│ 
│ error: undefined reference to `__builtin_spirv_OpSConvert_i128_i64()'
│ undefined reference to `__builtin_spirv_OpUConvert_i64_i128()'
│ 
│ error: backend compiler failed build.
└ @ oneAPI.oneL0 ~/Julia/pkg/oneAPI/lib/level-zero/module.jl:49
ERROR: ZeError: error occurred when building module, see build log for details (code 1879048196, ZE_RESULT_ERROR_MODULE_BUILD_FAILURE)

Thread local storage is not implemented error message

I have a fairly complicated kernel that uses AdvancedHMC and KernelAbstractions. Executing the kernel is giving me this confusing error:

Thread local storage is not implemented
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] lower_ptls!(::LLVM.Module) at /home/seth/.julia/packages/GPUCompiler/GKp4B/src/optim.jl:152
 [3] (::LLVM.var"#callback#118"{typeof(GPUCompiler.lower_ptls!)})(::Ptr{Nothing}) at /home/seth/.julia/packages/LLVM/T8ZBA/src/pass.jl:22
 [4] macro expansion at /home/seth/.julia/packages/LLVM/T8ZBA/src/util.jl:109 [inlined]
 [5] LLVMRunPassManager at /home/seth/.julia/packages/LLVM/T8ZBA/lib/libLLVM_h.jl:2813 [inlined]
 [6] run! at /home/seth/.julia/packages/LLVM/T8ZBA/src/passmanager.jl:36 [inlined]
 [7] (::GPUCompiler.var"#52#57"{LLVM.Module,GPUCompiler.var"#initialize!#55"{LLVM.Module,LLVM.TargetMachine}})(::LLVM.ModulePassManager) at /home/seth/.julia/packages/GPUCompiler/GKp4B/src/optim.jl:40
 [8] LLVM.ModulePassManager(::GPUCompiler.var"#52#57"{LLVM.Module,GPUCompiler.var"#initialize!#55"{LLVM.Module,LLVM.TargetMachine}}) at /home/seth/.julia/packages/LLVM/T8ZBA/src/passmanager.jl:30
 [9] optimize!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}, ::LLVM.Module) at /home/seth/.julia/packages/GPUCompiler/GKp4B/src/optim.jl:27
 [10] macro expansion at /home/seth/.julia/packages/TimerOutputs/dVnaw/src/TimerOutput.jl:206 [inlined]
 [11] macro expansion at /home/seth/.julia/packages/GPUCompiler/GKp4B/src/driver.jl:111 [inlined]
 [12] macro expansion at /home/seth/.julia/packages/TimerOutputs/dVnaw/src/TimerOutput.jl:206 [inlined]
 [13] codegen(::Symbol, ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/seth/.julia/packages/GPUCompiler/GKp4B/src/driver.jl:100
 [14] compile(::Symbol, ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/seth/.julia/packages/GPUCompiler/GKp4B/src/driver.jl:39
 [15] compile at /home/seth/.julia/packages/GPUCompiler/GKp4B/src/driver.jl:35 [inlined]
 [16] _cufunction(::GPUCompiler.FunctionSpec{typeof(Cassette.overdub),Tuple{Cassette.Context{nametype(CUDACtx),KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize,KernelAbstractions.NDIteration.DynamicCheck,Nothing,CartesianIndices{1,Tuple{Base.OneTo{Int64}}},KernelAbstractions.NDIteration.NDRange{1,KernelAbstractions.NDIteration.DynamicSize,KernelAbstractions.NDIteration.StaticSize{(256,)},CartesianIndices{1,Tuple{Base.OneTo{Int64}}},Nothing}},Nothing,KernelAbstractions.var"##PassType#253",Nothing,Cassette.DisableHooks},typeof(gpu_transition_kernel!),AdvancedHMC.PhasePoint{CuArray{Float64,2},AdvancedHMC.DualValue{CuArray{Float64,1},CuArray{Float64,2}}},MersenneTwister,Array{UInt64,1},StaticTrajectory{MultinomialTS,Leapfrog{Float64}},Hamiltonian{DiagEuclideanMetric{Float64,CuArray{Float64,2}},typeof(lpvec),typeof(∇lpvec)},typeof(lp),typeof(∇lp),AdvancedHMC.PhasePoint{CuArray{Float64,2},AdvancedHMC.DualValue{CuArray{Float64,1},CuArray{Float64,2}}}}}; kwargs::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:maxthreads,),Tuple{Int64}}}) at /home/seth/.julia/packages/CUDA/dZvbp/src/compiler/execution.jl:310
 [17] check_cache(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{typeof(Cassette.overdub),Tuple{Cassette.Context{nametype(CUDACtx),KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize,KernelAbstractions.NDIteration.DynamicCheck,Nothing,CartesianIndices{1,Tuple{Base.OneTo{Int64}}},KernelAbstractions.NDIteration.NDRange{1,KernelAbstractions.NDIteration.DynamicSize,KernelAbstractions.NDIteration.StaticSize{(256,)},CartesianIndices{1,Tuple{Base.OneTo{Int64}}},Nothing}},Nothing,KernelAbstractions.var"##PassType#253",Nothing,Cassette.DisableHooks},typeof(gpu_transition_kernel!),AdvancedHMC.PhasePoint{CuArray{Float64,2},AdvancedHMC.DualValue{CuArray{Float64,1},CuArray{Float64,2}}},MersenneTwister,Array{UInt64,1},StaticTrajectory{MultinomialTS,Leapfrog{Float64}},Hamiltonian{DiagEuclideanMetric{Float64,CuArray{Float64,2}},typeof(lpvec),typeof(∇lpvec)},typeof(lp),typeof(∇lp),AdvancedHMC.PhasePoint{CuArray{Float64,2},AdvancedHMC.DualValue{CuArray{Float64,1},CuArray{Float64,2}}}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:maxthreads,),Tuple{Int64}}}) at /home/seth/.julia/packages/GPUCompiler/GKp4B/src/cache.jl:24
 [18] gpu_transition_kernel! at ./none:0 [inlined]
 [19] cached_compilation(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{typeof(Cassette.overdub),Tuple{Cassette.Context{nametype(CUDACtx),KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize,KernelAbstractions.NDIteration.DynamicCheck,Nothing,CartesianIndices{1,Tuple{Base.OneTo{Int64}}},KernelAbstractions.NDIteration.NDRange{1,KernelAbstractions.NDIteration.DynamicSize,KernelAbstractions.NDIteration.StaticSize{(256,)},CartesianIndices{1,Tuple{Base.OneTo{Int64}}},Nothing}},Nothing,KernelAbstractions.var"##PassType#253",Nothing,Cassette.DisableHooks},typeof(gpu_transition_kernel!),AdvancedHMC.PhasePoint{CuArray{Float64,2},AdvancedHMC.DualValue{CuArray{Float64,1},CuArray{Float64,2}}},MersenneTwister,Array{UInt64,1},StaticTrajectory{MultinomialTS,Leapfrog{Float64}},Hamiltonian{DiagEuclideanMetric{Float64,CuArray{Float64,2}},typeof(lpvec),typeof(∇lpvec)},typeof(lp),typeof(∇lp),AdvancedHMC.PhasePoint{CuArray{Float64,2},AdvancedHMC.DualValue{CuArray{Float64,1},CuArray{Float64,2}}}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:maxthreads,),Tuple{Int64}}}) at /home/seth/.julia/packages/GPUCompiler/GKp4B/src/cache.jl:0
 [20] cufunction(::typeof(Cassette.overdub), ::Type{Tuple{Cassette.Context{nametype(CUDACtx),KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize,KernelAbstractions.NDIteration.DynamicCheck,Nothing,CartesianIndices{1,Tuple{Base.OneTo{Int64}}},KernelAbstractions.NDIteration.NDRange{1,KernelAbstractions.NDIteration.DynamicSize,KernelAbstractions.NDIteration.StaticSize{(256,)},CartesianIndices{1,Tuple{Base.OneTo{Int64}}},Nothing}},Nothing,KernelAbstractions.var"##PassType#253",Nothing,Cassette.DisableHooks},typeof(gpu_transition_kernel!),AdvancedHMC.PhasePoint{CuArray{Float64,2},AdvancedHMC.DualValue{CuArray{Float64,1},CuArray{Float64,2}}},MersenneTwister,Array{UInt64,1},StaticTrajectory{MultinomialTS,Leapfrog{Float64}},Hamiltonian{DiagEuclideanMetric{Float64,CuArray{Float64,2}},typeof(lpvec),typeof(∇lpvec)},typeof(lp),typeof(∇lp),AdvancedHMC.PhasePoint{CuArray{Float64,2},AdvancedHMC.DualValue{CuArray{Float64,1},CuArray{Float64,2}}}}}; name::String, kwargs::Base.Iterators.Pairs{Symbol,Int64,Tuple{Symbol},NamedTuple{(:maxthreads,),Tuple{Int64}}}) at /home/seth/.julia/packages/CUDA/dZvbp/src/compiler/execution.jl:298
 [21] macro expansion at /home/seth/.julia/packages/CUDA/dZvbp/src/compiler/execution.jl:109 [inlined]
 [22] (::KernelAbstractions.Kernel{CUDADevice,KernelAbstractions.NDIteration.StaticSize{(256,)},KernelAbstractions.NDIteration.DynamicSize,typeof(gpu_transition_kernel!)})(::AdvancedHMC.PhasePoint{CuArray{Float64,2},AdvancedHMC.DualValue{CuArray{Float64,1},CuArray{Float64,2}}}, ::Vararg{Any,N} where N; ndrange::Int64, dependencies::Nothing, workgroupsize::Nothing, progress::Function) at /home/seth/.julia/packages/KernelAbstractions/ppJ7l/src/backends/cuda.jl:185
 [23] transition! at ./In[15]:39 [inlined]
 [24] mytransition(::MersenneTwister, ::StaticTrajectory{MultinomialTS,Leapfrog{Float64}}, ::Hamiltonian{DiagEuclideanMetric{Float64,CuArray{Float64,2}},typeof(lpvec),typeof(∇lpvec)}, ::Function, ::Function, ::AdvancedHMC.PhasePoint{CuArray{Float64,2},AdvancedHMC.DualValue{CuArray{Float64,1},CuArray{Float64,2}}}) at ./In[15]:44
 [25] top-level scope at ./timing.jl:174 [inlined]
 [26] top-level scope at ./In[16]:0
 [27] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091
 [28] execute_code(::String, ::String) at /home/seth/.julia/packages/IJulia/rWZ9e/src/execute_request.jl:27
 [29] execute_request(::ZMQ.Socket, ::IJulia.Msg) at /home/seth/.julia/packages/IJulia/rWZ9e/src/execute_request.jl:86
 [30] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [31] invokelatest at ./essentials.jl:709 [inlined]
 [32] eventloop(::ZMQ.Socket) at /home/seth/.julia/packages/IJulia/rWZ9e/src/eventloop.jl:8
 [33] (::IJulia.var"#15#18")() at ./task.jl:356

On Slack, @vchuravy said:

"but @maleadt I would expect the error message to be better and give a backtrace to the user code that caused this."

and @maleadt said:

"yeah the validator should probably have a pattern for this, instead of just erroring in the optimizer"

Reflection macros do not perform validation

Kind of a silly error, but:

julia> function kernel(A)
         r_rhsW[1] -= A[1]
         nothing
       end
kernel (generic function with 1 method)

julia> CUDA.@device_code dir="bug" @cuda kernel(CUDA.zeros(10))
ERROR: LLVM error: Cannot select: 0x96c8458: ch = AtomicStore<(store release 8 into @0, addrspace 1)> 0x96c8938:1, 0x96c7e40, 0x96c8938
  0x96c7e40: i64 = NVPTXISD::Wrapper TargetGlobalAddress:i64<{}* addrspace(1)* @0> 0
    0x96c8250: i64 = TargetGlobalAddress<{}* addrspace(1)* @0> 0
  0x96c8938: i64,ch,glue = NVPTXISD::ProxyReg 0x96c8a70, 0x96c8590, 0x96c8a70:1
    0x96c8590: i64,ch,glue = NVPTXISD::LoadParam<(load 8)> 0x96c8388, Constant:i32<1>, Constant:i32<0>, 0x96c8388:1
      0x96c8118: i32 = Constant<1>
      0x96c7fe0: i32 = Constant<0>
      0x96c8388: ch,glue = NVPTXISD::CallArgEnd 0x96c8660, Constant:i32<1>, 0x96c8660:1
        0x96c8118: i32 = Constant<1>
        0x96c8660: ch,glue = NVPTXISD::LastCallArg 0x96c84c0, Constant:i32<1>, Constant:i32<1>, 0x96c84c0:1
          0x96c8118: i32 = Constant<1>
          0x96c8118: i32 = Constant<1>
          0x96c84c0: ch,glue = NVPTXISD::CallArg 0x96c88d0, Constant:i32<1>, Constant:i32<0>, 0x96c88d0:1
            0x96c8118: i32 = Constant<1>
            0x96c7fe0: i32 = Constant<0>
            0x96c88d0: ch,glue = NVPTXISD::CallArgBegin 0x96c8800, 0x96c8800:1
              0x96c8800: ch,glue = NVPTXISD::CallVoid 0x96c89a0, 0x96c81e8, 0x96c89a0:1
                0x96c81e8: i64 = NVPTXISD::Wrapper TargetGlobalAddress:i64<{}* ({}*, {}*)* @jl_get_binding_or_error> 0
                  0x96c8528: i64 = TargetGlobalAddress<{}* ({}*, {}*)* @jl_get_binding_or_error> 0
                0x96c89a0: ch,glue = NVPTXISD::PrintCallUni 0x96c8a08, Constant:i32<1>, 0x96c8a08:1
                  0x96c8118: i32 = Constant<1>
                  0x96c8a08: ch,glue = NVPTXISD::DeclareRet 0x96c8ad8, Constant:i32<1>, Constant:i32<64>, Constant:i32<0>, 0x96c8ad8:1




    0x96c8a70: ch,glue = callseq_end 0x96c8590:1, TargetConstant:i64<524>, TargetConstant:i64<525>, 0x96c8590:2
      0x96c8ba8: i64 = TargetConstant<524>
      0x96c7f78: i64 = TargetConstant<525>
      0x96c8590: i64,ch,glue = NVPTXISD::LoadParam<(load 8)> 0x96c8388, Constant:i32<1>, Constant:i32<0>, 0x96c8388:1
        0x96c8118: i32 = Constant<1>
        0x96c7fe0: i32 = Constant<0>
        0x96c8388: ch,glue = NVPTXISD::CallArgEnd 0x96c8660, Constant:i32<1>, 0x96c8660:1
          0x96c8118: i32 = Constant<1>
          0x96c8660: ch,glue = NVPTXISD::LastCallArg 0x96c84c0, Constant:i32<1>, Constant:i32<1>, 0x96c84c0:1
            0x96c8118: i32 = Constant<1>
            0x96c8118: i32 = Constant<1>
            0x96c84c0: ch,glue = NVPTXISD::CallArg 0x96c88d0, Constant:i32<1>, Constant:i32<0>, 0x96c88d0:1
              0x96c8118: i32 = Constant<1>
              0x96c7fe0: i32 = Constant<0>
              0x96c88d0: ch,glue = NVPTXISD::CallArgBegin 0x96c8800, 0x96c8800:1
                0x96c8800: ch,glue = NVPTXISD::CallVoid 0x96c89a0, 0x96c81e8, 0x96c89a0:1
                  0x96c81e8: i64 = NVPTXISD::Wrapper TargetGlobalAddress:i64<{}* ({}*, {}*)* @jl_get_binding_or_error> 0

                  0x96c89a0: ch,glue = NVPTXISD::PrintCallUni 0x96c8a08, Constant:i32<1>, 0x96c8a08:1


In function: _Z17julia_kernel_656713CuDeviceArrayI7Float32Li1ELi1EE

This is on Julia 1.6-rc1. Silly, since the verifier will just yell at me when I run the code normally.
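The jl_get_binding_or_error call in the dump above points at the lookup of the global r_rhsW, which the kernel never defines. A minimal CPU-only sketch of the same mistake, assuming a plain Array in place of the CuArray (the try/catch wrapper is added for illustration only):

```julia
# Same kernel body as in the report; `r_rhsW` is deliberately left undefined.
function kernel(A)
    r_rhsW[1] -= A[1]
    nothing
end

# On the CPU the undefined global surfaces as an ordinary run-time error.
err = try
    kernel(zeros(Float32, 10))
    nothing
catch e
    e
end

println(typeof(err), ": ", err.var)
```

On the host this is reported when the function runs, which is the kind of readable diagnostic the reflection path would ideally produce as well.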

Calling device function with CC from kernel with CC results in trap

; ModuleID = 'reduce.bc'
source_filename = "reduce.ll"

define ptx_device void @child() {
  ret void
}

define ptx_kernel void @parent() {
  call void @child()
  ret void
}

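This gets optimized to a trap: instcombine turns the call into a store i1 true, i1* undef, align 1, after which simplifycfg reduces it to a trap. Unsure why, but note that the call instruction itself carries no calling convention while @child is declared ptx_device, and LLVM treats a caller/callee calling-convention mismatch as undefined behavior.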

Decoding call to jl_apply_generic fails if arguments are stored in alloca

From: vchuravy/GPUifyLoops.jl#58 (comment)

Looks like the issue is that calls are first stored in an alloca.

┌ Warning: Decoding arguments to jl_apply_generic failed, please file a bug with a reproducer.
│   inst =   %8 = call nonnull %jl_value_t addrspace(10)* @jl_apply_generic(%jl_value_t addrspace(10)** nonnull %.sub, i32 1), !dbg !43
│   bb =
│
│    top:
│      %1 = alloca [2 x %jl_value_t addrspace(10)*], align 8
│      %.sub = getelementptr inbounds [2 x %jl_value_t addrspace(10)*], [2 x %jl_value_t addrspace(10)*]* %1, i64 0, i64 0
│      %2 = call fastcc %jl_value_t addrspace(10)* @ptx_gc_pool_alloc(i64 8), !dbg !43
│      %3 = getelementptr inbounds [1 x i64], [1 x i64] addrspace(11)* %0, i64 0, i64 0, !dbg !43
│      %4 = bitcast %jl_value_t addrspace(10)* %2 to i64 addrspace(10)*, !dbg !43
│      %5 = load i64, i64 addrspace(11)* %3, align 8, !dbg !43, !tbaa !44, !invariant.load !4
│      store i64 %5, i64 addrspace(10)* %4, align 8, !dbg !43, !tbaa !47
│      store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140429411225104 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %.sub, align 8, !dbg !43
│      %6 = getelementptr inbounds [2 x %jl_value_t addrspace(10)*], [2 x %jl_value_t addrspace(10)*]* %1, i64 0, i64 1, !dbg !43
│      store %jl_value_t addrspace(10)* %2, %jl_value_t addrspace(10)** %6, align 8, !dbg !43
│      %7 = call nonnull %jl_value_t addrspace(10)* @jl_f_apply_type(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* null to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** nonnull %.sub, i32 2), !dbg !43
│      store %jl_value_t addrspace(10)* %7, %jl_value_t addrspace(10)** %.sub, align 8, !dbg !43
│      %8 = call nonnull %jl_value_t addrspace(10)* @jl_apply_generic(%jl_value_t addrspace(10)** nonnull %.sub, i32 1), !dbg !43
│      ret %jl_value_t addrspace(10)* %8, !dbg !43
│
└ @ CUDAnative ~/.julia/packages/CUDAnative/wU0tS/src/compiler/validation.jl:213

"Unexpected internal error" caused by mapping over CuArray

I started experimenting with CUDA.jl yesterday and managed to produce an error with a simple example. The error message requested that I file an issue here. My apologies if this is caused by my doing something naive -- I'm just following orders! A minimal example that produces the error is this:

module MinimalFailure

using CUDA

function badfunc!(i::Int, results::CuArray{Float32,1})
    results[i] = 0.0f0
    return
end

n=1024
data = CuArray{Float32,1}(undef,n)
idcs = CuArray(collect(1:n))
map(i -> badfunc!(i,data), idcs)

end

The resulting error message, system info and backtrace are below:

WARNING: replacing module MinimalFailure.
ERROR: LoadError: GPUCompiler.jl encountered an unexpected internal error.
Please file an issue attaching the following information, including the backtrace,
as well as a reproducible example (if possible).

InternalCompilerError: length(frames) == 1, at /home/dahlbom/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:246

Compiler invocation: PTX CompilerJob of kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceArray{Nothing,1,1}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},Main.MinimalFailure.var"#1#2",Tuple{Base.Broadcast.Extruded{CuDeviceArray{Int64,1,1},Tuple{Bool},Tuple{Int64}}}}, Int64) for sm_61

Additional information:
 - frames = svec(svec(:jl_symbol_name_, Symbol("/buildworker/worker/package_linux64/build/src/julia.h"), 970, nothing, true, true), svec(:jl_symbol_name, Symbol("/buildworker/worker/package_linux64/build/src/rtutils.c"), 333, nothing, true, false))

Installed packages:
 - CommonSolve = v"0.2.0"
 - DelNetExperiment = v"0.1.0"
 - Requires = v"1.1.3"
 - Gtk = v"1.1.7"
 - GPUArrays = v"6.2.1"
 - ArrayLayouts = v"0.3.8"
 - CommonSubexpressions = v"0.3.0"
 - IteratorInterfaceExtensions = v"1.0.0"
 - Xorg_libXtst_jll = v"1.2.3+4"
 - Graphics = v"1.1.0"
 - GeometryBasics = v"0.2.15"
 - CodeTracking = v"1.0.5"
 - Makie = v"0.11.0"
 - Libiconv_jll = v"1.16.0+7"
 - JuliaVariables = v"0.2.4"
 - NLSolversBase = v"7.5.0"
 - FileIO = v"1.6.5"
 - Base64 = nothing
 - GLM = v"1.4.1"
 - Wayland_protocols_jll = v"1.18.0+4"
 - Colors = v"0.12.7"
 - IndirectArrays = v"0.5.1"
 - PyPlot = v"2.9.0"
 - Profile = nothing
 - PDMats = v"0.10.1"
 - WAV = v"1.1.0"
 - LibGit2 = nothing
 - PyCall = v"1.92.2"
 - VectorizationBase = v"0.16.2"
 - Graphene_jll = v"1.10.0+2"
 - IfElse = v"0.1.0"
 - OrderedCollections = v"1.4.0"
 - TimerOutputs = v"0.5.8"
 - LibVPX_jll = v"1.9.0+1"
 - Tables = v"1.4.1"
 - StatsMakie = v"0.2.3"
 - OpenSSL_jll = v"1.1.1+6"
 - ImageCore = v"0.8.21"
 - VersionParsing = v"1.2.0"
 - Ogg_jll = v"1.3.4+2"
 - CairoMakie = v"0.3.4"
 - Juno = v"0.8.4"
 - ColorVectorSpace = v"0.8.7"
 - LeftChildRightSiblingTrees = v"0.1.2"
 - CodeTools = v"0.7.1"
 - MakieLayout = v"0.9.10"
 - LNR = v"0.2.1"
 - Libepoxy_jll = v"1.5.4+1"
 - libvorbis_jll = v"1.3.6+6"
 - RecursiveFactorization = v"0.1.11"
 - MosaicViews = v"0.2.4"
 - GPUCompiler = v"0.8.3"
 - VertexSafeGraphs = v"0.1.2"
 - Xorg_xkbcomp_jll = v"1.4.2+4"
 - Combinatorics = v"1.0.2"
 - Zlib_jll = v"1.2.11+18"
 - Markdown = nothing
 - Xorg_libXcomposite_jll = v"0.4.5+4"
 - NNlib = v"0.7.11"
 - Formatting = v"0.4.2"
 - GTK3_jll = v"3.24.11+5"
 - Adapt = v"3.2.0"
 - Opus_jll = v"1.3.1+3"
 - Plots = v"1.11.2"
 - SharedArrays = nothing
 - Dbus_jll = v"1.12.16+3"
 - ProgressMeter = v"1.5.0"
 - Pidfile = v"1.2.0"
 - Missings = v"0.4.5"
 - Xorg_libXext_jll = v"1.3.4+4"
 - AssetRegistry = v"0.1.0"
 - StructArrays = v"0.4.2"
 - SpecialFunctions = v"0.10.3"
 - Observables = v"0.3.3"
 - Libtiff_jll = v"4.1.0+2"
 - RandomExtensions = v"0.4.3"
 - FuzzyCompletions = v"0.4.1"
 - DocSeeker = v"0.4.3"
 - Blosc_jll = v"1.14.3+1"
 - IOCapture = v"0.1.1"
 - ModelingToolkit = v"3.20.1"
 - GeneralizedGenerated = v"0.2.8"
 - FunctionWrappers = v"1.1.2"
 - Xorg_libXinerama_jll = v"1.1.4+4"
 - FunctionalCollections = v"0.5.0"
 - MacroTools = v"0.5.6"
 - EzXML = v"1.1.0"
 - ImageIO = v"0.2.0"
 - DiffEqCallbacks = v"2.13.5"
 - WoodburyMatrices = v"0.5.3"
 - WebIO = v"0.8.15"
 - NameResolution = v"0.1.5"
 - Test = nothing
 - PkgTemplates = v"0.7.16"
 - hicolor_icon_theme_jll = v"0.17.0+3"
 - Hexagons = v"0.2.0"
 - DataFrames = v"0.21.8"
 - BFloat16s = v"0.1.0"
 - JuliaInterpreter = v"0.8.11"
 - ThreadingUtilities = v"0.2.3"
 - FixedPointNumbers = v"0.8.4"
 - OrdinaryDiffEq = v"5.39.1"
 - LZO_jll = v"2.10.0+3"
 - FFMPEG_jll = v"4.3.1+4"
 - SHA = nothing
 - ExprTools = v"0.1.3"
 - TableTraitsUtils = v"1.0.1"
 - at_spi2_atk_jll = v"2.34.1+4"
 - xkbcommon_jll = v"0.9.1+5"
 - Hwloc = v"1.3.0"
 - RecursiveArrayTools = v"2.11.1"
 - LoopVectorization = v"0.10.0"
 - Xorg_libXfixes_jll = v"5.0.3+4"
 - Xorg_libXcursor_jll = v"1.2.0+4"
 - Parsers = v"1.1.0"
 - FreeTypeAbstraction = v"0.8.4"
 - ZygoteRules = v"0.2.1"
 - FlameGraphs = v"0.2.5"
 - Serialization = nothing
 - ForwardDiff = v"0.10.17"
 - MLStyle = v"0.4.9"
 - Xorg_libXdmcp_jll = v"1.1.3+4"
 - Grisu = v"1.0.0"
 - Fontconfig_jll = v"2.13.1+14"
 - libpng_jll = v"1.6.37+6"
 - Qt5Base_jll = v"5.15.2+0"
 - PositiveFactorizations = v"0.2.4"
 - Hiccup = v"0.2.2"
 - Xorg_xtrans_jll = v"1.4.0+3"
 - FillArrays = v"0.8.14"
 - QuadGK = v"2.4.1"
 - Zstd_jll = v"1.4.8+0"
 - MbedTLS = v"1.0.3"
 - AbstractFFTs = v"1.0.1"
 - JLLWrappers = v"1.2.0"
 - adwaita_icon_theme_jll = v"3.33.92+5"
 - ExponentialUtilities = v"1.8.1"
 - Future = nothing
 - Artifacts = v"1.3.0"
 - Xorg_xcb_util_jll = v"0.4.0+1"
 - PaddedViews = v"0.5.8"
 - Mmap = nothing
 - ParameterizedFunctions = v"5.6.0"
 - SignedDistanceFields = v"0.4.0"
 - FFMPEG = v"0.3.0"
 - Random123 = v"1.3.1"
 - Sockets = nothing
 - Xorg_libXrandr_jll = v"1.5.2+4"
 - SparseDiffTools = v"1.13.0"
 - Lazy = v"0.15.1"
 - ShiftedArrays = v"1.0.0"
 - LazyArtifacts = v"1.3.0"
 - CEnum = v"0.4.1"
 - DataAPI = v"1.6.0"
 - Statistics = nothing
 - Cairo_jll = v"1.16.0+6"
 - DiffRules = v"1.0.2"
 - CompilerSupportLibraries_jll = v"0.3.4+0"
 - METIS_jll = v"5.1.0+5"
 - DataValues = v"0.4.13"
 - NLsolve = v"4.5.1"
 - AxisAlgorithms = v"1.0.0"
 - Expat_jll = v"2.2.7+6"
 - DiffEqFinancial = v"2.4.0"
 - Sundials_jll = v"5.2.0+1"
 - CategoricalArrays = v"0.8.3"
 - AbstractTrees = v"0.3.4"
 - GenericSVD = v"0.3.0"
 - Contour = v"0.5.7"
 - Logging = nothing
 - AuditoryFilters = v"0.1.0"
 - PrettyPrint = v"0.2.0"
 - LoggingExtras = v"0.4.6"
 - ArnoldiMethod = v"0.1.0"
 - MultiScaleArrays = v"1.8.1"
 - GR_jll = v"0.57.1+0"
 - SteadyStateDiffEq = v"1.6.2"
 - CanonicalTraits = v"0.2.4"
 - libfdk_aac_jll = v"0.1.6+4"
 - Dates = nothing
 - ChainRulesCore = v"0.9.37"
 - LAME_jll = v"3.100.0+3"
 - Showoff = v"0.3.2"
 - gdk_pixbuf_jll = v"2.38.2+9"
 - Blosc = v"0.7.0"
 - FFTW_jll = v"3.3.9+7"
 - Optim = v"0.20.1"
 - HTTP = v"0.8.19"
 - Pkg = nothing
 - Xorg_libxkbfile_jll = v"1.1.0+4"
 - Interpolations = v"0.12.10"
 - FoldingTrees = v"1.0.1"
 - GLFW_jll = v"3.3.3+0"
 - Match = v"1.1.0"
 - SLEEFPirates = v"0.6.8"
 - Lz4_jll = v"1.9.2+2"
 - LightGraphs = v"1.3.5"
 - HDF5_jll = v"1.12.0+1"
 - Xorg_libX11_jll = v"1.6.9+4"
 - DataStructures = v"0.17.20"
 - Rmath_jll = v"0.2.2+2"
 - StatsFuns = v"0.9.7"
 - DataFramesMeta = v"0.6.0"
 - FiniteDiff = v"2.8.0"
 - Libffi_jll = v"3.2.1+4"
 - Compat = v"2.2.1"
 - Documenter = v"0.26.3"
 - Conda = v"1.5.1"
 - REPL = nothing
 - ATK_jll = v"2.34.1+5"
 - Xorg_libXdamage_jll = v"1.1.5+4"
 - LoweredCodeUtils = v"1.2.9"
 - AbstractPlotting = v"0.11.2"
 - FileWatching = nothing
 - MKL_jll = v"2021.1.1+1"
 - DocStringExtensions = v"0.8.4"
 - Animations = v"0.4.1"
 - Random = nothing
 - DimensionalPlotRecipes = v"1.2.0"
 - OpenSpecFun_jll = v"0.5.3+4"
 - Libglvnd_jll = v"1.3.0+3"
 - Distributions = v"0.23.8"
 - Libdl = nothing
 - BandedMatrices = v"0.15.15"
 - Xorg_xcb_util_keysyms_jll = v"0.4.0+1"
 - OpenBLAS_jll = v"0.3.9+5"
 - PooledArrays = v"0.5.3"
 - WebSockets = v"1.5.9"
 - MFCC = v"0.3.1"
 - DiffEqPhysics = v"3.9.0"
 - Xorg_xcb_util_renderutil_jll = v"0.3.9+1"
 - Xorg_libXrender_jll = v"0.9.10+4"
 - StimAnaGen = v"0.1.0"
 - DelimitedFiles = nothing
 - ResettableStacks = v"1.1.0"
 - SparseArrays = nothing
 - Ratios = v"0.4.0"
 - InvertedIndices = v"1.0.0"
 - Graphite2_jll = v"1.3.13+4"
 - PCRE_jll = v"8.42.0+4"
 - Roots = v"1.0.8"
 - UnPack = v"1.0.2"
 - Intervals = v"1.5.0"
 - SimpleTraits = v"0.9.3"
 - LaTeXStrings = v"1.2.1"
 - PyPeriModels = v"0.1.0"
 - Measures = v"0.3.1"
 - Glib_jll = v"2.59.0+4"
 - GLMakie = v"0.1.5"
 - Xorg_libXau_jll = v"1.0.9+4"
 - GridLayoutBase = v"0.3.7"
 - x264_jll = v"2020.7.14+2"
 - DifferentialEquations = v"6.16.0"
 - PlotUtils = v"1.0.10"
 - Libuuid_jll = v"2.34.0+7"
 - Atom = v"0.12.30"
 - StringDistances = v"0.10.0"
 - RecipesBase = v"1.1.1"
 - GLFW = v"3.4.1"
 - DelayDiffEq = v"5.24.1"
 - XSLT_jll = v"1.1.33+4"
 - LabelledArrays = v"1.6.0"
 - Mustache = v"1.0.10"
 - Cthulhu = v"1.6.1"
 - Pandas = v"1.4.0"
 - JpegTurbo_jll = v"2.0.1+3"
 - ImageMagick_jll = v"6.9.10-12+3"
 - ConsoleProgressMonitor = v"0.1.2"
 - Revise = v"3.1.14"
 - SafeTestsets = v"0.0.1"
 - XML2_jll = v"2.9.10+3"
 - Libgcrypt_jll = v"1.8.5+4"
 - DSP = v"0.6.10"
 - DataValueInterfaces = v"1.0.0"
 - Crayons = v"4.0.4"
 - AbstractAlgebra = v"0.11.2"
 - ColorSchemes = v"3.11.0"
 - SymbolicUtils = v"0.5.2"
 - SciMLBase = v"1.10.1"
 - Libgpg_error_jll = v"1.36.0+3"
 - DiffResults = v"1.0.3"
 - JuliaFormatter = v"0.12.3"
 - Unitful = v"1.7.0"
 - FreeType = v"3.0.1"
 - IniFile = v"0.5.0"
 - x265_jll = v"3.0.0+3"
 - TableTraits = v"1.0.1"
 - LinearAlgebra = nothing
 - MeshIO = v"0.4.0"
 - Pixman_jll = v"0.40.0+0"
 - libass_jll = v"0.14.0+4"
 - FFTW = v"1.3.2"
 - IterativeSolvers = v"0.8.5"
 - GR = v"0.57.1"
 - Xorg_xkeyboard_config_jll = v"2.27.0+4"
 - UnicodeFun = v"0.4.1"
 - Packing = v"0.4.1"
 - PNGFiles = v"0.2.1"
 - Sundials = v"4.2.6"
 - TimeZones = v"1.5.3"
 - Xorg_xcb_util_image_jll = v"0.4.0+1"
 - CommonMark = v"0.6.4"
 - InteractiveUtils = nothing
 - StatsModels = v"0.6.21"
 - Polynomials = v"1.2.1"
 - Distributed = nothing
 - RecipesPipeline = v"0.3.2"
 - SuiteSparse = nothing
 - SortingAlgorithms = v"0.3.1"
 - CSTParser = v"2.5.0"
 - nghttp2_jll = v"1.40.0+2"
 - JLD = v"0.10.0"
 - KernelDensity = v"0.5.1"
 - DiffEqNoiseProcess = v"5.5.2"
 - Wayland_jll = v"1.17.0+4"
 - TerminalLoggers = v"0.1.3"
 - OffsetArrays = v"1.6.2"
 - Reexport = v"0.2.0"
 - JSON = v"0.21.1"
 - ColorTypes = v"0.10.12"
 - Inflate = v"0.1.2"
 - Tokenize = v"0.5.15"
 - Xorg_libXi_jll = v"1.7.10+4"
 - StaticArrays = v"0.12.5"
 - Media = v"0.5.0"
 - EllipsisNotation = v"1.0.0"
 - StatsBase = v"0.32.2"
 - URIParser = v"0.4.1"
 - iso_codes_jll = v"4.3.0+4"
 - DiffEqDiffTools = v"1.7.0"
 - ArrayInterface = v"2.14.17"
 - Xorg_libpthread_stubs_jll = v"0.1.0+3"
 - Unicode = nothing
 - LineSearches = v"7.1.1"
 - ConstructionBase = v"1.1.0"
 - Pango_jll = v"1.42.4+10"
 - CoupledFields = v"0.2.0"
 - MuladdMacro = v"0.2.2"
 - BoundaryValueDiffEq = v"2.7.1"
 - BenchmarkTools = v"0.7.0"
 - Wavelets = v"0.9.2"
 - at_spi2_core_jll = v"2.34.0+4"
 - StructTypes = v"1.5.2"
 - Compose = v"0.9.2"
 - Libmount_jll = v"2.34.0+3"
 - ColorBrewer = v"0.4.0"
 - Rmath = v"0.6.1"
 - Loess = v"0.5.3"
 - CUDA = v"2.4.1"
 - ICU_jll = v"67.1.0+3"
 - LibCURL_jll = v"7.70.0+2"
 - FreeType2_jll = v"2.10.1+5"
 - MappedArrays = v"0.3.0"
 - StochasticDiffEq = v"6.25.0"
 - Xorg_libxcb_jll = v"1.13.0+3"
 - ModernGL = v"1.1.2"
 - Gettext_jll = v"0.20.1+7"
 - ShaderAbstractions = v"0.2.5"
 - Distances = v"0.10.2"
 - SuiteSparse_jll = v"5.4.0+9"
 - FriBidi_jll = v"1.0.5+6"
 - Cairo = v"1.0.5"
 - MbedTLS_jll = v"2.16.8+1"
 - Scratch = v"1.0.3"
 - LLVM = v"3.6.0"
 - Mocking = v"0.7.1"
 - IntervalSets = v"0.5.3"
 - LibSSH2_jll = v"1.9.0+3"
 - Widgets = v"0.6.2"
 - DiffEqBase = v"6.44.3"
 - TreeViews = v"0.3.0"
 - Hwloc_jll = v"2.4.1+0"
 - UUIDs = nothing
 - PoissonRandom = v"0.4.0"
 - IntelOpenMP_jll = v"2018.0.3+2"
 - NaNMath = v"0.3.5"
 - ImageMagick = v"1.2.0"
 - Latexify = v"0.14.12"
 - Calculus = v"0.5.1"
 - Xorg_xcb_util_wm_jll = v"0.4.1+1"
 - IterTools = v"1.3.0"
 - Parameters = v"0.12.2"
 - Bzip2_jll = v"1.0.6+5"
 - HDF5 = v"0.13.6"
 - HarfBuzz_jll = v"2.6.1+10"
 - Gadfly = v"1.3.2"
 - PlotThemes = v"2.0.1"
 - Printf = nothing
 - DiffEqJump = v"6.10.0"
 - ProgressLogging = v"0.1.4"
 - RandomNumbers = v"1.4.0"

Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, broadwell)

Stacktrace:
 [1] check_ir!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}, ::Array{Tuple{String,Array{Base.StackTraces.StackFrame,1},Any},1}, ::LLVM.CallInst) at /home/dahlbom/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:246
 [2] check_ir!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}, ::Array{Tuple{String,Array{Base.StackTraces.StackFrame,1},Any},1}, ::LLVM.Function) at /home/dahlbom/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:140
 [3] check_ir!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}, ::Array{Tuple{String,Array{Base.StackTraces.StackFrame,1},Any},1}, ::LLVM.Module) at /home/dahlbom/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:131
 [4] check_ir(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}, ::LLVM.Module) at /home/dahlbom/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:120
 [5] macro expansion at /home/dahlbom/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:239 [inlined]
 [6] macro expansion at /home/dahlbom/.julia/packages/TimerOutputs/4QAIk/src/TimerOutput.jl:206 [inlined]
 [7] codegen(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/dahlbom/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:237
 [8] compile(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/dahlbom/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:39
 [9] compile at /home/dahlbom/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:35 [inlined]
 [10] cufunction_compile(::GPUCompiler.FunctionSpec; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/dahlbom/.julia/packages/CUDA/wTQsK/src/compiler/execution.jl:302
 [11] cufunction_compile(::GPUCompiler.FunctionSpec) at /home/dahlbom/.julia/packages/CUDA/wTQsK/src/compiler/execution.jl:297
 [12] check_cache(::Dict{UInt64,Any}, ::Any, ::Any, ::GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#14",Tuple{CUDA.CuKernelContext,CuDeviceArray{Nothing,1,1},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},Main.MinimalFailure.var"#1#2",Tuple{Base.Broadcast.Extruded{CuDeviceArray{Int64,1,1},Tuple{Bool},Tuple{Int64}}}},Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/dahlbom/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:40
 [13] broadcast_kernel at /home/dahlbom/.julia/packages/GPUArrays/bjw3g/src/host/broadcast.jl:57 [inlined]
 [14] cached_compilation at /home/dahlbom/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:65 [inlined]
 [15] cufunction(::GPUArrays.var"#broadcast_kernel#14", ::Type{Tuple{CUDA.CuKernelContext,CuDeviceArray{Nothing,1,1},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},Main.MinimalFailure.var"#1#2",Tuple{Base.Broadcast.Extruded{CuDeviceArray{Int64,1,1},Tuple{Bool},Tuple{Int64}}}},Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/dahlbom/.julia/packages/CUDA/wTQsK/src/compiler/execution.jl:289
 [16] cufunction at /home/dahlbom/.julia/packages/CUDA/wTQsK/src/compiler/execution.jl:286 [inlined]
 [17] macro expansion at /home/dahlbom/.julia/packages/CUDA/wTQsK/src/compiler/execution.jl:100 [inlined]
 [18] #launch_heuristic#857 at /home/dahlbom/.julia/packages/CUDA/wTQsK/src/gpuarrays.jl:17 [inlined]
 [19] launch_heuristic at /home/dahlbom/.julia/packages/CUDA/wTQsK/src/gpuarrays.jl:17 [inlined]
 [20] copyto! at /home/dahlbom/.julia/packages/GPUArrays/bjw3g/src/host/broadcast.jl:63 [inlined]
 [21] copyto! at ./broadcast.jl:886 [inlined]
 [22] copy at /home/dahlbom/.julia/packages/GPUArrays/bjw3g/src/host/broadcast.jl:47 [inlined]
 [23] materialize(::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1},Nothing,Main.MinimalFailure.var"#1#2",Tuple{CuArray{Int64,1}}}) at ./broadcast.jl:837
 [24] map(::Function, ::CuArray{Int64,1}) at /home/dahlbom/.julia/packages/GPUArrays/bjw3g/src/host/broadcast.jl:86
 [25] top-level scope at /home/dahlbom/research/jcuda/MinimalFailure.jl:13
 [26] include(::String) at ./client.jl:457
 [27] top-level scope at REPL[4]:1
in expression starting at /home/dahlbom/research/jcuda/MinimalFailure.jl:13
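The compiler invocation in the error above lists the destination as CuDeviceArray{Nothing,1,1}: the mapped closure returns nothing, so map tries to collect an array of Nothing values. The following CPU-only sketch shows the same shape; plain Arrays stand in for CuArray, and it illustrates the element type only, not a diagnosis of the internal error itself:

```julia
# CPU analogue of the reproducer; Vector{Float32} replaces CuArray{Float32,1}.
function badfunc!(i::Int, results::Vector{Float32})
    results[i] = 0.0f0
    return            # returns `nothing`
end

n = 8
data = ones(Float32, n)
out = map(i -> badfunc!(i, data), collect(1:n))

# map collects the closure's return values, so the result holds only `nothing`.
println(eltype(out))

# To zero an array, mutating it directly avoids the side-effecting closure.
fill!(data, 0.0f0)
```

Writing the update as fill! or as a broadcast assignment keeps the operation expressed in terms of values rather than side effects.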

CPU execution

It would be useful if we could execute code on the CPU, both for testing and to extend the usability of this package. Regular execution should be pretty easy:

using GPUCompiler, LLVM


## runtime implementation

module NativeRuntime

# FIXME: actually implement these
signal_exception() = return
malloc(sz) = return
report_oom(sz) = return
report_exception(ex) = return
report_exception_name(ex) = return
report_exception_frame(idx, func, file, line) = return
end

## target

struct NativeCompilerTarget <: AbstractCompilerTarget
end

GPUCompiler.runtime_module(::NativeCompilerTarget) = NativeRuntime

GPUCompiler.llvm_triple(::NativeCompilerTarget) = Sys.MACHINE


## job

struct NativeCompilerJob <: AbstractCompilerJob
    target::NativeCompilerTarget
    source::FunctionSpec
end

Base.similar(job::NativeCompilerJob, source::FunctionSpec) =
    NativeCompilerJob(job.target, source)

GPUCompiler.target(job::NativeCompilerJob) = job.target
GPUCompiler.source(job::NativeCompilerJob) = job.source

GPUCompiler.runtime_slug(::NativeCompilerJob) = "native"


## main

function kernel()
end

function run(mod::LLVM.Module, entry::LLVM.Function)
    res_jl = 0.0
    LLVM.JIT(mod) do engine
        f = LLVM.functions(engine)[LLVM.name(entry)]
        res = LLVM.run(engine, f)
        LLVM.dispose(res)
    end
    return
end

function main()
    target = NativeCompilerTarget()
    source = FunctionSpec(kernel)
    job = NativeCompilerJob(target,source)

    mod, entry = GPUCompiler.compile(:llvm, job)

    run(mod, entry)
end

What is left to implement is the runtime. We could print by, e.g., linking the C runtime and calling printf. However, it would be vastly more useful if we could actually reuse the full Julia runtime. This should be possible with the LLVM ORC JIT, which supports looking up external functions and globals. https://www.doof.me.uk/2017/05/11/using-orc-with-llvms-c-api/
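As a rough illustration of the "link the C runtime" option, the stubs could forward to libc through ccall. This is only a host-callable sketch: the module name NativeRuntimeSketch, the argument types, and the message format are assumptions, and it says nothing about how compiled kernels would actually resolve these symbols.

```julia
# Hypothetical stand-ins for some of the stubbed runtime functions.
module NativeRuntimeSketch

# Allocate `sz` bytes with the C runtime's malloc.
malloc(sz::Integer) = ccall(:malloc, Ptr{Cvoid}, (Csize_t,), sz)

# Release memory obtained from `malloc` above.
free(ptr::Ptr{Cvoid}) = ccall(:free, Cvoid, (Ptr{Cvoid},), ptr)

# Print the exception name with the C runtime's puts; returns puts' status code.
report_exception(ex::AbstractString) =
    ccall(:puts, Cint, (Cstring,), "ERROR: a $ex was thrown")

end

ptr = NativeRuntimeSketch.malloc(16)
status = NativeRuntimeSketch.report_exception("BoundsError")
NativeRuntimeSketch.free(ptr)
```

Reusing the Julia runtime instead would replace these hand-written wrappers with symbol lookups done by the JIT.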

Killed exception argument stack slots are not cleaned-up

Dead simple kernel without inbounds:

julia> function kernel(eds)
         eds[1] = 1
         return nothing
       end

julia> CUDA.code_llvm(kernel, Tuple{CuDeviceArray{Int,2,AS.Global}}; debuginfo=:none)

define dso_local void @julia_kernel_1546({ [2 x i64], i64 }* nocapture nonnull readonly dereferenceable(24)) local_unnamed_addr {
top:
  %1 = alloca [1 x i64], align 8
  %2 = getelementptr inbounds [1 x i64], [1 x i64]* %1, i64 0, i64 0
  store i64 1, i64* %2, align 8
  %3 = getelementptr inbounds { [2 x i64], i64 }, { [2 x i64], i64 }* %0, i64 0, i32 0, i64 0
  %4 = getelementptr inbounds { [2 x i64], i64 }, { [2 x i64], i64 }* %0, i64 0, i32 0, i64 1
  %5 = load i64, i64* %3, align 8
  %6 = load i64, i64* %4, align 8
  %7 = mul i64 %6, %5
  %8 = icmp slt i64 %7, 1
  br i1 %8, label %L14, label %L17

L14:                                              ; preds = %top
  call fastcc void @julia_throw_boundserror_1548()
  call void asm sideeffect "exit;", ""() #2
  br label %L17

L17:                                              ; preds = %L14, %top
  %9 = getelementptr inbounds { [2 x i64], i64 }, { [2 x i64], i64 }* %0, i64 0, i32 1
  %10 = bitcast i64* %9 to i64 addrspace(1)**
  %11 = load i64 addrspace(1)*, i64 addrspace(1)** %10, align 8
  store i64 1, i64 addrspace(1)* %11, align 8
  ret void
}

Note the nasty, unused alloca (%1), which trashes performance: it causes a local memory access by every CUDA thread.

The unoptimized IR confirms that this slot holds the index argument of the BoundsError:

julia> CUDA.code_llvm(kernel, Tuple{CuDeviceArray{Int,2,AS.Global}}; debuginfo=:none, optimize=false)

define dso_local void @julia_kernel_1561({ [2 x i64], i64 } addrspace(11)* nocapture nonnull readonly dereferenceable(24)) local_unnamed_addr {
top:
  %1 = alloca [1 x i64]
...
L2:                                               ; preds = %top
  %7 = getelementptr inbounds [1 x i64], [1 x i64]* %1, i32 0, i32 0
  store i64 1, i64* %7, align 8
...
L14:                                              ; preds = %L2
  %26 = addrspacecast [1 x i64]* %1 to [1 x i64] addrspace(11)*
  %27 = call fastcc nonnull %jl_value_t addrspace(10)* @julia_throw_boundserror_1563({ [2 x i64], i64 } addrspace(11)* nocapture readonly %0, [1 x i64] addrspace(11)* nocapture readonly %26)
  call void asm sideeffect "exit;", ""()
  br label %L13
...
}

Doing an additional SROA cleans that up:

$ opt -sroa --filter-print-funcs=julia_kernel_1574 --print-before-all --print-after-all test.ll -o /dev/null
*** IR Dump Before SROA ***
define dso_local void @julia_kernel_1574({ [2 x i64], i64 }* nocapture nonnull readonly dereferenceable(24) %0) local_unnamed_addr {
top:
  %1 = alloca [1 x i64], align 8
  %2 = getelementptr inbounds [1 x i64], [1 x i64]* %1, i64 0, i64 0
  store i64 1, i64* %2, align 8
  %3 = getelementptr inbounds { [2 x i64], i64 }, { [2 x i64], i64 }* %0, i64 0, i32 0, i64 0
  %4 = getelementptr inbounds { [2 x i64], i64 }, { [2 x i64], i64 }* %0, i64 0, i32 0, i64 1
  %5 = load i64, i64* %3, align 8
  %6 = load i64, i64* %4, align 8
  %7 = mul i64 %6, %5
  %8 = icmp slt i64 %7, 1
  br i1 %8, label %L14, label %L17
L14:                                              ; preds = %top
  call fastcc void @julia_throw_boundserror_1576()
  call void asm sideeffect "exit;", ""() #2
  br label %L17
L17:                                              ; preds = %L14, %top
  %9 = getelementptr inbounds { [2 x i64], i64 }, { [2 x i64], i64 }* %0, i64 0, i32 1
  %10 = bitcast i64* %9 to i64 addrspace(1)**
  %11 = load i64 addrspace(1)*, i64 addrspace(1)** %10, align 8
  store i64 1, i64 addrspace(1)* %11, align 8
  ret void
}
*** IR Dump After SROA ***
define dso_local void @julia_kernel_1574({ [2 x i64], i64 }* nocapture nonnull readonly dereferenceable(24) %0) local_unnamed_addr {
top:
  %1 = getelementptr inbounds { [2 x i64], i64 }, { [2 x i64], i64 }* %0, i64 0, i32 0, i64 0
  %2 = getelementptr inbounds { [2 x i64], i64 }, { [2 x i64], i64 }* %0, i64 0, i32 0, i64 1
  %3 = load i64, i64* %1, align 8
  %4 = load i64, i64* %2, align 8
  %5 = mul i64 %4, %3
  %6 = icmp slt i64 %5, 1
  br i1 %6, label %L14, label %L17
L14:                                              ; preds = %top
  call fastcc void @julia_throw_boundserror_1576()
  call void asm sideeffect "exit;", ""() #2
  br label %L17
L17:                                              ; preds = %L14, %top
  %7 = getelementptr inbounds { [2 x i64], i64 }, { [2 x i64], i64 }* %0, i64 0, i32 1
  %8 = bitcast i64* %7 to i64 addrspace(1)**
  %9 = load i64 addrspace(1)*, i64 addrspace(1)** %8, align 8
  store i64 1, i64 addrspace(1)* %9, align 8
  ret void
}

TODO: shouldn't Base's pipeline catch this? If not, add this to the GPU-specific pipeline.
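The [1 x i64] slot in the IR above matches the index tuple that a bounds check hands to throw_boundserror. A minimal CPU sketch, assuming a plain Matrix and an explicit checkbounds call as stand-ins for the device array and its bounds check:

```julia
# An empty matrix guarantees that linear index 1 is out of bounds.
A = zeros(Int, 0, 0)

# `checkbounds` throws BoundsError(A, I), where I is the tuple of indices.
err = try
    checkbounds(A, 1)
    nothing
catch e
    e
end

# The exception carries the one-element index tuple, i.e. the value that
# needed a stack slot before the exception argument was killed.
println(err.i)
```

Once the exception arguments are dropped for the GPU, nothing reads that tuple any more, which is why an extra SROA run can delete the slot.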

julia_method_table function names

As encountered on 1.7:

@0 = private unnamed_addr constant [36 x i8] c"ERROR: Out-of-bounds array access.\0A\00", align 1

; Function Attrs: noinline noreturn
define internal fastcc void @julia_method_table_3017() unnamed_addr #2 {
top:
  %0 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([36 x i8], [36 x i8]* @0, i64 0, i64 0))
  call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception.16 to i64))
  call fastcc void @gpu_signal_exception()
  call void @llvm.trap()
  unreachable
}

L67.i:                                            ; preds = %L51.i
  %58 = getelementptr inbounds { [1 x [2 x [1 x i64]]] }, { [1 x [2 x [1 x i64]]] }* %2, i64 0, i32 0
  call fastcc void @julia_method_table_3017() #0
  unreachable

This is a @device_override method from quirks.jl, originally Base.throw_boundserror. Looks like the method overlay tables somehow mess with the function name.

unsupported use of an undefined name (use of 'pow')

With

CUDA #master
GPUCompiler          : 0.11.3

this code

using CUDA
a = CUDA.ones(10)
function kernel(a)
    i = threadIdx().x
    a[i] = CUDA.pow(a[i], a[i])
    return nothing
end
@cuda threads=length(a) kernel(a)

fails

ERROR: InvalidIRError: compiling kernel kernel(CuDeviceVector{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation
Stacktrace:
 [1] kernel
   @ REPL[7]:3
Reason: unsupported dynamic function invocation (call to convert)
Stacktrace:
 [1] setindex!
   @ ~\.julia\packages\CUDA\RMfZH\src\device\array.jl:103
 [2] kernel
   @ REPL[7]:3
Reason: unsupported use of an undefined name (use of 'pow')

Note: with the setup above, I can confirm that #173 is actually fixed by the latest GPUCompiler, so this is a new issue.

Idea: rename safe logging macros

None of GPUCompiler's code should be switching tasks, so maybe we should replace the Base logging macros by the safe variants from GPUCompiler.

Manual compile cache

For container purposes, it would be useful to serialize the compilation cache and query it based on the types of the arguments instead of the usual 265-based approach.
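A minimal sketch of what such a manually managed cache could look like, assuming compiled artifacts can be represented as plain bytes. The names `cache_key`, `lookup!` and `fake_compile` are hypothetical and not part of GPUCompiler; the "compiler" here is only a placeholder.

```julia
using Serialization

# Hypothetical key: the function name plus a textual form of the argument types,
# so the key stays meaningful across Julia sessions.
cache_key(f::Symbol, tt::Type{<:Tuple}) = (f, string(tt))

# Look up an entry by function name and argument types; compile only on a miss.
function lookup!(compile, cache::Dict, f::Symbol, tt::Type{<:Tuple})
    get!(() -> compile(f, tt), cache, cache_key(f, tt))
end

# Placeholder "compiler" (assumption): returns bytes standing in for device code.
fake_compile(f, tt) = Vector{UInt8}("code for $f with $tt")

cache = Dict{Tuple{Symbol,String},Vector{UInt8}}()
lookup!(fake_compile, cache, :kernel, Tuple{Int,Float32})

# Serialize the cache to disk and load it again, e.g. when building a container.
path = tempname()
serialize(path, cache)
restored = deserialize(path)
```

Note that a cache queried purely by argument types does not notice method redefinitions, which is exactly the trade-off this idea accepts.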

World age behavior different from Base

Wanted behavior

Given the following function, which defines a module and function at run time, and tries to call them:

function main()
    arr = zeros(Int)

    @gensym mod_name
    mod = @eval(Main, module $mod_name end)
    @eval mod begin
        function kernel(ptr)
            ptr[] = 1
            return
        end
    end

    mod.kernel(arr)
    @assert Array(arr)[] == 1
end

This results in a MethodError due to world age issues:

ERROR: MethodError: no method matching kernel(::Array{Int64, 0})
The applicable method may be too new: running in world age 29740, while current world is 29744.
Closest candidates are:
  kernel(::Any) at /home/tim/Julia/pkg/CUDA/wip.jl:9 (method too new to be called from this world context.)

To call this function, we need invokelatest.
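For reference, a sketch of the CPU version with `Base.invokelatest`. The name `main_latest` is only used for illustration; the lookup of `kernel` is also done in the latest world, on the assumption that newer Julia versions may apply world ages to global bindings as well.

```julia
function main_latest()
    arr = zeros(Int)

    @gensym mod_name
    mod = @eval(Main, module $mod_name end)
    @eval mod begin
        function kernel(ptr)
            ptr[] = 1
            return
        end
    end

    # Both the binding lookup and the call happen in the latest world,
    # not in the world that `main_latest` started running in.
    kernel = Base.invokelatest(getfield, mod, :kernel)
    Base.invokelatest(kernel, arr)
    return arr[]
end
```

Calling `main_latest()` should return 1 without a MethodError.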


Inconsistency 1

If we replace this by the CUDA equivalent, we can call this function perfectly fine:

function main()
    arr = CuArray(zeros(Int))

    @gensym mod_name
    mod = @eval(Main, module $mod_name end)
    @eval mod begin
        function kernel(ptr)
            ptr[] = 1
            return
        end
    end

    @cuda mod.kernel(arr)
    @assert Array(arr)[] == 1
end

This shouldn't be possible, and we should throw a MethodError instead.

Inconsistency 2

If we redefine the method in that module, our method invalidation breaks:

function main()
    # same as before

    arr = CuArray(zeros(Int))

    @gensym mod_name
    mod = @eval(Main, module $mod_name end)
    @eval mod begin
        function kernel(ptr)
            ptr[] = 1
            return
        end
    end

    @cuda mod.kernel(arr)
    @assert Array(arr)[] == 1


    # redefine

    @eval mod begin
        function kernel(ptr)
            ptr[] = 2
            return
        end
    end

    @cuda mod.kernel(arr)
    @assert Array(arr)[] == 2
end
ERROR: AssertionError: (Array(arr))[] == 2

To fix the first inconsistency, I think we need to use a proper world age in the generated function that looks up the MethodInstance to compile, i.e., get rid of the typemax(UInt) here:

@generated function cached_compilation(cache::Dict,
                                       job::CompilerJob{<:Any,<:Any,FunctionSpec{f,tt}},
                                       compiler::Function, linker::Function) where {f,tt}
    # get a hold of the method and code info of the kernel function
    sig = Tuple{f, tt.parameters...}
    mthds = _methods_by_ftype(sig, -1, typemax(UInt))

However, we can't just use get_world_counter() there, as this generator runs in a newer world where the kernel has been defined already. Maybe we need an intrinsic that bakes the world age at inference time into the AST, to pass to cached_compilation such that it can look up the method in the correct world? @vtjnash, any thoughts?
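To make the role of the world argument concrete, here is a small sketch (not GPUCompiler's code) that looks up the same signature in two different worlds. `demo_kernel` is a made-up function, and `Base._methods_by_ftype` is an internal API whose behavior may change between Julia versions.

```julia
# World counter before the method exists.
world_before = Base.get_world_counter()

# Defining a method advances the world counter.
@eval demo_kernel() = nothing

world_after = Base.get_world_counter()

sig = Tuple{typeof(demo_kernel)}

# Looking up the signature in the old world finds nothing;
# looking it up in the new world finds the method.
visible_before = Base._methods_by_ftype(sig, -1, world_before)
visible_after  = Base._methods_by_ftype(sig, -1, world_after)

println("matches in old world: ", length(visible_before))
println("matches in new world: ", length(visible_after))
```

Passing typemax(UInt) as the world skips this distinction, which is presumably why the @cuda version above can call a method that is too new.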

For the second inconsistency, or issue really, I'm not sure what's happening here. @vchuravy, you looked at this last time, any ideas?

Multithreaded compilation needs to take codegen lock

 signal (11): Segmentation fault
in expression starting at /builds/JuliaGPU/CUDAnative.jl/test/device/execution.jl:1
llvm::ValueHandleBase::AddToUseList() at /usr/local/bin/../lib/julia/libLLVM-8jl.so (unknown line)
llvm::BranchProbabilityInfo::setEdgeProbability(llvm::BasicBlock const*, unsigned int, llvm::BranchProbability) at /usr/local/bin/../lib/julia/libLLVM-8jl.so (unknown line)
llvm::BranchProbabilityInfo::calcUnreachableHeuristics(llvm::BasicBlock const*) at /usr/local/bin/../lib/julia/libLLVM-8jl.so (unknown line)
llvm::BranchProbabilityInfo::calculate(llvm::Function const&, llvm::LoopInfo const&, llvm::TargetLibraryInfo const*) at /usr/local/bin/../lib/julia/libLLVM-8jl.so (unknown line)
llvm::BranchProbabilityInfoWrapperPass::runOnFunction(llvm::Function&) at /usr/local/bin/../lib/julia/libLLVM-8jl.so (unknown line)
llvm::FPPassManager::runOnFunction(llvm::Function&) at /usr/local/bin/../lib/julia/libLLVM-8jl.so (unknown line)
llvm::FPPassManager::runOnModule(llvm::Module&) at /usr/local/bin/../lib/julia/libLLVM-8jl.so (unknown line)
llvm::legacy::PassManagerImpl::run(llvm::Module&) at /usr/local/bin/../lib/julia/libLLVM-8jl.so (unknown line)
operator() at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:356
addModule at /buildworker/worker/package_linux64/build/usr/include/llvm/ExecutionEngine/Orc/IRCompileLayer.h:84 [inlined]
addModule at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:485
jl_add_to_ee at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:747 [inlined]
jl_finalize_function at /buildworker/worker/package_linux64/build/src/jitlayers.cpp:755
getAddressForFunction at /buildworker/worker/package_linux64/build/src/codegen.cpp:1414
jl_generate_fptr at /buildworker/worker/package_linux64/build/src/codegen.cpp:1510
jl_compile_method_internal at /buildworker/worker/package_linux64/build/src/gf.c:1912
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2153 [inlined]
jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2165
launch at /builds/JuliaGPU/CUDAnative.jl/.julia/packages/CUDAdrv/YK1gX/src/execution.jl:85 [inlined]
#662 at /builds/JuliaGPU/CUDAnative.jl/.julia/packages/CUDAdrv/YK1gX/src/execution.jl:164 [inlined]
macro expansion at /builds/JuliaGPU/CUDAnative.jl/.julia/packages/CUDAdrv/YK1gX/src/execution.jl:125 [inlined]
convert_arguments at /builds/JuliaGPU/CUDAnative.jl/.julia/packages/CUDAdrv/YK1gX/src/execution.jl:108 [inlined]
#cudacall#661 at /builds/JuliaGPU/CUDAnative.jl/.julia/packages/CUDAdrv/YK1gX/src/execution.jl:163 [inlined]
cudacall at /builds/JuliaGPU/CUDAnative.jl/.julia/packages/CUDAdrv/YK1gX/src/execution.jl:163 [inlined]
#cudacall#142 at /builds/JuliaGPU/CUDAnative.jl/src/execution.jl:261 [inlined]
cudacall at /builds/JuliaGPU/CUDAnative.jl/src/execution.jl:258 [inlined]
macro expansion at /builds/JuliaGPU/CUDAnative.jl/src/execution.jl:242 [inlined]
#call#130 at /builds/JuliaGPU/CUDAnative.jl/src/execution.jl:219 [inlined]
call at /builds/JuliaGPU/CUDAnative.jl/src/execution.jl:219 [inlined]
#_#145 at /builds/JuliaGPU/CUDAnative.jl/src/execution.jl:388 [inlined]
HostKernel at /builds/JuliaGPU/CUDAnative.jl/src/execution.jl:388 [inlined]
macro expansion at /builds/JuliaGPU/CUDAnative.jl/src/execution.jl:158 [inlined]
macro expansion at /builds/JuliaGPU/CUDAnative.jl/test/device/execution.jl:1069 [inlined]
JuliaGPU/CUDAnative.jl#627#threadsfor_fun at ./threadingconstructs.jl:61
JuliaGPU/CUDAnative.jl#627#threadsfor_fun at ./threadingconstructs.jl:28
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2144 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2322
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1700 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:687
unknown function (ip: (nil))
Allocations: 117779425 (Pool: 117751846; Big: 27579); GC: 128

Happens now and then on CI. Probably due to modifying the LLVMContext concurrently with Julia. Should probably take Julia's codegen_lock, if feasible (there's a C exported accessor).
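As an illustration of the general idea only: the sketch below serializes a "compilation" step across threads with a package-level `ReentrantLock`. This is an assumption-laden stand-in; it does not take Julia's internal codegen lock, and `compile_serialized`, `compile_lock` and `compile_log` are hypothetical names.

```julia
# Package-level lock standing in for Julia's codegen lock (assumption).
const compile_lock = ReentrantLock()

# Shared state that must not be mutated concurrently; stands in for the LLVM context.
const compile_log = String[]

function compile_serialized(name::String)
    lock(compile_lock) do
        # Only one task at a time gets here.
        push!(compile_log, name)
    end
    return nothing
end

# Launch "compilations" from multiple threads, as the failing test does.
Threads.@threads for i in 1:64
    compile_serialized("kernel_$i")
end
```

A lock owned by GPUCompiler alone would not protect against Julia's own JIT compiling host code at the same time, which is why the real fix would need the lock that Julia itself uses.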

Call has wrong number of parameters\nptxas fatal

Today I ran into an error when using CuArrays, and I wonder whether something is wrong:

ERROR: LoadError: CUDAdrv.CuError(CUDAdrv.cudaError_enum(0x000000da), "ptxas application ptx input, line 380; error   : Call has wrong number of parameters\nptxas fatal   : Ptx assembly aborted due to errors")
Stacktrace:
 [1] CUDAdrv.CuModule(::String, ::Dict{CUDAdrv.CUjit_option_enum,Any}) at /home/gzhang8/.julia/packages/CUDAdrv/Uc14X/src/module.jl:40
 [2] _cufunction(::GPUCompiler.FunctionSpec{GPUArrays.var"#26#27",Tuple{CuArrays.CuKernelContext,CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Float32}}}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/gzhang8/.julia/packages/CUDAnative/e0IdN/src/execution.jl:335
 [3] _cufunction at /home/gzhang8/.julia/packages/CUDAnative/e0IdN/src/execution.jl:302 [inlined]
 [4] #77 at /home/gzhang8/.julia/packages/GPUCompiler/bwcs0/src/cache.jl:21 [inlined]
 [5] get!(::GPUCompiler.var"#77#78"{Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},typeof(CUDAnative._cufunction),GPUCompiler.FunctionSpec{GPUArrays.var"#26#27",Tuple{CuArrays.CuKernelContext,CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Float32}}}}}, ::Dict{UInt64,Any}, ::UInt64) at ./dict.jl:452
 [6] macro expansion at ./lock.jl:183 [inlined]
 [7] check_cache(::typeof(CUDAnative._cufunction), ::GPUCompiler.FunctionSpec{GPUArrays.var"#26#27",Tuple{CuArrays.CuKernelContext,CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Float32}}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/gzhang8/.julia/packages/GPUCompiler/bwcs0/src/cache.jl:19
 [8] + at ./int.jl:53 [inlined]
 [9] hash_64_64 at ./hashing.jl:35 [inlined]
 [10] hash_uint64 at ./hashing.jl:62 [inlined]
 [11] hx at ./float.jl:568 [inlined]
 [12] hash at ./float.jl:571 [inlined]
 [13] cached_compilation(::typeof(CUDAnative._cufunction), ::GPUCompiler.FunctionSpec{GPUArrays.var"#26#27",Tuple{CuArrays.CuKernelContext,CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Float32}}}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/gzhang8/.julia/packages/GPUCompiler/bwcs0/src/cache.jl:0
 [14] cached_compilation(::Function, ::GPUCompiler.FunctionSpec{GPUArrays.var"#26#27",Tuple{CuArrays.CuKernelContext,CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Float32}}}}, ::UInt64) at /home/gzhang8/.julia/packages/GPUCompiler/bwcs0/src/cache.jl:37
 [15] cufunction(::Function, ::Type; name::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/gzhang8/.julia/packages/CUDAnative/e0IdN/src/execution.jl:296
 [16] macro expansion at /home/gzhang8/.julia/packages/CUDAnative/e0IdN/src/execution.jl:108 [inlined]
 [17] gpu_call(::CuArrays.CuArrayBackend, ::Function, ::Tuple{CuArrays.CuArray{Float32,1,Nothing},Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(identity),Tuple{Float32}}}, ::Int64; name::String) at /home/gzhang8/.julia/packages/CuArrays/l0gXB/src/gpuarrays.jl:32
 [18] #gpu_call#1 at /home/gzhang8/.julia/packages/GPUArrays/OXvxB/src/device/execution.jl:61 [inlined]
 [19] copyto! at /home/gzhang8/.julia/packages/GPUArrays/OXvxB/src/host/broadcast.jl:63 [inlined]
 [20] copyto! at /home/gzhang8/.julia/packages/GPUArrays/OXvxB/src/host/broadcast.jl:75 [inlined]

I also tried running the GPUCompiler tests, and one of them fails as well.

test GPUCompiler
    Testing GPUCompiler
Downloading artifact: LLVM
######################################################################## 100.0%
    Status `/tmp/jl_P2HEMf/Manifest.toml`
  [fa961155] CEnum v0.2.0
  [da1fd8a2] CodeTracking v0.5.11
  [f68482b8] Cthulhu v1.1.1
  [864edb3b] DataStructures v0.17.16
  [61eb1bfa] GPUCompiler v0.2.0
  [929cbde3] LLVM v1.4.1
  [86de99a1] LLVM_jll v8.0.1+0
  [bac558e1] OrderedCollections v1.2.0
  [a759f4b9] TimerOutputs v0.5.5
  [2a0f44e3] Base64 
  [ade2ca70] Dates 
  [8ba89e20] Distributed 
  [b77e0a4c] InteractiveUtils 
  [76f85450] LibGit2 
  [8f399da3] Libdl 
  [56ddb016] Logging 
  [d6f4376e] Markdown 
  [44cfe95a] Pkg 
  [de0858da] Printf 
  [3fa0cd96] REPL 
  [9a3f8284] Random 
  [ea8e919c] SHA 
  [9e88b42a] Serialization 
  [6462fe0b] Sockets 
  [8dfed614] Test 
  [cf7118a7] UUIDs 
  [4ec0a83e] Unicode 
GC and TLS lowering: Test Failed at /home/gzhang8/.julia/packages/GPUCompiler/bwcs0/test/ptx.jl:241
  Expression: !(occursin("gpu_gc_pool_alloc", asm))
   Evaluated: !(occursin("gpu_gc_pool_alloc", "//\n// Generated by LLVM NVPTX Back-End\n//\n\n.version 6.0\n.target sm_70\n.address_size 64\n\n\t// .globl\tjulia_ref_kernel_18360 // -- Begin function julia_ref_kernel_18360\n.func  (.param .b64 func_retval0) gpu_gc_pool_alloc\n(\n\t.param .b64 gpu_gc_pool_alloc_param_0\n)\n;\n.global .align 1 .b8 exception[10] = {101, 120, 99, 101, 112, 116, 105, 111, 110, 0};\n                                        // @julia_ref_kernel_18360\n.visible .func julia_ref_kernel_18360(\n\t.param .b64 julia_ref_kernel_18360_param_0,\n\t.param .b64 julia_ref_kernel_18360_param_1\n)\n{\n\t.reg .pred \t%p<2>;\n\t.reg .b64 \t%rd<10>;\n\n// %bb.0:                               // %top\n\tld.param.u64 \t%rd1, [julia_ref_kernel_18360_param_0];\n\tmov.u64 \t%rd2, 8;\n\t{ // callseq 34, 0\n\t.reg .b32 temp_param_reg;\n\t.param .b64 param0;\n\tst.param.b64 \t[param0+0], %rd2;\n\t.param .b64 retval0;\n\tcall.uni (retval0), \n\tgpu_gc_pool_alloc, \n\t(\n\tparam0\n\t);\n\tld.param.b64 \t%rd3, [retval0+0];\n\t} // callseq 34\n\tld.param.u64 \t%rd5, [julia_ref_kernel_18360_param_1];\n\tsetp.lt.s64 \t%p1, %rd5, 2;\n\tselp.b64 \t%rd6, 2, 1, %p1;\n\tst.u64 \t[%rd3], %rd6;\n\tshl.b64 \t%rd7, %rd5, 3;\n\tadd.s64 \t%rd8, %rd1, %rd7;\n\tmov.u64 \t%rd9, 0;\n\tst.u8 \t[%rd8+-1], %rd9;\n\tst.u8 \t[%rd8+-2], %rd9;\n\tst.u8 \t[%rd8+-3], %rd9;\n\tst.u8 \t[%rd8+-4], %rd9;\n\tst.u8 \t[%rd8+-5], %rd9;\n\tst.u8 \t[%rd8+-6], %rd9;\n\tst.u8 \t[%rd8+-7], %rd9;\n\tst.u8 \t[%rd8+-8], %rd6;\n\tret;\n                                        // -- End function\n}\n.func  (.param .b64 func_retval0) gpu_malloc(\n\t.param .b64 gpu_malloc_param_0\n)                                       // -- Begin function gpu_malloc\n                                        // @gpu_malloc\n{\n\t.reg .b64 \t%rd<2>;\n\n// %bb.0:                               // %top\n\tmov.u64 \t%rd1, 0;\n\tst.param.b64 \t[func_retval0+0], %rd1;\n\tret;\n                                        // -- End 
function\n}\n.func gpu_report_exception(\n\t.param .b64 gpu_report_exception_param_0\n)                                       // -- Begin function gpu_report_exception\n                                        // @gpu_report_exception\n{\n\n\n// %bb.0:                               // %top\n\tret;\n                                        // -- End function\n}\n.func gpu_report_oom(\n\t.param .b64 gpu_report_oom_param_0\n)                                       // -- Begin function gpu_report_oom\n                                        // @gpu_report_oom\n{\n\n\n// %bb.0:                               // %top\n\tret;\n                                        // -- End function\n}\n.func gpu_signal_exception()            // -- Begin function gpu_signal_exception\n                                        // @gpu_signal_exception\n{\n\n\n// %bb.0:                               // %top\n\tret;\n                                        // -- End function\n}\n.func  (.param .b64 func_retval0) gpu_gc_pool_alloc(\n\t.param .b64 gpu_gc_pool_alloc_param_0\n)                                       // -- Begin function gpu_gc_pool_alloc\n                                        // @gpu_gc_pool_alloc\n{\n\t.reg .pred \t%p<2>;\n\t.reg .b64 \t%rd<6>;\n\n// %bb.0:                               // %top\n\tld.param.u64 \t%rd2, [gpu_gc_pool_alloc_param_0];\n\t{ // callseq 35, 0\n\t.reg .b32 temp_param_reg;\n\t.param .b64 param0;\n\tst.param.b64 \t[param0+0], %rd2;\n\t.param .b64 retval0;\n\tcall.uni (retval0), \n\tgpu_malloc, \n\t(\n\tparam0\n\t);\n\tld.param.b64 \t%rd3, [retval0+0];\n\t} // callseq 35\n\tsetp.ne.s64 \t%p1, %rd3, 0;\n\t@%p1 bra \tLBB5_2;\n// %bb.1:                               // %L7\n\t{ // callseq 36, 0\n\t.reg .b32 temp_param_reg;\n\t.param .b64 param0;\n\tst.param.b64 \t[param0+0], %rd2;\n\tcall.uni \n\tgpu_report_oom, \n\t(\n\tparam0\n\t);\n\t} // callseq 36\n\tmov.u64 \t%rd4, exception;\n\tcvta.global.u64 \t%rd5, %rd4;\n\t{ // callseq 37, 0\n\t.reg .b32 
temp_param_reg;\n\t.param .b64 param0;\n\tst.param.b64 \t[param0+0], %rd5;\n\tcall.uni \n\tgpu_report_exception, \n\t(\n\tparam0\n\t);\n\t} // callseq 37\n\t{ // callseq 38, 0\n\t.reg .b32 temp_param_reg;\n\tcall.uni \n\tgpu_signal_exception, \n\t(\n\t);\n\t} // callseq 38\n\t// begin inline asm\n\texit;\n\t// end inline asm\nLBB5_2:                                 // %L10\n\tst.param.b64 \t[func_retval0+0], %rd3;\n\tret;\n                                        // -- End function\n}\n\n"))
Stacktrace:
 [1] top-level scope at /home/gzhang8/.julia/packages/GPUCompiler/bwcs0/test/ptx.jl:241
 [2] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Test/src/Test.jl:1113
 [3] top-level scope at /home/gzhang8/.julia/packages/GPUCompiler/bwcs0/test/ptx.jl:204
 [4] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Test/src/Test.jl:1113
 [5] top-level scope at /home/gzhang8/.julia/packages/GPUCompiler/bwcs0/test/ptx.jl:79
 [6] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Test/src/Test.jl:1113
 [7] top-level scope at /home/gzhang8/.julia/packages/GPUCompiler/bwcs0/test/ptx.jl:3
Test Summary:                      | Pass  Fail  Total
GPUCompiler                        |   72     1     73
  native                           |   39           39
  PTX                              |   33     1     34
    IR                             |   18           18

I tried different versions of CUDA and the driver (440 10.2, 435 10.1, 41x 10.0). I also tried Julia 1.3 and 1.4. All of them have this problem. Any clue? Thank you in advance.

GPUCompiler doesn't work on `aarch64-apple-darwin`

julia> using LLVMExtra_jll

julia> LLVMExtra_jll.is_available()
true

julia> LLVMExtra_jll.libLLVMExtra
"/Users/mose/.julia/artifacts/548b26fac5815e9981ab5c77077a1bd8db562747/lib/libLLVMExtra-12.dylib"

julia> LLVMExtra_jll.libLLVMExtra_handle
Ptr{Nothing} @0x000000012b03c590

julia> filter(l -> occursin("LLVMExtra", l), readlines(`nm $(LLVMExtra_jll.libLLVMExtra_path)`))
14-element Vector{String}:
 "000000000000b430 T _LLVMExtraAddGenericAnalysisPasses"
 "000000000000b7d0 T _LLVMExtraAddNamedMetadataOperand2"
 "000000000000b2f8 T _LLVMExtraAppendToCompilerUsed"
 "000000000000b1c0 T _LLVMExtraAppendToUsed"
 "000000000000b508 T _LLVMExtraDIScopeGetName"
 "000000000000b52c T _LLVMExtraDumpMetadata"
 "000000000000b6e8 T _LLVMExtraGetMDNodeNumOperands2"
 "000000000000b6f0 T _LLVMExtraGetMDNodeOperands2"
 "000000000000b6bc T _LLVMExtraGetMDString2"
 "000000000000b774 T _LLVMExtraGetNamedMetadataNumOperands2"
 "000000000000b778 T _LLVMExtraGetNamedMetadataOperands2"
 "000000000000b55c T _LLVMExtraPrintMetadataToString"
 "000000000000b7d4 T _LLVMExtraSetInitializer"
 "000000000000b7d8 T _LLVMExtraSetPersonalityFn"

julia> using GPUCompiler
ERROR: InitError: could not load symbol "LLVMExtraInitializeAllTargets":
dlsym(RTLD_DEFAULT, LLVMExtraInitializeAllTargets): symbol not found
Stacktrace:
 [1] LLVMInitializeAllTargets
   @ ~/.julia/packages/LLVM/srSVa/lib/libLLVM_extra.jl:10 [inlined]
 [2] InitializeAllTargets
   @ ~/.julia/packages/LLVM/srSVa/src/init.jl:58 [inlined]
 [3] __init__()
   @ GPUCompiler ~/.julia/packages/GPUCompiler/XwWPj/src/GPUCompiler.jl:50
 [4] _include_from_serialized(path::String, depmods::Vector{Any})
   @ Base ./loading.jl:768
 [5] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String)
   @ Base ./loading.jl:854
 [6] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1097
 [7] require(uuidkey::Base.PkgId)
   @ Base ./loading.jl:1013
 [8] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:997
during initialization of module GPUCompiler

julia> versioninfo()
Julia Version 1.8.0-DEV.151
Commit 6240d352f1 (2021-07-07 19:24 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.3.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.0 (ORCJIT, cyclone)

This also means that no package depending on GPUCompiler.jl can be loaded, including CUDA.jl.

The package loads when running Julia with Rosetta, but nm finds the same symbols in the x86_64 library.

Aarch64: `Closures are not supported on this platform`

(Issue based on this slack discussion)

Running the following MWE with the current master of CUDA.jl throws the error "cfunction: closures are not supported on this platform":

using Flux
input = rand(Float32, 50,50,1,3) |> gpu
conv = Conv((2,1), 1=>4) |> gpu
conv(input)

Additional information:

Full stacktrace:

ERROR: cfunction: closures are not supported on this platform
Stacktrace:
  [1] compile_method_instance(job::GPUCompiler.CompilerJob, method_instance::Core.MethodInstance)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/eJOtJ/src/jlgen.jl:325
  [2] macro expansion
    @ ~/.julia/packages/TimerOutputs/4QAIk/src/TimerOutput.jl:206 [inlined]
  [3] irgen(job::GPUCompiler.CompilerJob, method_instance::Core.MethodInstance)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/eJOtJ/src/irgen.jl:4
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/eJOtJ/src/driver.jl:142 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/TimerOutputs/4QAIk/src/TimerOutput.jl:206 [inlined]
  [6] macro expansion
    @ ~/.julia/packages/GPUCompiler/eJOtJ/src/driver.jl:141 [inlined]
  [7] emit_llvm(job::GPUCompiler.CompilerJob, method_instance::Any, world::UInt64; libraries::Bool, deferred_codegen::Bool, optimize::Bool, only_entry::Bool)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/eJOtJ/src/utils.jl:62
  [8] emit_llvm(job::GPUCompiler.CompilerJob, method_instance::Any, world::UInt64)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/eJOtJ/src/utils.jl:60
  [9] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/3VnCC/src/compiler/execution.jl:300
 [10] check_cache
    @ ~/.julia/packages/GPUCompiler/eJOtJ/src/cache.jl:47 [inlined]
 [11] cached_compilation
    @ ~/.julia/packages/GPUArrays/4n0iS/src/host/broadcast.jl:57 [inlined]
 [12] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#16", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 4, 1}, Base.Broadcast.Broadcasted{Nothing, NTuple{4, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}}}}, Int64}}}, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/eJOtJ/src/cache.jl:0
 [13] cufunction(f::GPUArrays.var"#broadcast_kernel#16", tt::Type{Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 4, 1}, Base.Broadcast.Broadcasted{Nothing, NTuple{4, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}}}}, Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/3VnCC/src/compiler/execution.jl:289
 [14] cufunction
    @ ~/.julia/packages/CUDA/3VnCC/src/compiler/execution.jl:283 [inlined]
 [15] macro expansion
    @ ~/.julia/packages/CUDA/3VnCC/src/compiler/execution.jl:102 [inlined]
 [16] #launch_heuristic#286
    @ ~/.julia/packages/CUDA/3VnCC/src/gpuarrays.jl:17 [inlined]
 [17] launch_heuristic
    @ ~/.julia/packages/CUDA/3VnCC/src/gpuarrays.jl:17 [inlined]
 [18] copyto!
    @ ~/.julia/packages/GPUArrays/4n0iS/src/host/broadcast.jl:63 [inlined]
 [19] copyto!
    @ ./broadcast.jl:936 [inlined]
 [20] copy
    @ ~/.julia/packages/GPUArrays/4n0iS/src/host/broadcast.jl:47 [inlined]
 [21] materialize
    @ ./broadcast.jl:883 [inlined]
 [22] (::Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4}, CUDA.CuArray{Float32, 1}})(x::CUDA.CuArray{Float32, 4})
    @ Flux ~/.julia/packages/Flux/6o4DQ/src/layers/conv.jl:157
 [23] top-level scope
    @ REPL[62]:1

Versioninfo:

Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (aarch64-unknown-linux-gnu)
  CPU: unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cortex-a57)
Environment:
  JULIA_EDITOR = "/home/sdobber/.vscode-server/bin/cfa2e218100323074ac1948c885448fdf4de2a7f/node"
  JULIA_NUM_THREADS = 

CUDA Versioninfo:

CUDA toolkit 10.2.89, local installation
CUDA driver 10.2.0

Libraries: 
- CUBLAS: 10.2.2
- CURAND: 10.1.2
- CUFFT: 10.1.2
- CUSOLVER: 10.3.0
- CUSPARSE: 10.3.1
- CUPTI: 12.0.0
- NVML: missing
- CUDNN: 8.0.0 (for CUDA 10.2.0)
┌ Warning: Could not find or load CUTENSOR; run with JULIA_DEBUG=CUDA for more details.
└ @ CUDA ~/.julia/packages/CUDA/3VnCC/deps/bindeps.jl:409
- CUTENSOR: missing

Toolchain:
- Julia: 1.6.1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5
- Device support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

1 device:
  0: NVIDIA Tegra X1 (sm_53, 541.133 MiB / 3.863 GiB available)

Stack overflow during invalidation

error in invalidation callback: StackOverflowError()
lookup_typevalue at /buildworker/worker/package_linux64/build/src/jltypes.c:598
jl_inst_arg_tuple_type at /buildworker/worker/package_linux64/build/src/jltypes.c:1500
jl_f_tuple at /buildworker/worker/package_linux64/build/src/builtins.c:743
iterate at ./array.jl:777 [inlined]
iterate at ./array.jl:777
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:82
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
invalidate at /home/tim/Julia/depot/packages/GPUCompiler/QjFdA/src/jlgen.jl:83
[... the same `invalidate` frame at jlgen.jl:83 repeats many more times ...]
unknown function (ip: 0x7f20a0645e38)
_require at ./loading.jl:998

cc @vchuravy

Hitting the Julia cache with GPUCompiler on 1.5-dev

using GPUCompiler

# the GPU runtime library
module TestRuntime
    # dummy methods
    signal_exception() = return
    malloc(sz) = return
    report_oom(sz) = return
    report_exception(ex) = return
    report_exception_name(ex) = return
    report_exception_frame(idx, func, file, line) = return
end

struct NativeTestCompilerTarget <: CompositeCompilerTarget
    parent::NativeCompilerTarget

    NativeTestCompilerTarget() = new(NativeCompilerTarget())
end

Base.parent(target::NativeTestCompilerTarget) = target.parent

struct NativeTestCompilerJob <: CompositeCompilerJob
    parent::AbstractCompilerJob
end

GPUCompiler.runtime_module(target::NativeTestCompilerTarget) = TestRuntime

NativeTestCompilerJob(target::AbstractCompilerTarget, source::FunctionSpec; kwargs...) =
    NativeTestCompilerJob(NativeCompilerJob(target, source; kwargs...))

Base.similar(job::NativeTestCompilerJob, source::FunctionSpec; kwargs...) =
    NativeTestCompilerJob(similar(job.parent, source; kwargs...))

Base.parent(job::NativeTestCompilerJob) = job.parent

function obtain(func, types, kernel=false)
  source = FunctionSpec(func, Base.to_tuple_type(types), kernel)
  target = NativeTestCompilerTarget()
  job = NativeTestCompilerJob(target, source)
  return GPUCompiler.codegen(:llvm, job)
end

Calling the function first changes compilation

julia> g(x) = cos(x)
g (generic function with 1 method)

julia> mod, _ = obtain(g, (Float64,));

julia> g(1.0)
0.5403023058681398

julia> mod2, _ = obtain(g, (Float64,));
ERROR: InvalidIRError: compiling function g(Float64) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jfptr_cos_980)
Reason: unsupported call to the Julia runtime (call to malloc)
Stacktrace:
 [1] malloc at /home/vchuravy/src/GPUCompiler/src/runtime.jl:91
 [2] gc_pool_alloc at /home/vchuravy/src/GPUCompiler/src/runtime.jl:133
Reason: unsupported call to an unknown function (call to report_oom)
Stacktrace:
 [1] report_oom at /home/vchuravy/src/GPUCompiler/src/runtime.jl:91
 [2] gc_pool_alloc at /home/vchuravy/src/GPUCompiler/src/runtime.jl:135
Stacktrace:
 [1] check_ir(::NativeTestCompilerJob, ::LLVM.Module) at /home/vchuravy/src/GPUCompiler/src/validation.jl:118
 [2] macro expansion at /home/vchuravy/src/GPUCompiler/src/driver.jl:185 [inlined]
 [3] macro expansion at /home/vchuravy/.julia/packages/TimerOutputs/7Id5J/src/TimerOutput.jl:228 [inlined]
 [4] codegen(::Symbol, ::NativeTestCompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, strict::Bool) at /home/vchuravy/src/GPUCompiler/src/driver.jl:183
 [5] codegen at /home/vchuravy/src/GPUCompiler/src/driver.jl:60 [inlined]
 [6] obtain(::Function, ::Tuple{DataType}, ::Bool) at /home/vchuravy/src/GPUCompiler/test.jl:40
 [7] obtain(::Function, ::Tuple{DataType}) at /home/vchuravy/src/GPUCompiler/test.jl:37
 [8] top-level scope at REPL[4]:1

Bypassing the verification check shows which functions we are emitting:

julia> using LLVM
julia> GPUCompiler.check_ir(job::NativeTestCompilerJob, args...) = nothing
julia> mod, _ = obtain(g, (Float64,));

julia> name.(LLVM.functions(mod))
10-element Array{String,1}:
 "julia_g_1158"
 "j_cos_1159"
 "tojlinvoke1160"
 "jfptr_cos_304"
 "llvm.trap"
 "malloc"
 "gpu_report_exception"
 "gpu_signal_exception"
 "gpu_gc_pool_alloc"
 "report_oom"

This shows that instead of getting julia_cos, we are emitting a trampoline j_cos that calls tojlinvoke.

define internal fastcc %jl_value_t addrspace(10)* @tojlinvoke1160(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) unnamed_addr #1 {
top:
  %3 = call %jl_value_t addrspace(10)* @jfptr_cos_304(%jl_value_t addrspace(10)* %0, %jl_value_t addrspace(10)** %1, i32 %2, %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140159746853232 to %jl_value_t*) to %jl_value_t addrspace(10)*))
  ret %jl_value_t addrspace(10)* %3
}
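
A quick way to spot such trampolines without reading the IR is to filter the function names listed above. The sketch below is plain Julia and works on the name list copied from this report. Treating the `tojlinvoke` and `jfptr_` prefixes as markers of a dynamic-invoke trampoline is an assumption based on the names seen here, and `is_trampoline` is a hypothetical helper, not part of GPUCompiler or LLVM.jl.

```julia
# Function names copied from the `name.(LLVM.functions(mod))` output above.
fnames = [
    "julia_g_1158", "j_cos_1159", "tojlinvoke1160", "jfptr_cos_304",
    "llvm.trap", "malloc", "gpu_report_exception", "gpu_signal_exception",
    "gpu_gc_pool_alloc", "report_oom",
]

# Hypothetical helper: flag names that look like invoke trampolines.
# The prefixes are an assumption taken from the names in this report.
is_trampoline(name::AbstractString) =
    startswith(name, "tojlinvoke") || startswith(name, "jfptr_")

trampolines = filter(is_trampoline, fnames)
println(trampolines)
```

In a live session the literal list could presumably be replaced by the result of `name.(LLVM.functions(mod))`, as in the transcript above.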

Slow kernel compilation

IIRC it used to be 200us or so, and @vtjnash mentioned Base compilation taking 1ms or so for the simplest functions. This is pretty slow then:

julia> kernel() = return
kernel (generic function with 1 method)

julia> include("pkg/GPUCompiler/test/definitions/native.jl")
native_code_execution (generic function with 1 method)

julia> job, kwargs = native_job(kernel, Tuple{})
(CompilerJob of function kernel() for NativeCompilerTarget, Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}())

julia> @benchmark GPUCompiler.compile(:asm, job)
BenchmarkTools.Trial: 
  memory estimate:  389.70 KiB
  allocs estimate:  7019
  --------------
  minimum time:     7.027 ms (0.00% GC)
  median time:      7.784 ms (0.00% GC)
  mean time:        7.805 ms (0.28% GC)
  maximum time:     11.175 ms (24.24% GC)
  --------------
  samples:          640
  evals/sample:     1
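
For tracking this number without BenchmarkTools, a rough timing loop is enough. The following is a minimal sketch: `time_ms` and `workload` are hypothetical names, and `workload` is a stand-in so the snippet runs on its own. In the session above it would be replaced by `() -> GPUCompiler.compile(:asm, job)`.

```julia
using Statistics  # for `median`

# Hypothetical helper: call `f` repeatedly and report the minimum and
# median wall-clock time in milliseconds.
function time_ms(f; samples::Int=20)
    f()  # warm-up call, so Julia's own JIT compilation of `f` is not measured
    times = [1e3 * @elapsed(f()) for _ in 1:samples]
    return (minimum = minimum(times), median = median(times))
end

# Stand-in workload (an assumption for illustration only).
workload() = sum(sqrt, 1:10_000)

result = time_ms(workload)
println(result)
```

The minimum is usually the more stable figure to compare across runs, since GC pauses and other noise only ever add time.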

Unsupported `jl_apply_generic` IR pattern

using Test

using CUDAdrv, CUDAnative

function vadd(a::AbstractArray{T}) where {T}
    Base.unsafe_store!(unsafe_convert(CUDAnative.DevicePtr{T,AS.Global}, C_NULL), 1, 1, Val(1))
    return
end

function main()
    tt = Tuple{CuDeviceArray{Float32,1,AS.Global}}
    CUDAnative.code_sass(vadd, tt)
end
┌ Warning: Decoding arguments to jl_apply_generic failed, please file a bug with a reproducer.
│   inst =   %11 = call nonnull %jl_value_t addrspace(10)* @jl_apply_generic(%jl_value_t addrspace(10)* %8, %jl_value_t addrspace(10)** nonnull %.sub, i32 2), !dbg !33
│   bb =
│    
│    julia_vadd_38.exit:                               ; preds = %found.i, %err.i
│      store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140613733290384 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %.sub, align 8, !dbg !33
│      %10 = getelementptr inbounds [4 x %jl_value_t addrspace(10)*], [4 x %jl_value_t addrspace(10)*]* %1, i64 0, i64 1, !dbg !33
│      store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140613799821008 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %10, align 8, !dbg !33
│      %11 = call nonnull %jl_value_t addrspace(10)* @jl_apply_generic(%jl_value_t addrspace(10)* %8, %jl_value_t addrspace(10)** nonnull %.sub, i32 2), !dbg !33
│      store %jl_value_t addrspace(10)* %11, %jl_value_t addrspace(10)** %.sub, align 8, !dbg !33
│      store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140613704188000 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %10, align 8, !dbg !33
│      %12 = getelementptr inbounds [4 x %jl_value_t addrspace(10)*], [4 x %jl_value_t addrspace(10)*]* %1, i64 0, i64 2, !dbg !33
│      store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140613704188000 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %12, align 8, !dbg !33
│      %13 = getelementptr inbounds [4 x %jl_value_t addrspace(10)*], [4 x %jl_value_t addrspace(10)*]* %1, i64 0, i64 3, !dbg !33
│      store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140613798418304 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %13, align 8, !dbg !33
│      %14 = call nonnull %jl_value_t addrspace(10)* @jl_apply_generic(%jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 140613817508064 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** nonnull %.sub, i32 4), !dbg !33
│      ret void
│    
└ @ CUDAnative ~/Julia/pkg/CUDAnative/src/compiler/validation.jl:222
ERROR: InvalidIRError: compiling vadd(CuDeviceArray{Float32,1,CUDAnative.AS.Global}) resulted in invalid LLVM IR
Reason: unsupported use of an undefined name (use of 'unsafe_convert')
Stacktrace:
 [1] vadd at /home/tbesard/Julia/wip2.jl:6
Reason: unsupported call to the Julia runtime (call to jl_apply_generic)
Stacktrace:
 [1] vadd at /home/tbesard/Julia/wip2.jl:6
Reason: unsupported dynamic function invocation (call to unsafe_store!)
Stacktrace:
 [1] vadd at /home/tbesard/Julia/wip2.jl:6
Stacktrace:
 [1] check_ir(::CUDAnative.CompilerJob, ::LLVM.Module) at /home/tbesard/Julia/pkg/CUDAnative/src/compiler/validation.jl:114
 [2] macro expansion at /home/tbesard/Julia/pkg/CUDAnative/src/compiler/driver.jl:188 [inlined]
 [3] macro expansion at /home/tbesard/Julia/pkg/depot/packages/TimerOutputs/7zSea/src/TimerOutput.jl:216 [inlined]
 [4] #codegen#130(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at /home/tbesard/Julia/pkg/CUDAnative/src/compiler/driver.jl:186
 [5] codegen at /home/tbesard/Julia/pkg/CUDAnative/src/compiler/driver.jl:57 [inlined]
 [6] code_sass(::Base.TTY, ::CUDAnative.CompilerJob) at /home/tbesard/Julia/pkg/CUDAnative/src/reflection.jl:110
 [7] #code_sass#189(::VersionNumber, ::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.code_sass), ::Base.TTY, ::Any, ::Any) at /home/tbesard/Julia/pkg/CUDAnative/src/reflection.jl:100
 [8] code_sass(::Base.TTY, ::Any, ::Any) at /home/tbesard/Julia/pkg/CUDAnative/src/reflection.jl:98
 [9] #code_sass#190(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.code_sass), ::Any, ::Any) at /home/tbesard/Julia/pkg/CUDAnative/src/reflection.jl:134
 [10] code_sass at /home/tbesard/Julia/pkg/CUDAnative/src/reflection.jl:134 [inlined]
 [11] main() at /home/tbesard/Julia/wip2.jl:12
 [12] top-level scope at REPL[5]:1
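
The first reason in the error ("use of an undefined name") is consistent with `unsafe_convert` not being exported from Base, so the unqualified call in `vadd` cannot be resolved statically and falls back to dynamic dispatch. The sketch below checks this on the host. It uses a plain `Ptr{Float32}` instead of `CUDAnative.DevicePtr`, which is an assumption made so the snippet runs without a GPU stack.

```julia
# `unsafe_convert` lives in Base but is not exported, so it has to be
# qualified (or imported explicitly) before it can be called.
exported = Base.isexported(Base, :unsafe_convert)
println("exported from Base: ", exported)

# The qualified call resolves fine. A host pointer type is used here as a
# stand-in for the device pointer type in the reproducer.
p = Base.unsafe_convert(Ptr{Float32}, C_NULL)
println(p)
```

Qualifying the call as `Base.unsafe_convert` in the reproducer should remove that first error; whether the remaining `jl_apply_generic` calls go away as well is not something this sketch establishes.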
