sciml / diffeqgpu.jl Goto Github PK

GPU-acceleration routines for DifferentialEquations.jl and the broader SciML scientific machine learning ecosystem

Home Page: https://docs.sciml.ai/DiffEqGPU/stable/

License: MIT License

Julia 100.00%

gpu gpu-parallelism differentialequations differential-equations ode sde dae dde neural-ode neural-differential-equations

diffeqgpu.jl's Introduction

DiffEqGPU

This library is a component package of the DifferentialEquations.jl ecosystem. It includes functionality for making use of GPUs in the differential equation solvers.

The two ways to accelerate ODE solvers with GPUs

There are two very different ways that one can accelerate an ODE solution with GPUs. There is one case where u is very big and f is very expensive but very structured, and you use GPUs to accelerate the computation of said f. The other use case is where u is very small but you want to solve the ODE f over many different initial conditions (u0) or parameters p. In that case, you can use GPUs to parallelize over different parameters and initial conditions. In other words:

Type of Problem	SciML Solution
Accelerate a big ODE	Use CUDA.jl's CuArray as `u0`
Solve the same ODE with many `u0` and `p`	Use DiffEqGPU.jl's `EnsembleGPUArray` and `EnsembleGPUKernel`

Supported GPUs

SciML's GPU support extends to a wide array of hardware, including:

GPU Manufacturer	GPU Kernel Language	Julia Support Package	Backend Type
NVIDIA	CUDA	CUDA.jl	`CUDA.CUDABackend()`
AMD	ROCm	AMDGPU.jl	`AMDGPU.ROCBackend()`
Intel	OneAPI	OneAPI.jl	`oneAPI.oneAPIBackend()`
Apple (M-Series)	Metal	Metal.jl	`Metal.MetalBackend()`

For this tutorial we will demonstrate the CUDA backend for NVIDIA GPUs, though any of the other GPUs can be used by simply swapping out the backend choice.

Example of Within-Method GPU Parallelism

using OrdinaryDiffEq, CUDA, LinearAlgebra
u0 = cu(rand(1000))
A = cu(randn(1000, 1000))
f(du, u, p, t) = mul!(du, A, u)
prob = ODEProblem(f, u0, (0.0f0, 1.0f0)) # Float32 is better on GPUs!
sol = solve(prob, Tsit5())

Example of Parameter-Parallelism with GPU Ensemble Methods

using DiffEqGPU, CUDA, OrdinaryDiffEq, StaticArrays

function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = @SVector [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()),
    trajectories = 10_000, adaptive = false, dt = 0.1f0)

Benchmarks

Curious about our claims? See https://github.com/utkarsh530/GPUODEBenchmarks for comparsion of our GPU solvers against CPUs and GPUs implementation in C++, JAX and PyTorch.

Citation

If you are using DiffEqGPU.jl in your work, consider citing our paper:

@article{utkarsh2024automated,
  title={Automated translation and accelerated solving of differential equations on multiple GPU platforms},
  author={Utkarsh, Utkarsh and Churavy, Valentin and Ma, Yingbo and Besard, Tim and Srisuma, Prakitr and Gymnich, Tim and Gerlach, Adam R and Edelman, Alan and Barbastathis, George and Braatz, Richard D and others},
  journal={Computer Methods in Applied Mechanics and Engineering},
  volume={419},
  pages={116591},
  year={2024},
  publisher={Elsevier}
}

diffeqgpu.jl's People

Contributors

Stargazers

Watchers

diffeqgpu.jl's Issues

GPU-accelerated parareal

https://ieeexplore.ieee.org/document/8248220
https://pdfs.semanticscholar.org/767f/904f0c459b3184a2c29fc05488bfd4d4aceb.pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.723.4562&rep=rep1&type=pdf
https://hal.archives-ouvertes.fr/hal-01510888/document
http://eprints.whiterose.ac.uk/90543/1/1409.8563v2.pdf
https://www.sciencedirect.com/science/article/pii/S1877050910000815

Ensemble simulations produce solutions with different dimensionality between CPU and GPU (vec's the solution)

Solving a matrix differential equation yields a matrix solution (as expected) on a CPU, but on a GPU a vector solution is returned.

Suppose that the following is executed on the REPL:

using LinearAlgebra, DifferentialEquations, DiffEqGPU

dim = 2;
t0 = 0.0f0;
tf = 1.0f0;
steps = 11;
jobs = Float32[-1.0 1.0 1.0 1.0 0.0 0.0; -2.0 1.0 1.0 1.0 0.0 0.0];

tstep = (tf-t0)/(steps-1)

function du!(du, u, p, t)
    @inbounds begin
        du[1,1] = 1;#lower_d[1,1];
        du[1,2] = 1;#lower_d[1,2];
        du[2,1] = 1;#lower_d[2,1];
        du[2,2] = 1;#lower_d[2,2];
    end
    nothing
end

# Create the initial state
u0 = zeros(Complex{Float32},dim,dim);
u0[1,1] = 1;

prob = ODEProblem{true}(du!,u0,(t0,tf),jobs[1,:]);

# Define how the problem is modified for each job
# i defines the trajectory number in the ensembleproblem
# repeat is true if the problem is being repeated
function prob_func(prob,i,repeat)
    return remake(prob,p=jobs[i,:])
end

ensemble_prob = EnsembleProblem(prob,
                                prob_func=prob_func,
                                safetycopy=false)

The system is then solved on the CPU with:

sol = solve(ensemble_prob,
                   Tsit5(),
                   trajectories=size(jobs,1),
                   saveat=tstep,
                   dense=false,
                   save_everystep=false,
                   abstol=1e-10,
                   reltol=1e-7,
                   maxiters=1e5);

Inspecting an element of the solution yields:

julia> sol.u[1]
retcode: Success
Interpolation: 1st order linear
t: 11-element Array{Float32,1}:
 0.0
 0.1
 0.2
 0.3
 0.4
 0.5
 0.6
 0.7
 0.8
 0.9
 1.0
u: 11-element Array{Array{Complex{Float32},2},1}:
 [1.0f0 + 0.0f0im 0.0f0 + 0.0f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im]
 [1.1000003f0 + 0.0f0im 0.1f0 + 0.0f0im; 0.1f0 + 0.0f0im 0.1f0 + 0.0f0im]
 [1.2000004f0 + 0.0f0im 0.2f0 + 0.0f0im; 0.2f0 + 0.0f0im 0.2f0 + 0.0f0im]
 [1.3000004f0 + 0.0f0im 0.29999998f0 + 0.0f0im; 0.29999998f0 + 0.0f0im 0.29999998f0 + 0.0f0im]
 [1.4000005f0 + 0.0f0im 0.40000004f0 + 0.0f0im; 0.40000004f0 + 0.0f0im 0.40000004f0 + 0.0f0im]
 [1.5000004f0 + 0.0f0im 0.5f0 + 0.0f0im; 0.5f0 + 0.0f0im 0.5f0 + 0.0f0im]
 [1.6000003f0 + 0.0f0im 0.6f0 + 0.0f0im; 0.6f0 + 0.0f0im 0.6f0 + 0.0f0im]
 [1.7000003f0 + 0.0f0im 0.7f0 + 0.0f0im; 0.7f0 + 0.0f0im 0.7f0 + 0.0f0im]
 [1.8000002f0 + 0.0f0im 0.79999995f0 + 0.0f0im; 0.79999995f0 + 0.0f0im 0.79999995f0 + 0.0f0im]
 [1.9000002f0 + 0.0f0im 0.8999999f0 + 0.0f0im; 0.8999999f0 + 0.0f0im 0.8999999f0 + 0.0f0im]
 [2.0000002f0 + 0.0f0im 1.0f0 + 0.0f0im; 1.0f0 + 0.0f0im 1.0f0 + 0.0f0im]

u0 and du are matrices of Complex{Float32}, as are the elements of the solution array.

Running this on the GPU by changing ensemblealg yields:

julia> sol = solve(ensemble_prob,
                   Tsit5(), EnsembleGPUArray(),
                   trajectories=size(jobs,1),
                   saveat=tstep,
                   dense=false,
                   save_everystep=false,
                   abstol=1e-10,
                   reltol=1e-7,
                   maxiters=1e5);

julia> sol.u[1]
retcode: Success
Interpolation: 1st order linear
t: 11-element Array{Float32,1}:
 0.0
 0.1
 0.2
 0.3
 0.4
 0.5
 0.6
 0.7
 0.8
 0.9
 1.0
u: 11-element Array{SubArray{Complex{Float32},1,Array{Complex{Float32},2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true},1}:
 [1.0f0 + 0.0f0im, 0.0f0 + 0.0f0im]
 [1.1000003f0 + 0.0f0im, 0.1f0 + 0.0f0im]
 [1.2000004f0 + 0.0f0im, 0.2f0 + 0.0f0im]
 [1.3000004f0 + 0.0f0im, 0.29999998f0 + 0.0f0im]
 [1.4000005f0 + 0.0f0im, 0.40000004f0 + 0.0f0im]
 [1.5000004f0 + 0.0f0im, 0.5f0 + 0.0f0im]
 [1.6000003f0 + 0.0f0im, 0.6f0 + 0.0f0im]
 [1.7000003f0 + 0.0f0im, 0.7f0 + 0.0f0im]
 [1.8000002f0 + 0.0f0im, 0.79999995f0 + 0.0f0im]
 [1.9000002f0 + 0.0f0im, 0.8999999f0 + 0.0f0im]
 [2.0000002f0 + 0.0f0im, 1.0f0 + 0.0f0im]

The elements of the output seem to be an array of vectors, not matrices.

every solution in the runtest example is exactly the same

Using the lorenz setup in the runtest file I noticed that all of the solutions are the same.
sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0) filter(x -> x.u != sol.u[1].u, sol.u) # 0 element array

Continuous Callback speed

These two findall calls end up doing scalar indexing and thus are tremendously slow on the GPU. We need a better way to do those operations.

https://github.com/JuliaDiffEq/DiffEqBase.jl/blob/v6.3.0/src/callbacks.jl#L304-L313

Event handling

In order to handle events in this framework we will need to setup event conversions. ContinuousCallbacks should transform into a VectorContinuousCallback which does a rootfind per individual problem and then triggers the event for the one that triggers. DiscreteCallbacks should check the condition on each and then apply only on those that are satisfied each time. The latter can be done by making a DiscreteCallback that always fires but uses a cache array for handling the affect! application

References

https://arxiv.org/pdf/1611.02274.pdf

GPU Parallel Ensemble Simulations with generic linear algebra

Hello,

I'm trying to solve and ODE with my GPU. I tested the example code present in the DiffEqGPU.js repository, and it worked fine.

I have the following function to implement in the ODE Problem

function drho_dt(rho, p, t)
    A, w_l= p
    return ( L + A * sin(w_l * t) * L_t ) * rho
end

or equivalently

function drho_dt(drho, rho, p, t)
    A, w_l = p
    mul!(drho, L + A * sin(w_l * t) * L_t, rho)
    return nothing
end

where A and w_l are two numbers, while L and L_t are two global matrices. Following the documentation, I used the following code to run a Parallel Ensemble:

A = 0.05
w_l = 1.0

p = [A, w_l]

tspan = (0.0, 10.0 / gam_q)

w_l_l = LinRange{Float64}(0.1, 1.9, 10)

function prob_func(prob,i,repeat)
  remake(prob,p=[prob.p[1], w_l_l[i]])
end

prob = ODEProblem(drho_dt, rho0_vec, tspan, p)
ensemble_prob = EnsembleProblem(prob, prob_func=prob_func, safetycopy=false)
@time sim = solve(ensemble_prob, BS3(), METHOD!!, trajectories=length(w_l_l))

where METHOD!! can be one of the methods, e.g. EnsembleSerial(), EnsembleThreads(), EnsembleCPUArray() or EnsembleGPUArray().

The problem is divided in two cases:

When L and L_t are sparse matrices. In this case all the methods work, except for EnsembleGPUArray(). In this case the error is

InvalidIRError: compiling kernel gpu_gpu_kernel_oop(Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel_oop), typeof(drho_dt), CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{Float64, 1}, Float64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] overdub
   @ ./In[8]:62
 [2] macro expansion
   @ ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:25
 [3] overdub
   @ ~/.julia/packages/KernelAbstractions/Yy47c/src/macros.jl:80
 [4] overdub
   @ ~/.julia/packages/Cassette/N5kbV/src/overdub.jl:0
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:27
 [2] overdub
   @ ~/.julia/packages/KernelAbstractions/Yy47c/src/macros.jl:80
 [3] overdub
   @ ~/.julia/packages/Cassette/N5kbV/src/overdub.jl:0

Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Cassette.overdub), Tuple{Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel_oop), typeof(drho_dt), CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{Float64, 1}, Float64}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/wHjqZ/src/validation.jl:111
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/wHjqZ/src/driver.jl:319 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/ZQ0rt/src/TimerOutput.jl:236 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/wHjqZ/src/driver.jl:317 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/wHjqZ/src/utils.jl:62
  [6] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/02Kjq/src/compiler/execution.jl:317
  [7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/wHjqZ/src/cache.jl:89
  [8] cufunction(f::typeof(Cassette.overdub), tt::Type{Tuple{Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel_oop), typeof(drho_dt), CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{Float64, 1}, Float64}}; name::String, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/02Kjq/src/compiler/execution.jl:288
  [9] macro expansion
    @ ~/.julia/packages/CUDA/02Kjq/src/compiler/execution.jl:102 [inlined]
 [10] (::KernelAbstractions.Kernel{CUDAKernels.CUDADevice, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.gpu_gpu_kernel_oop)})(::Function, ::Vararg{Any, N} where N; ndrange::Int64, dependencies::CUDAKernels.CudaEvent, workgroupsize::Int64, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/8wtKq/src/CUDAKernels.jl:194
 [11] SciML/DifferentialEquations.jl#55
    @ ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:414 [inlined]
 [12] ODEFunction
    @ ~/.julia/packages/SciMLBase/oTP8b/src/scimlfunctions.jl:334 [inlined]
 [13] initialize!(integrator::OrdinaryDiffEq.ODEIntegrator{Tsit5, true, CuArray{ComplexF64, 2}, Nothing, Float64, CuArray{Float64, 2}, Float64, Float64, Float64, Float64, Vector{CuArray{ComplexF64, 2}}, ODESolution{ComplexF64, 3, Vector{CuArray{ComplexF64, 2}}, Nothing, Nothing, Vector{Float64}, Vector{Vector{CuArray{ComplexF64, 2}}}, ODEProblem{CuArray{ComplexF64, 2}, Tuple{Float64, Float64}, true, CuArray{Float64, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, Tsit5, OrdinaryDiffEq.InterpolationData{ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{CuArray{ComplexF64, 2}}, Vector{Float64}, Vector{Vector{CuArray{ComplexF64, 2}}}, OrdinaryDiffEq.Tsit5Cache{CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}}}, DiffEqBase.DEStats}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, OrdinaryDiffEq.Tsit5Cache{CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}}, OrdinaryDiffEq.DEOptions{Float64, Float64, Float64, Float64, PIController{Rational{Int64}}, typeof(DiffEqGPU.diffeqgpunorm), typeof(opnorm), Nothing, CallbackSet{Tuple{}, Tuple{}}, typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), DiffEqGPU.var"#12#18", DataStructures.BinaryMinHeap{Float64}, DataStructures.BinaryMinHeap{Float64}, Nothing, Nothing, Int64, Tuple{}, Tuple{}, Tuple{}}, CuArray{ComplexF64, 2}, ComplexF64, Nothing, OrdinaryDiffEq.DefaultInit}, cache::OrdinaryDiffEq.Tsit5Cache{CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/perform_step/low_order_rk_perform_step.jl:623
 [14] __init(prob::ODEProblem{CuArray{ComplexF64, 2}, Tuple{Float64, Float64}, true, CuArray{Float64, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::Tsit5, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}, recompile::Type{Val{true}}; saveat::Tuple{}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Nothing, callback::Nothing, dense::Bool, calck::Bool, dt::Float64, dtmin::Nothing, dtmax::Float64, force_dtmin::Bool, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmin::Rational{Int64}, qmax::Int64, qsteady_min::Int64, qsteady_max::Int64, beta1::Nothing, beta2::Nothing, qoldinit::Rational{Int64}, controller::Nothing, fullnormalize::Bool, failfactor::Int64, maxiters::Int64, internalnorm::typeof(DiffEqGPU.diffeqgpunorm), internalopnorm::typeof(opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::DiffEqGPU.var"#12#18", verbose::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias_u0::Bool, alias_du0::Bool, initializealg::OrdinaryDiffEq.DefaultInit, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/solve.jl:456
 [15] #__solve#466
    @ ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/solve.jl:4 [inlined]
 [16] #solve_call#58
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:61 [inlined]
 [17] solve_up(prob::ODEProblem{CuArray{ComplexF64, 2}, Tuple{Float64, Float64}, true, CuArray{Float64, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, sensealg::Nothing, u0::CuArray{ComplexF64, 2}, p::CuArray{Float64, 2}, args::Tsit5; kwargs::Base.Iterators.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:unstable_check, :callback, :merge_callbacks, :internalnorm), Tuple{DiffEqGPU.var"#12#18", Nothing, Bool, typeof(DiffEqGPU.diffeqgpunorm)}}})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:82
 [18] #solve#59
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:70 [inlined]
 [19] batch_solve_up(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::Tsit5, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}, u0::Matrix{ComplexF64}, p::Matrix{Float64}; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:319
 [20] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:284
 [21] macro expansion
    @ ./timing.jl:287 [inlined]
 [22] __solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleGPUArray; trajectories::Int64, batch_size::Int64, unstable_check::Function, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:201
 [23] #solve#61
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:96 [inlined]
 [24] top-level scope
    @ ./timing.jl:210 [inlined]
 [25] top-level scope
    @ ./In[9]:0
 [26] eval
    @ ./boot.jl:360 [inlined]
 [27] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094

When L and L_t are dense matrices. In this case not even EnsembleCPUArray() works, but the others yes. The error for the GPU method should be the same. While the error for the CPU one is

TaskFailedException

    nested task error: conversion to pointer not defined for Base.Experimental.Const{ComplexF64, 2}
    Stacktrace:
      [1] error(::String)
        @ ./error.jl:33 [inlined]
      [2] overdub
        @ ./error.jl:33 [inlined]
      [3] unsafe_convert(::Type{Ptr{ComplexF64}}, ::Base.Experimental.Const{ComplexF64, 2})
        @ ./pointer.jl:67 [inlined]
      [4] overdub
        @ ./pointer.jl:67 [inlined]
      [5] unsafe_convert(::Type{Ptr{ComplexF64}}, ::SubArray{ComplexF64, 1, Base.Experimental.Const{ComplexF64, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
        @ ./subarray.jl:427 [inlined]
      [6] overdub
        @ ./subarray.jl:427 [inlined]
      [7] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/blas.jl:704 [inlined]
      [8] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:544 [inlined]
      [9] mul!(::Vector{ComplexF64}, ::Matrix{ComplexF64}, ::SubArray{ComplexF64, 1, Base.Experimental.Const{ComplexF64, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, ::Bool, ::Bool)
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:66 [inlined]
     [10] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:66 [inlined]
     [11] mul!(::Vector{ComplexF64}, ::Matrix{ComplexF64}, ::SubArray{ComplexF64, 1, Base.Experimental.Const{ComplexF64, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:275 [inlined]
     [12] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:275 [inlined]
     [13] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:51 [inlined]
     [14] overdub(::Cassette.Context{nametype(CPUCtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.NoDynamicCheck, CartesianIndex{1}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, ::typeof(*), ::Matrix{ComplexF64}, ::SubArray{ComplexF64, 1, Base.Experimental.Const{ComplexF64, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
        @ Cassette ~/.julia/packages/Cassette/N5kbV/src/overdub.jl:0
     [15] overdub
        @ ./In[11]:62 [inlined]
     [16] macro expansion
        @ ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:25 [inlined]
     [17] overdub
        @ ~/.julia/packages/KernelAbstractions/Yy47c/src/macros.jl:265 [inlined]
     [18] __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.cpu_gpu_kernel_oop)}, ndrange::Tuple{Int64}, iterspace::KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}, args::Tuple{typeof(drho_dt), Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{Float64}, Float64}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
        @ KernelAbstractions ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:157
     [19] __run(obj::KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.cpu_gpu_kernel_oop)}, ndrange::Tuple{Int64}, iterspace::KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}, args::Tuple{typeof(drho_dt), Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{Float64}, Float64}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
        @ KernelAbstractions ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:130
     [20] (::KernelAbstractions.var"#33#34"{Tuple{KernelAbstractions.NoneEvent}, Nothing, typeof(KernelAbstractions.__run), Tuple{KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.cpu_gpu_kernel_oop)}, Tuple{Int64}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}, Tuple{typeof(drho_dt), Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{Float64}, Float64}, KernelAbstractions.NDIteration.NoDynamicCheck}})()
        @ KernelAbstractions ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:22

Stacktrace:
  [1] wait
    @ ./task.jl:322 [inlined]
  [2] wait
    @ ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:65 [inlined]
  [3] wait
    @ ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:64 [inlined]
  [4] (::DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)})(du::Matrix{ComplexF64}, u::Matrix{ComplexF64}, p::Matrix{Float64}, t::Float64)
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:414
  [5] ODEFunction
    @ ~/.julia/packages/SciMLBase/oTP8b/src/scimlfunctions.jl:334 [inlined]
  [6] initialize!(integrator::OrdinaryDiffEq.ODEIntegrator{Tsit5, true, Matrix{ComplexF64}, Nothing, Float64, Matrix{Float64}, Float64, Float64, Float64, Float64, Vector{Matrix{ComplexF64}}, ODESolution{ComplexF64, 3, Vector{Matrix{ComplexF64}}, Nothing, Nothing, Vector{Float64}, Vector{Vector{Matrix{ComplexF64}}}, ODEProblem{Matrix{ComplexF64}, Tuple{Float64, Float64}, true, Matrix{Float64}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, Tsit5, OrdinaryDiffEq.InterpolationData{ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{Matrix{ComplexF64}}, Vector{Float64}, Vector{Vector{Matrix{ComplexF64}}}, OrdinaryDiffEq.Tsit5Cache{Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{ComplexF64}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}}}, DiffEqBase.DEStats}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, OrdinaryDiffEq.Tsit5Cache{Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{ComplexF64}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}}, OrdinaryDiffEq.DEOptions{Float64, Float64, Float64, Float64, PIController{Rational{Int64}}, typeof(DiffEqGPU.diffeqgpunorm), typeof(opnorm), Nothing, CallbackSet{Tuple{}, Tuple{}}, typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), DiffEqGPU.var"#12#18", DataStructures.BinaryMinHeap{Float64}, DataStructures.BinaryMinHeap{Float64}, Nothing, Nothing, Int64, Tuple{}, Tuple{}, Tuple{}}, Matrix{ComplexF64}, ComplexF64, Nothing, OrdinaryDiffEq.DefaultInit}, cache::OrdinaryDiffEq.Tsit5Cache{Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{ComplexF64}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/perform_step/low_order_rk_perform_step.jl:623
  [7] __init(prob::ODEProblem{Matrix{ComplexF64}, Tuple{Float64, Float64}, true, Matrix{Float64}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::Tsit5, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}, recompile::Type{Val{true}}; saveat::Tuple{}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Nothing, callback::Nothing, dense::Bool, calck::Bool, dt::Float64, dtmin::Nothing, dtmax::Float64, force_dtmin::Bool, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmin::Rational{Int64}, qmax::Int64, qsteady_min::Int64, qsteady_max::Int64, beta1::Nothing, beta2::Nothing, qoldinit::Rational{Int64}, controller::Nothing, fullnormalize::Bool, failfactor::Int64, maxiters::Int64, internalnorm::typeof(DiffEqGPU.diffeqgpunorm), internalopnorm::typeof(opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::DiffEqGPU.var"#12#18", verbose::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias_u0::Bool, alias_du0::Bool, initializealg::OrdinaryDiffEq.DefaultInit, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/solve.jl:456
  [8] #__solve#466
    @ ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/solve.jl:4 [inlined]
  [9] #solve_call#58
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:61 [inlined]
 [10] solve_up(prob::ODEProblem{Matrix{ComplexF64}, Tuple{Float64, Float64}, true, Matrix{Float64}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, sensealg::Nothing, u0::Matrix{ComplexF64}, p::Matrix{Float64}, args::Tsit5; kwargs::Base.Iterators.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:unstable_check, :callback, :merge_callbacks, :internalnorm), Tuple{DiffEqGPU.var"#12#18", Nothing, Bool, typeof(DiffEqGPU.diffeqgpunorm)}}})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:82
 [11] #solve#59
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:70 [inlined]
 [12] batch_solve_up(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::Tsit5, ensemblealg::EnsembleCPUArray, I::UnitRange{Int64}, u0::Matrix{ComplexF64}, p::Matrix{Float64}; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:319
 [13] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleCPUArray, I::UnitRange{Int64}; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:284
 [14] macro expansion
    @ ./timing.jl:287 [inlined]
 [15] __solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleCPUArray; trajectories::Int64, batch_size::Int64, unstable_check::Function, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:201
 [16] #solve#61
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:96 [inlined]
 [17] top-level scope
    @ ./timing.jl:210 [inlined]
 [18] top-level scope
    @ ./In[11]:0
 [19] eval
    @ ./boot.jl:360 [inlined]
 [20] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094

EnsembleMPGOS

@FerencHegedus @nnagyd have been putting a lot of work into tuning ensemble GPU parallel CUDA kernel explicit solvers for non-stiff ODEs so we should probably should make use of that. Our GPU stack can be used to generate device code from Julia functions, so that fixes what would probably be the most major hurdle. @FerencHegedus, could you help suggest a way to wrap this, like what level should we interact with the library that would be most beneficial? I think we can have a fixed driver code a la

https://github.com/FerencHegedus/Massively-Parallel-GPU-ODE-Solver/blob/master/Tutorials_SingleSystem_PerThread/T2_Lorenz/Lorenz.cu

that we pass things into, and then generate and compile the system code

https://github.com/FerencHegedus/Massively-Parallel-GPU-ODE-Solver/blob/master/Tutorials_SingleSystem_PerThread/T2_Lorenz/Lorenz_SystemDefinition.cuh

@vchuravy might need to help out here, and we might need to get MPGOS onto binary builder for this to work.

Optimize Wfact!_t_kernel

Working with the following "application":

using DiffEqBase, CuArrays, DiffEqGPU, CUDAdrv, NVTX
@inline DiffEqBase.ODE_DEFAULT_NORM(u::CuArray, t) = sqrt(real(sum(abs2.(u))) / length(u))

using OrdinaryDiffEq
using ModelingToolkit

pre = "$(@__DIR__)/PfizerModels/Leucine/juliacode/src"
include("$pre/assets/model.jl")
include("$pre/assets/utils.jl")
include("$pre/assets/jac.jl")

function tgrad(T,u,p,t)
    T .= 0
end

function makeprob(u0, p, setup, tspan = (0f0, 1439.9f0))
    model = BCAA(bcaa_metc_l, setup)
    _prob = ODEProblem(model, u0, tspan, p)
    de = modelingtoolkitize(_prob)
    prob = remake(_prob, f=ODEFunction(model, jac=jac, tgrad=tgrad))
end

function main(;N=3000, profile=false)
    u0 = [0.1200,0.1900,0.1200,0.0300,0.0385,0.0060, 0, 0,0.0090,0, 0, 0, 0,0.2880]
    p = [174.138432437717,1.91596338444106,3.45820080665769,17.9318377207339,
        6.13845610726550e-06,0.189227379600121,5.53336565859735e-05,
        0.119551107201354,0.0927596768936607,0.566466180074604,25.3458101362269,
        0.000337350787686603,8.57661263373792e05,10.1526260953229,0.687797528338173,
        0.00219764585496342,0.513837957607215,0.000951294266478801,
        0.000431586761580820,0.000112535302019660,35.7984144023875,
        0.0602424082459188,0.140189155737861,0.0308074873227543,1.04514780842042,
        0.0803038079375722,3.62926904542732,10.0085592733504,0.00161738187172177,
        0.000851233336164325,22.6644054038921,0.000207500339633419,0.922925408003345,
        0.212643379412943,0.0332819230028071,121.382236788010,23.7869507541515,
        0.350073941143997,0.00709619517451934,0.0240008835846888,0.185259412493853,
        2137.11186749182,2.49767191527481,1.90765399214515,1.77560190111152,
        0.00703258883973330,4.79595131460171,4.48203433161904,0.00227233514285374,
        0.408039934840477,0.0331408109662421,0.0148602852990811,0.405965675417341,
        2.05071468941456e-05,6.12240696132630,9.92424056410858,833.101713187030]
    setup = (0,     0,     0,     0,     3,     1)
    prob = makeprob(u0, p, setup)
    pfunc(prob,i,repeat) = remake(prob,p=(randn(57)/10 .+ 1).* prob.p)
    ensembleprob = EnsembleProblem(prob,prob_func = pfunc)

    if profile
        solve(ensembleprob,Rodas5(linsolve=LinSolveGPUSplitFactorize()),EnsembleGPUArray(),abstol=1e-4,reltol=1e-5,dt=1.0,trajectories=100)
        NVTX.@activate CUDAdrv.@profile solve(ensembleprob,Rodas5(linsolve=LinSolveGPUSplitFactorize()),EnsembleGPUArray(),abstol=1e-4,reltol=1e-5,dt=1.0,trajectories=N)
    else
        @time solve(ensembleprob,Rodas5(),EnsembleThreads(),abstol=1e-4,reltol=1e-5,dt=1.0,trajectories=N)
        @time solve(ensembleprob,Rodas5(linsolve=LinSolveGPUSplitFactorize()),EnsembleGPUArray(),abstol=1e-4,reltol=1e-5,dt=1.0,trajectories=N)
    end

    return
end

isinteractive() || main(profile=true)

I observed that with large trajectory numbers, the Wfact!_t_kernel kernel started dominating GPU execution. This is a heavy kernel, with a low occupancy due to high register pressure. However, the kernel is made up of two parts: calculating the Jacobian, and performing LU factorization. The first part is register heavy, but executes quickly. The second part is much more compute intensive, while being constrained by the high register requirement of the Jacobian calculation. So it's beneficial to split the kernel into two parts:

# high register usage, but low execution time
function Wfact!_t_kernel_1(jac,W,u,p,gamma,t)
    len = size(u,1)
    @loop for i in (1:size(u,2); (blockIdx().x-1) * blockDim().x + threadIdx().x)
        _W = @inbounds @view(W[:, :, i])

        # Compute the Jacobian
        @views @inbounds jac(_W,u[:,i],p[:,i],t)
        @inbounds for i in 1:len
            _W[i, i] = -inv(gamma) + _W[i, i]
        end
        nothing
    end
    return nothing
end

# low register usage, but high execution time
function Wfact!_t_kernel_2(jac,W,u,p,gamma,t)
    len = size(u,1)
    @loop for i in (1:size(u,2); (blockIdx().x-1) * blockDim().x + threadIdx().x)
        _W = @inbounds @view(W[:, :, i])

        # Compute the lufact!
        generic_lufact!(_W, len)
        nothing
    end
    return nothing
end

Appropriately adjusting the launch configuration and invocation, while using vchuravy/GPUifyLoops.jl#100 to make more efficient use of the GPU, this improves run-time by about 10%:

# after
julia> main(N=15000)
  7.564607 seconds (68.91 M allocations: 3.980 GiB, 16.82% gc time)
 10.248540 seconds (15.86 M allocations: 2.679 GiB, 4.60% gc time)

# before
julia> main(N=15000)
  7.598313 seconds (68.90 M allocations: 3.979 GiB, 17.33% gc time)
 11.234530 seconds (17.09 M allocations: 2.689 GiB, 4.21% gc time)

However, LU factorization is something that has already been implemented in CUSOLVER and CUBLAS, and there's even a batched interface! So let's replace the second kernel altogether:

julia> main(N=15000)
  6.868722 seconds (68.88 M allocations: 3.978 GiB, 14.86% gc time)
  5.041791 seconds (9.34 M allocations: 2.385 GiB, 8.66% gc time)

Since this is very performance sensitive code though, we can't use the wrappers in CuArrays which apparently display several inefficiencies. Luckily we have all the low-level APIs readily available:

using CUDAdrv: CuPtr, CU_NULL, Mem, CuDefaultStream
using CuArrays: CUBLAS

_Wfact!_t = let jac=probs[1].f.jac
    function (W,u,p,gamma,t)
        version = u isa CuArray ? CUDA() : CPU()
        @launch version Wfact!_t_kernel_1(jac,W,u,p,gamma,t)

        # original, slow kernel
        # @launch version Wfact!_t_kernel_2(jac,W,u,p,gamma,t)

        # CUBLAS batched interface
        # CuArrays.CUBLAS.getrf_batched!([view(W, :, :, i) for i in 1:size(W, 3)], false)
        # slow for two reasons:
        # - for some reason the array-of-views construction takes 4ms ?!
        # - the copy within CUBLAS.device_batch is synchronizing

        NVTX.@range "lufact setup" begin
            T = eltype(W)
            batchsize = size(W, 3)
            batch = size(W)[1:2]
            @assert batch[1] == batch[2]
            A = Vector{CuPtr{T}}(undef, batchsize)
            ptr = Base.unsafe_convert(CuPtr{T}, Base.cconvert(CuPtr{T}, W))
            for i in 1:batchsize
                A[i] = ptr + (i-1) * prod(batch) * sizeof(T)
            end
            B = CuVector{eltype(A)}(undef, batchsize)
            Mem.copy!(B.buf, pointer(A), sizeof(A); async=true, stream=CuDefaultStream())
            lda = max(1, stride(W,2))
            info = CuArray{Cint}(undef, length(A))
        end
        NVTX.@range "lufact" CUBLAS.cublasSgetrfBatched(CUBLAS.handle(), batch[1], B, lda, CU_NULL, info, batchsize)
    end

Now, beyond 30000 trajectories the CPU catches up with the GPU again, so presumable there's another kernel that starts dominating the execution and would benefit from an optimization like this. I'd suggest to always try and use existing wrappers rather than re-implementing functionality, they are generally well-optimized.

Dual numbers stall

using OrdinaryDiffEq, DiffEqGPU, ForwardDiff, Test, Unitful
using CUDA
CUDA.allowscalar(false)

function lorenz(du,u,p,t)
 @inbounds begin
     du[1] = p[1]*(u[2]-u[1])
     du[2] = u[1]*(p[2]-u[3]) - u[2]
     du[3] = u[1]*u[2] - p[3]*u[3]
 end
 nothing
end

u0 = [ForwardDiff.Dual(1f0,(1.0,0.0,0.0));ForwardDiff.Dual(0f0,(0.0,1.0,0.0));ForwardDiff.Dual(0f0,(0.0,0.0,1.0))]
tspan = (0.0f0,100.0f0)
p = (10.0f0,28.0f0,8/3f0)
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func)
@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)

This worked until the CUDA update, so it's either CUDA.jl or KernelAbstractions v0.3 @maleadt @vchuravy. It doesn't error, it just completely stalls and seems to be stuck forever. But every time I kill it, it seems to be in the mapreduce kernel, so I'm going to guess that a major regression happened there?

InterruptException:
macro expansion at mapreduce.jl:173 [inlined]
mapreducedim!(::Function, ::Function, ::CuArray{Float64,3,CuArray{Float64,2,Nothing}}, ::CuArray{Float64,3,Nothing}; init::Float64) at highlevel.jl:83
(::GPUArrays.var"#mapreducedim!##kw")(::NamedTuple{(:init,),Tuple{Float64}}, ::typeof(GPUArrays.mapreducedim!), ::Function, ::Function, ::CuArray{Float64,3,CuArray{Float64,2,Nothing}}, ::CuArray{Float64,3,Nothing}) at highlevel.jl:81
macro expansion at mapreduce.jl:241 [inlined]
mapreducedim!(::Function, ::Function, ::CuArray{Float64,2,Nothing}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(abs2),Tuple{CuArray{Float64,2,Nothing}}}; init::Float64) at highlevel.jl:83
(::GPUArrays.var"#mapreducedim!##kw")(::NamedTuple{(:init,),Tuple{Float64}}, ::typeof(GPUArrays.mapreducedim!), ::Function, ::Function, ::CuArray{Float64,2,Nothing}, ::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2},Tuple{Base.OneTo{Int64},Base.OneTo{Int64}},typeof(abs2),Tuple{CuArray{Float64,2,Nothing}}}) at highlevel.jl:81
_mapreduce(::Function, ::Function, ::CuArray{Float64,2,Nothing}; dims::Function, init::Nothing) at mapreduce.jl:62
_mapreduce at mapreduce.jl:34 [inlined]
#mapreduce#25 at mapreduce.jl:28 [inlined]
mapreduce at mapreduce.jl:28 [inlined]
_sum at reducedim.jl:657 [inlined]
#sum#584 at reducedim.jl:653 [inlined]
sum at reducedim.jl:653 [inlined]
ODE_DEFAULT_NORM at init.jl:209 [inlined]
perform_step!(::OrdinaryDiffEq.ODEIntegrator{Tsit5,true,CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},Nothing,Float32,CuArray{Float32,2,Nothing},Float32,Float32,Float32,Array{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},1},ODESolution{ForwardDiff.Dual{Nothing,Float64,3},3,Array{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},1},Nothing,Nothing,Array{Float32,1},Array{Array{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},1},1},ODEProblem{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},Tuple{Float32,Float32},true,CuArray{Float32,2,Nothing},ODEFunction{true,DiffEqGPU.var"#26#30"{typeof(lorenz)},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},Tsit5,OrdinaryDiffEq.InterpolationData{ODEFunction{true,DiffEqGPU.var"#26#30"{typeof(lorenz)},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Array{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},1},Array{Float32,1},Array{Array{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},1},1},OrdinaryDiffEq.Tsit5Cache{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},OrdinaryDiffEq.Tsit5ConstantCache{Float64,Float32}}},DiffEqBase.DEStats},ODEFunction{true,DiffEqGPU.var"#26#30"{typeof(lorenz)},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},OrdinaryDiffEq.Tsit5Cache{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},OrdinaryDiffEq.Tsit5ConstantCache{Float64,Float32}},OrdinaryDiffEq.DEOptions{ForwardDiff.Dual{Nothing,Float64,3},ForwardDiff.Dual{Nothing,Float64,3},Float32,Float32,typeof(DiffEqBase.ODE_DEFAULT_NORM),typeof(LinearAlgebra.opnorm),CallbackSet{Tuple{},Tuple{}},typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN),typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE),typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK),DataStructures.BinaryHeap{Float32,DataStructures.LessThan},DataStructures.BinaryHeap{Float32,DataStructures.LessThan},Nothing,Nothing,Int64,Tuple{},Float32,Tuple{}},CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},ForwardDiff.Dual{Nothing,Float64,3},Nothing,OrdinaryDiffEq.DefaultInit}, ::OrdinaryDiffEq.Tsit5Cache{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},OrdinaryDiffEq.Tsit5ConstantCache{Float64,Float32}}, ::Bool) at low_order_rk_perform_step.jl:658
perform_step! at low_order_rk_perform_step.jl:628 [inlined]
solve!(::OrdinaryDiffEq.ODEIntegrator{Tsit5,true,CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},Nothing,Float32,CuArray{Float32,2,Nothing},Float32,Float32,Float32,Array{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},1},ODESolution{ForwardDiff.Dual{Nothing,Float64,3},3,Array{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},1},Nothing,Nothing,Array{Float32,1},Array{Array{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},1},1},ODEProblem{CuArray{ForwardDiff.Dual{Nothing,Float64,3},2,Nothing},Tuple{Float32,Float32},true,CuArray{Float32,2,Nothing},ODEFunction{true,DiffEqGPU.var"#26#30"{...

Newton-Krylov handling

In order to get Newton-Krylov working, we will need the GMRES to be GPU-compatible. When we tried this with GMRES from IterativeSolvers.jl, we ran into JuliaLinearAlgebra/IterativeSolvers.jl#251 . Instead of going the IterativeSolvers.jl route, we could make use of KrylovKit.jl, which is already GPU compatible. That said, we need to make it DiffEq compatible, which is what's being attempted in SciML/DiffEqBase.jl#345 . But since KrylovKit doesn't support preconditioners yet, it's somewhat of a dead-end given the performance tests we did with @YingboMa . We kind of need to use our ComposePreconditioner to get any good performance:

https://github.com/JuliaDiffEq/DiffEqBase.jl/blob/master/src/linear_nonlinear.jl#L144-L169

This leaves somewhat of a dead-end without upstream work on either of those packages.

Compatibility with out of place definitions

I'm having trouble generating a trivial example for Ensemble{CPU,GPU}Array. If I use the default EnsembleThreads() this runs fine, but uncommenting either of the DiffEqGPU-provided algs in my example causes failures. EnsembleGPUArray fails with unsupported dynamic function invocation (call to myode), and EnsembleCPUArray fails with LoadError: MethodError: no method matching myode(::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::SubArray{Float64,1,Array{Float64,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::SubArray{DiffEqBase.NullParameters,1,Array{DiffEqBase.NullParameters,2},Tuple{Base.Slice{Base.OneTo{Int64}},Int64},true}, ::Float64). Am I doing something wrong here?

using OrdinaryDiffEq, DiffEqGPU

myode(u, p, t) = -u

u0 = 1.
prob = ODEProblem(myode, u0, (0., 5.) )

eprob = EnsembleProblem(prob)
sols = @time solve(eprob,
                   Tsit5(),
                   #EnsembleCPUArray(),
                   #EnsembleGPUArray(),
                   trajectories=10
                   )

This is using Julia v1.2.0, DiffEqGPU v0.4.0, OrdinaryDiffEq v5.20.1.

GPU Parallel Ensemble simulation with random time span

Hi, for my application I would like to simulate a large number of trajectories (1000 to 20000) of the same system with different parameters but also with different time spans for a Monte Carlo simulation. However when attempting this, I get an error from DiffEqGPU.jl
ERROR: LoadError: AssertionError: all((p->begin #= /home/tangui/.julia/packages/DiffEqGPU/Ibo20/src/DiffEqGPU.jl:278 =# p.tspan == (probs[1]).tspan end), probs)
I understand that the function checks if all problems have the same tspan, but why is that ? I guess this has to do with some of the way the problem is broadcasted to the GPU under some fixed array shape. I think I could greatly benefit from switching to computing on GPU, especially for the reduced precision performance so DiffEqGPU seemed to be a good fit.
Many thanks !
Here is the code to reproduce the error :

using DiffEqGPU, OrdinaryDiffEq

# global variables to be set from python side through Julia "Main" namespace
N_real = 5000
N_grid = 256
T = 150+273
m87 = 1.44316060e-25
k_B = 1.38064852e-23
window = 2.5e-3
Gamma = 1.0 + 0*im
Omega13 = 1.0 + 0*im
Omega23 = 1.0 + 0*im
gamma21tilde = 1.0 + 0*im
gamma31tilde = 1.0 + 0*im
gamma32tilde = 1.0 + 0*im
waist = 1.0e-3 + 0*im
r0 = 1.0 + 0*im
x0 = ComplexF32[5/8 + 0*im, 3/8+ 0*im, 0+ 0*im, 0+ 0*im, 0+ 0*im, 0+ 0*im, 0+ 0*im, 0+ 0*im]
k = Float32(2*pi/780.241e-9)
v = 40.0f0

function choose_points()::Tuple{Int32, Int32, Int32, Int32}
    edges = Array{Tuple{Int32, Int32}, 1}(undef, 4*N_grid)
    for i in 1:N_grid
        edges[i] = (1, i)
        edges[i+N_grid] = (N_grid, i)
        edges[i+2*N_grid] = (i, 1)
        edges[i+3*N_grid] = (i, N_grid)
    end
    iinit, jinit = edges[rand(1:length(edges), 1)][1]
    ifinal, jfinal = iinit, jinit
    cdtn = (ifinal == iinit) || (jfinal == jinit)
    while cdtn
        ifinal, jfinal = edges[rand(1:length(edges), 1)][1]
        cdtn = (ifinal == iinit) || (jfinal == jinit)
    end
    return (iinit, jinit, ifinal, jfinal)
end

function draw_vz(v::Float32)::Float32
    vz = abs(2*v)
    while abs(vz) > abs(v)
        vz = randn()*sqrt(k_B*T/m87)
    end
    return vz
end

function f!(dy::Array{ComplexF32, 1}, x::Array{ComplexF32, 1},
            p::Array{ComplexF32, 1}, t::Float32)
    v, u0, u1, xinit, yinit = p[1], p[2], p[3], p[4], p[5]
    Gamma, Omega13, Omega23 = p[6], p[7], p[8]
    gamma21tilde, gamma31tilde, gamma32tilde = p[9], p[10], p[11]
    waist, r0 = p[12], p[13]
    r_sq = (xinit+u0*v*t - r0)^2 + (yinit+u1*v*t - r0)^2
    Om23 = Omega23 * exp(-r_sq/(2*waist*waist))
    Om13 = Omega13 * exp(-r_sq/(2*waist*waist))
    dy[1] = (-Gamma/2)*x[1]-(Gamma/2)*x[2]+(im*conj(Om13)/2)*x[5]-(im*Om13/2)*x[6]+Gamma/2
    dy[2] = (-Gamma/2)*x[1]-(Gamma/2)*x[2]+(im*conj(Om23)/2)*x[7]-(im*Om23/2)*x[8]+Gamma/2
    dy[3] = -gamma21tilde*x[3]+(im*conj(Om23)/2)*x[5]-(im*Om13/2)*x[8]
    dy[4] = -conj(gamma21tilde)*x[4] - (im*Om23/2)*x[6] + (im*conj(Om13)/2)*x[7]
    dy[5] = im*Om13*x[1] + (im*Om13/2)*x[2] + (im*Om23/2)*x[3] - gamma31tilde*x[5]-im*Om13/2
    dy[6] = -im*conj(Om13)*x[1]-im*(conj(Om13)/2)*x[2]-(im*conj(Om23)/2)*x[4]-conj(gamma31tilde)*x[6]+im*conj(Om13)/2
    dy[7] = (im*Om23/2)*x[1]+im*Om23*x[2]+(im*Om13/2)*x[4]-gamma32tilde*x[7] - im*Om23/2
    dy[8] = (-im*conj(Om23)/2)*x[1]-im*conj(Om23)*x[2]-(im*conj(Om13)/2)*x[3]-conj(gamma32tilde)*x[8]+im*conj(Om23)/2
end

paths = [[] for i=1:N_real]
xs = [[] for i=1:N_real]
ts = [[] for i=1:N_real]
v_perps = zeros(Float32, N_real)
function prob_func(prob, i, repeat)
    # change seed of random number generation as the solver uses multithreading
    iinit, jinit, ifinal, jfinal = choose_points()
    vz = draw_vz(v)
    v_perp = sqrt(v^2 - vz^2)
    xinit = jinit*window/N_grid
    yinit = iinit*window/N_grid
    xfinal = jfinal*window/N_grid
    yfinal = ifinal*window/N_grid
    # velocity unit vector
    if v_perp != 0
        u0 = xfinal-xinit
        u1 = yfinal-yinit
        norm = hypot(u0, u1)
        u0 /= norm
        u1 /= norm
        new_tfinal = hypot((xfinal-xinit), (yfinal-yinit))/v_perp
    else
        u0 = u1 = 0
        new_tfinal = hypot((xfinal-xinit), (yfinal-yinit))/abs(vz)
    end
    new_p = ComplexF32[v_perp + 0*im, u0 + 0*im, u1 + 0*im, xinit + 0*im, yinit + 0*im,
             Gamma, Omega13, Omega23, gamma21tilde, gamma31tilde - im*k*vz,
             gamma32tilde - im*k*vz, waist, r0]
    new_tspan = (0.0f0, Float32(new_tfinal))
    tsave = collect(LinRange{Float32}(0.0f0, new_tfinal, 1000))
    remake(prob, p=new_p, tspan=new_tspan, saveat=tsave)
end
# instantiate a problem
p = ComplexF32[1.0 + 0*im, 1.0 + 0*im, 1.0 + 0*im, 1.0 + 0*im, 1.0 + 0*im,
     Gamma, Omega13,
     Omega23, gamma21tilde,
     gamma31tilde - im*k*0.0,
     gamma32tilde - im*k*0.0, waist, r0]
tspan = (0.0f0, 1.0f0)
tsave = collect(LinRange{Float32}(0.0f0, 1.0f0, 1000))
prob = ODEProblem{true}(f!, x0, tspan, p, saveat=tsave)
ensembleprob = EnsembleProblem(prob, prob_func=prob_func)
@time sol = solve(ensembleprob, BS3(), EnsembleGPUArray(), trajectories=N_real)

color vector handling and sparse representation

The color vector handling isn't working because the sparsity pattern itself isn't given, which then means that the decompression doesn't occur. That isn't "really" an issue, since in our case, because it's block diagonal, it doesn't need to decompress. However, if the values don't end up back on the blocks, then I/gamma - J doesn't actually I/gamma the right components, since the diagonal parts get the addition while the off diagonal but actually diagonal parts do not. Thus the idea of treating the system as M*N x N, i.e. just stacking the blocks with vcat has a subtle bug if you try to do that. That would be a dense representation of the matrix, but we need to be careful since the machinery of OrdinaryDiffEq isn't setup for jac_prototype being non-square with a weird interpretation.

Instead of resorting to a non-standard array type and trying to get that to GPU compile (i.e. array of static arrays, which seems to have some interesting issues), the solution would be to define Wfact directly, which we would need to double check in DiffEq that it doesn't make any square assumption, and then we hijack the interpretation with our linear solve.

This has been a much more difficult issue than I anticipated and the current projects can all define the Jacobian function directly, so for now I'm punting on this and will return after things are working. A dense square block matrix with a block factorization is not great for memory but shouldn't give runtime issues.

Support CallbackSet

Currently the callback support directly looks for ContinuousCallback and DiscreteCallback. Instead, it should always convert to CallbackSet and to the current handling recursively on the callbacks of a CallbackSet.

Register

@JuliaRegistrator register()

Slicing causes GPUifyLoops failures

using OrdinaryDiffEq, DiffEqGPU
function lorenz2(du,u,p,t)
 @inbounds begin
     a = u[1:2]
     du[1] = u[1]
 end
 nothing
end

u0 = Float32[1.0;0.0;0.0]
tspan = (0.0f0,100.0f0)
p = (10.0f0,28.0f0,8/3f0)
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func)
@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)

@vchuravy do you know about this one?

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

EnsembleGPUArray failing when using ModelingToolkit

Hi! The same MWE from #56 is not working anymore:

using ModelingToolkit
using OrdinaryDiffEq
using DiffEqGPU
using KernelAbstractions

@parameters t σ ρ β
@variables x(t) y(t) z(t)
D = Differential(t)

eqs = [D(x) ~ σ*(y-x),
       D(y) ~ x*(ρ-z)-y,
       D(z) ~ x*y - β*z]

sys = ODESystem(eqs)

u0 = [x => 1.0f0
      y => 0.0f0
      z => 0.0f0]

p  = [σ => 10.0f0
      ρ => 28.0f0
      β => 8f0/3f0]

tspan = (0.0f0, 100.0f0)
prob = ODEProblem(sys, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = rand(Float32, 3).*prob.p)
monteprob = EnsembleProblem(prob, prob_func = prob_func)
sol = solve(monteprob, Tsit5(), EnsembleGPUArray(), trajectories = 10, saveat = 1.0f0)

ERROR: LoadError: GPU compilation of kernel gpu_gpu_kernel(Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel), ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr}}, CUDA.CuDeviceMatrix{Float32, 1}, CUDA.CuDeviceMatrix{Float32, 1}, CUDA.CuDeviceMatrix{Float32, 1}, Float32) failed
KernelError: passing and using non-bitstype argument

Argument 4 to your kernel function is of type ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr}}, which is not isbits:
  .f_oop is of type RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr} which is not isbits.
    .body is of type Expr which is not isbits.
      .head is of type Symbol which is not isbits.
      .args is of type Vector{Any} which is not isbits.
  .f_iip is of type RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr} which is not isbits.
    .body is of type Expr which is not isbits.
      .head is of type Symbol which is not isbits.
      .args is of type Vector{Any} which is not isbits.


Stacktrace:
  [1] check_invocation(job::GPUCompiler.CompilerJob, entry::LLVM.Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XwWPj/src/validation.jl:68
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/XwWPj/src/driver.jl:287 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/4QAIk/src/TimerOutput.jl:206 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/XwWPj/src/driver.jl:286 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module, kernel::LLVM.Function; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XwWPj/src/utils.jl:62
  [6] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/M4jkK/src/compiler/execution.jl:306
  [7] check_cache
    @ ~/.julia/packages/GPUCompiler/XwWPj/src/cache.jl:44 [inlined]
  [8] cached_compilation
    @ ./none:0 [inlined]
  [9] cufunction(f::typeof(Cassette.overdub), tt::Type{Tuple{Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel), ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr}}, CUDA.CuDeviceMatrix{Float32, 1}, CUDA.CuDeviceMatrix{Float32, 1}, CUDA.CuDeviceMatrix{Float32, 1}, Float32}}; name::String, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/M4jkK/src/compiler/execution.jl:294
 [10] macro expansion
    @ ~/.julia/packages/CUDA/M4jkK/src/compiler/execution.jl:102 [inlined]
 [11] (::KernelAbstractions.Kernel{CUDAKernels.CUDADevice, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.gpu_gpu_kernel)})(::Function, ::Vararg{Any, N} where N; ndrange::Int64, dependencies::CUDAKernels.CudaEvent, workgroupsize::Int64, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/94MY8/src/CUDAKernels.jl:192
 [12] (::DiffEqGPU.var"#55#59"{ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0xe09e3e39, 0x901fd863, 0xaabe4072, 0xa349f5db, 0x1e2ac5dd)}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0x255a3166, 0x06ba9a71, 0xa2d333f0, 0xa11339db, 0xc80f8c0d)}}, typeof(DiffEqGPU.gpu_kernel)})(du::CUDA.CuArray{Float32, 2}, u::CUDA.CuArray{Float32, 2}, p::CUDA.CuArray{Float32, 2}, t::Float32)
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/YMmTv/src/DiffEqGPU.jl:408
 [13] ODEFunction
    @ ~/.julia/packages/SciMLBase/9EjAY/src/scimlfunctions.jl:334 [inlined]
 [14] initialize!(integrator::OrdinaryDiffEq.ODEIntegrator{Tsit5, true, CUDA.CuArray{Float32, 2}, Nothing, Float32, CUDA.CuArray{Float32, 2}, Float32, Float32, Float32, Vector{CUDA.CuArray{Float32, 2}}, ODESolution{Float32, 3, Vector{CUDA.CuArray{Float32, 2}}, Nothing, Nothing, Vector{Float32}, Vector{Vector{CUDA.CuArray{Float32, 2}}}, ODEProblem{CUDA.CuArray{Float32, 2}, Tuple{Float32, Float32}, true, CUDA.CuArray{Float32, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0xe09e3e39, 0x901fd863, 0xaabe4072, 0xa349f5db, 0x1e2ac5dd)}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0x255a3166, 0x06ba9a71, 0xa2d333f0, 0xa11339db, 0xc80f8c0d)}}, typeof(DiffEqGPU.gpu_kernel)}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, Tsit5, OrdinaryDiffEq.InterpolationData{ODEFunction{true, DiffEqGPU.var"#55#59"{ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0xe09e3e39, 0x901fd863, 0xaabe4072, 0xa349f5db, 0x1e2ac5dd)}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0x255a3166, 0x06ba9a71, 0xa2d333f0, 0xa11339db, 0xc80f8c0d)}}, typeof(DiffEqGPU.gpu_kernel)}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{CUDA.CuArray{Float32, 2}}, Vector{Float32}, Vector{Vector{CUDA.CuArray{Float32, 2}}}, OrdinaryDiffEq.Tsit5Cache{CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float32, Float32}}}, DiffEqBase.DEStats}, ODEFunction{true, DiffEqGPU.var"#55#59"{ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0xe09e3e39, 0x901fd863, 0xaabe4072, 0xa349f5db, 0x1e2ac5dd)}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0x255a3166, 0x06ba9a71, 0xa2d333f0, 0xa11339db, 0xc80f8c0d)}}, typeof(DiffEqGPU.gpu_kernel)}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, OrdinaryDiffEq.Tsit5Cache{CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float32, Float32}}, OrdinaryDiffEq.DEOptions{Float32, Float32, Float32, Float32, typeof(DiffEqGPU.diffeqgpunorm), typeof(LinearAlgebra.opnorm), Nothing, CallbackSet{Tuple{}, Tuple{}}, typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), DiffEqGPU.var"#10#16", DataStructures.BinaryMinHeap{Float32}, DataStructures.BinaryMinHeap{Float32}, Nothing, Nothing, Int64, Tuple{}, Float32, Tuple{}}, CUDA.CuArray{Float32, 2}, Float32, Nothing, OrdinaryDiffEq.DefaultInit}, cache::OrdinaryDiffEq.Tsit5Cache{CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float32, Float32}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/2Z4fE/src/perform_step/low_order_rk_perform_step.jl:623
 [15] __init(prob::ODEProblem{CUDA.CuArray{Float32, 2}, Tuple{Float32, Float32}, true, CUDA.CuArray{Float32, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0xe09e3e39, 0x901fd863, 0xaabe4072, 0xa349f5db, 0x1e2ac5dd)}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0x255a3166, 0x06ba9a71, 0xa2d333f0, 0xa11339db, 0xc80f8c0d)}}, typeof(DiffEqGPU.gpu_kernel)}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::Tsit5, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}, recompile::Type{Val{true}}; saveat::Float32, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Nothing, callback::Nothing, dense::Bool, calck::Bool, dt::Float32, dtmin::Nothing, dtmax::Float32, force_dtmin::Bool, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmin::Rational{Int64}, qmax::Int64, qsteady_min::Int64, qsteady_max::Int64, qoldinit::Rational{Int64}, fullnormalize::Bool, failfactor::Int64, beta1::Nothing, beta2::Nothing, maxiters::Int64, internalnorm::typeof(DiffEqGPU.diffeqgpunorm), internalopnorm::typeof(LinearAlgebra.opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::DiffEqGPU.var"#10#16", verbose::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias_u0::Bool, alias_du0::Bool, initializealg::OrdinaryDiffEq.DefaultInit, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/2Z4fE/src/solve.jl:433
 [16] #__solve#403
    @ ~/.julia/packages/OrdinaryDiffEq/2Z4fE/src/solve.jl:4 [inlined]
 [17] #solve_call#56
    @ ~/.julia/packages/DiffEqBase/jhLIm/src/solve.jl:61 [inlined]
 [18] solve_up(prob::ODEProblem{CUDA.CuArray{Float32, 2}, Tuple{Float32, Float32}, true, CUDA.CuArray{Float32, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0xe09e3e39, 0x901fd863, 0xaabe4072, 0xa349f5db, 0x1e2ac5dd)}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0x255a3166, 0x06ba9a71, 0xa2d333f0, 0xa11339db, 0xc80f8c0d)}}, typeof(DiffEqGPU.gpu_kernel)}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, sensealg::Nothing, u0::CUDA.CuArray{Float32, 2}, p::CUDA.CuArray{Float32, 2}, args::Tsit5; kwargs::Base.Iterators.Pairs{Symbol, Any, NTuple{5, Symbol}, NamedTuple{(:unstable_check, :saveat, :callback, :merge_callbacks, :internalnorm), Tuple{DiffEqGPU.var"#10#16", Float32, Nothing, Bool, typeof(DiffEqGPU.diffeqgpunorm)}}})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/jhLIm/src/solve.jl:82
 [19] #solve#57
    @ ~/.julia/packages/DiffEqBase/jhLIm/src/solve.jl:70 [inlined]
 [20] batch_solve_up(ensembleprob::EnsembleProblem{ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, Vector{Float32}, ODEFunction{true, ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0xe09e3e39, 0x901fd863, 0xaabe4072, 0xa349f5db, 0x1e2ac5dd)}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0x255a3166, 0x06ba9a71, 0xa2d333f0, 0xa11339db, 0xc80f8c0d)}}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Vector{Symbol}, Symbol, ModelingToolkit.var"#121#generated_observed#155"{Bool, ODESystem, Dict{Any, Any}}, Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, Vector{Float32}, ODEFunction{true, ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0xe09e3e39, 0x901fd863, 0xaabe4072, 0xa349f5db, 0x1e2ac5dd)}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0x255a3166, 0x06ba9a71, 0xa2d333f0, 0xa11339db, 0xc80f8c0d)}}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Vector{Symbol}, Symbol, ModelingToolkit.var"#121#generated_observed#155"{Bool, ODESystem, Dict{Any, Any}}, Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::Tsit5, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}, u0::Matrix{Float32}, p::Matrix{Float32}; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#10#16", Float32}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/YMmTv/src/DiffEqGPU.jl:313
 [21] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, Vector{Float32}, ODEFunction{true, ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0xe09e3e39, 0x901fd863, 0xaabe4072, 0xa349f5db, 0x1e2ac5dd)}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0x255a3166, 0x06ba9a71, 0xa2d333f0, 0xa11339db, 0xc80f8c0d)}}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Vector{Symbol}, Symbol, ModelingToolkit.var"#121#generated_observed#155"{Bool, ODESystem, Dict{Any, Any}}, Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#10#16", Float32}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/YMmTv/src/DiffEqGPU.jl:278
 [22] macro expansion
    @ ./timing.jl:287 [inlined]
 [23] __solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, Vector{Float32}, ODEFunction{true, ModelingToolkit.var"#f#148"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0xe09e3e39, 0x901fd863, 0xaabe4072, 0xa349f5db, 0x1e2ac5dd)}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(Symbol("##out#353"), Symbol("##arg#351"), Symbol("##arg#352"), :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", (0x255a3166, 0x06ba9a71, 0xa2d333f0, 0xa11339db, 0xc80f8c0d)}}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Vector{Symbol}, Symbol, ModelingToolkit.var"#121#generated_observed#155"{Bool, ODESystem, Dict{Any, Any}}, Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#1#2", typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleGPUArray; trajectories::Int64, batch_size::Int64, unstable_check::Function, kwargs::Base.Iterators.Pairs{Symbol, Float32, Tuple{Symbol}, NamedTuple{(:saveat,), Tuple{Float32}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/YMmTv/src/DiffEqGPU.jl:195
 [24] #solve#59
    @ ~/.julia/packages/DiffEqBase/jhLIm/src/solve.jl:96 [inlined]
 [25] top-level scope
    @ ~/test_GPU/test.jl:27
 [26] include(fname::String)
    @ Base.MainInclude ./client.jl:444
 [27] top-level scope
    @ REPL[2]:1
in expression starting at ~/test_GPU/test.jl:27

(test_GPU) pkg> st
      Status `~/test_GPU/Project.toml`
  [052768ef] CUDA v2.6.3
  [071ae1c0] DiffEqGPU v1.11.0
  [63c18a36] KernelAbstractions v0.6.1
  [961ee093] ModelingToolkit v5.16.0
  [1dea7af3] OrdinaryDiffEq v5.52.6


julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 20

eGPU runs slower than without only when using multiprocessing

using DiffEqGPU, CuArrays, OrdinaryDiffEq, Test
CuArrays.device!(0)

using Distributed
addprocs(2)
@everywhere using DiffEqGPU, CuArrays, OrdinaryDiffEq, Test, Random

@everywhere begin
    function lorenz_distributed(du,u,p,t)
     @inbounds begin
         du[1] = p[1]*(u[2]-u[1])
         du[2] = u[1]*(p[2]-u[3]) - u[2]
         du[3] = u[1]*u[2] - p[3]*u[3]
     end
     nothing
    end
    CuArrays.allowscalar(false)
    u0 = Float32[1.0;0.0;0.0]
    tspan = (0.0f0,100.0f0)
    p = (10.0f0,28.0f0,8/3f0)
    Random.seed!(1)
    pre_p_distributed = [rand(Float32,3) for i in 1:100_000]
    function prob_func_distributed(prob,i,repeat)
        remake(prob,p=pre_p_distributed[i].*p)
    end
end

@sync begin
    @spawnat 2 CuArrays.device!(0)
    @spawnat 3 CuArrays.device!(1)
end

CuArrays.allowscalar(false)
prob = ODEProblem(lorenz_distributed,u0,tspan,p)
monteprob = EnsembleProblem(prob, prob_func = prob_func_distributed)

@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,batch_size=50_000,saveat=1.0f0)
@time solve(monteprob,Tsit5(),EnsembleThreads(), trajectories=100_000,saveat=1.0f0)

@vchuravy

Mysterious CI-only failure of reductions.jl tests

MWE:

# ode checks
using OrdinaryDiffEq, DiffEqGPU, Test

seed = 100
using Random;Random.seed!(seed)
ra = rand(100)

function f!(du,u,p,t)
     du[1] = 1.01*u[1]
end

prob = ODEProblem(f!,[0.5],(0.0,1.0))

function output_func(sol,i)
  last(sol), false
end

function prob_func(prob,i,repeat)
  remake(prob,u0=ra[i]*prob.u0)
end

function reduction(u,batch,I)
  u.+sum(batch),false
end

# no reduction
prob1 = EnsembleProblem(prob,prob_func=prob_func,output_func=output_func)
sim1 = @time solve(prob1,Tsit5(),trajectories=100,batch_size=20)

# reduction and EnsembleThreads()
prob2 = EnsembleProblem(prob,prob_func=prob_func,output_func=output_func,
  reduction=reduction,u_init=Vector{eltype(prob.u0)}([0.0])
  )
sim2 = @time solve(prob2,Tsit5(),trajectories=100,batch_size=20)


# EnsembleCPUArray() and EnsembleGPUArray()
sim3 = @time solve(prob2,Tsit5(),EnsembleCPUArray(),trajectories=100,batch_size=20)
sim4 = @time solve(prob2,Tsit5(),EnsembleGPUArray(),trajectories=100,batch_size=20)

@info  sim2[1]

@test sum(sim1.u) ≈ sim2.u
@test sim2.u ≈ sim3.u
@test sim2.u ≈ sim4.u

<html>
<body>
<!--StartFragment-->

Reduction: Error During Test at /root/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/SafeTestsets/A83XK/src/SafeTestsets.jl:25
--
  | Got exception outside of a @test
  | LoadError: KernelException: exception thrown during kernel execution on device NVIDIA A100-PCIE-40GB MIG 1g.5gb
  | Stacktrace:
  | [1] check_exceptions()
  | @ CUDA ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/CUDA/GGwVa/src/compiler/exceptions.jl:34
  | [2] nonblocking_synchronize
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/CUDA/GGwVa/lib/cudadrv/context.jl:331 [inlined]
  | [3] device_synchronize()
  | @ CUDA ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/CUDA/GGwVa/lib/cudadrv/context.jl:319
  | [4] CUDA.CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
  | @ CUDA ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/CUDA/GGwVa/lib/cudadrv/module.jl:41
  | [5] CuModule
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/CUDA/GGwVa/lib/cudadrv/module.jl:23 [inlined]
  | [6] macro expansion
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/TimerOutputs/jgSVI/src/TimerOutput.jl:236 [inlined]
  | [7] macro expansion
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/CUDA/GGwVa/src/compiler/execution.jl:479 [inlined]
  | [8] cufunction_link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
  | @ CUDA ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/TimerOutputs/jgSVI/src/TimerOutput.jl:236
  | [9] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
  | @ GPUCompiler ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/GPUCompiler/XyxTy/src/cache.jl:95
  | [10] macro expansion
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/CUDA/GGwVa/src/compiler/execution.jl:299 [inlined]
  | [11] cufunction(f::typeof(CUDA.partial_mapreduce_grid), tt::Type{Tuple{typeof(identity), typeof(Base.add_sum), Float64, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, CartesianIndices{2, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}}, Val{true}, CUDA.CuDeviceArray{Float64, 3, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(abs2), Tuple{CUDA.CuDeviceMatrix{Float64, 1}}}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
  | @ CUDA ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/TimerOutputs/jgSVI/src/TimerOutput.jl:236
  | [12] cufunction
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/TimerOutputs/jgSVI/src/TimerOutput.jl:229 [inlined]
  | [13] macro expansion
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
  | [14] mapreducedim!(f::typeof(identity), op::typeof(Base.add_sum), R::CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, A::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(abs2), Tuple{CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}}; init::Float64)
  | @ CUDA ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/CUDA/GGwVa/src/mapreduce.jl:234
  | [15] _mapreduce(f::typeof(abs2), op::typeof(Base.add_sum), As::CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}; dims::Colon, init::Nothing)
  | @ GPUArrays ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/GPUArrays/Zecv7/src/host/mapreduce.jl:69
  | [16] #mapreduce#20
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/GPUArrays/Zecv7/src/host/mapreduce.jl:31 [inlined]
  | [17] mapreduce
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/GPUArrays/Zecv7/src/host/mapreduce.jl:31 [inlined]
  | [18] #_sum#741
  | @ ./reducedim.jl:894 [inlined]
  | [19] _sum
  | @ ./reducedim.jl:894 [inlined]
  | [20] #sum#739
  | @ ./reducedim.jl:890 [inlined]
  | [21] sum
  | @ ./reducedim.jl:890 [inlined]
  | [22] diffeqgpunorm
  | @ /var/lib/buildkite-agent/builds/gpuci-13/julialang/diffeqgpu-dot-jl/src/DiffEqGPU.jl:289 [inlined]
  | [23] __init(prob::SciMLBase.ODEProblem{CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, Tuple{Float64, Float64}, true, CUDA.CuArray{SciMLBase.NullParameters, 2, CUDA.Mem.DeviceBuffer}, SciMLBase.ODEFunction{true, DiffEqGPU.var"#59#63"{typeof(Main.##4656.f!), typeof(DiffEqGPU.gpu_kernel)}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::OrdinaryDiffEq.Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}, recompile::Type{Val{true}}; saveat::Tuple{}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Nothing, callback::Nothing, dense::Bool, calck::Bool, dt::Float64, dtmin::Nothing, dtmax::Float64, force_dtmin::Bool, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmin::Rational{Int64}, qmax::Int64, qsteady_min::Int64, qsteady_max::Int64, beta1::Nothing, beta2::Nothing, qoldinit::Rational{Int64}, controller::Nothing, fullnormalize::Bool, failfactor::Int64, maxiters::Int64, internalnorm::typeof(DiffEqGPU.diffeqgpunorm), internalopnorm::typeof(LinearAlgebra.opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::DiffEqGPU.var"#12#18", verbose::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias_u0::Bool, alias_du0::Bool, initializealg::OrdinaryDiffEq.DefaultInit, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
  | @ OrdinaryDiffEq ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/OrdinaryDiffEq/irVAX/src/solve.jl:274
  | [24] #__solve#502
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/OrdinaryDiffEq/irVAX/src/solve.jl:4 [inlined]
  | [25] #solve_call#28
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/DiffEqBase/kFf4V/src/solve.jl:388 [inlined]
  | [26] solve_up(prob::SciMLBase.ODEProblem{CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, Tuple{Float64, Float64}, true, CUDA.CuArray{SciMLBase.NullParameters, 2, CUDA.Mem.DeviceBuffer}, SciMLBase.ODEFunction{true, DiffEqGPU.var"#59#63"{typeof(Main.##4656.f!), typeof(DiffEqGPU.gpu_kernel)}, LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, sensealg::Nothing, u0::CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, p::CUDA.CuArray{SciMLBase.NullParameters, 2, CUDA.Mem.DeviceBuffer}, args::OrdinaryDiffEq.Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:unstable_check, :callback, :merge_callbacks, :internalnorm), Tuple{DiffEqGPU.var"#12#18", Nothing, Bool, typeof(DiffEqGPU.diffeqgpunorm)}}})
  | @ DiffEqBase ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/DiffEqBase/kFf4V/src/solve.jl:686
  | [27] #solve#29
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/DiffEqBase/kFf4V/src/solve.jl:670 [inlined]
  | [28] batch_solve_up(ensembleprob::SciMLBase.EnsembleProblem{SciMLBase.ODEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, SciMLBase.NullParameters, SciMLBase.ODEFunction{true, typeof(Main.##4656.f!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(Main.##4656.prob_func), typeof(Main.##4656.output_func), typeof(Main.##4656.reduction), Vector{Float64}}, probs::Vector{SciMLBase.ODEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, SciMLBase.NullParameters, SciMLBase.ODEFunction{true, typeof(Main.##4656.f!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::OrdinaryDiffEq.Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, ensemblealg::DiffEqGPU.EnsembleGPUArray, I::UnitRange{Int64}, u0::Matrix{Float64}, p::Matrix{SciMLBase.NullParameters}; kwargs::Base.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
  | @ DiffEqGPU /var/lib/buildkite-agent/builds/gpuci-13/julialang/diffeqgpu-dot-jl/src/DiffEqGPU.jl:362
  | [29] batch_solve(ensembleprob::SciMLBase.EnsembleProblem{SciMLBase.ODEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, SciMLBase.NullParameters, SciMLBase.ODEFunction{true, typeof(Main.##4656.f!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(Main.##4656.prob_func), typeof(Main.##4656.output_func), typeof(Main.##4656.reduction), Vector{Float64}}, alg::OrdinaryDiffEq.Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, ensemblealg::DiffEqGPU.EnsembleGPUArray, I::UnitRange{Int64}, adaptive::Bool; kwargs::Base.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
  | @ DiffEqGPU /var/lib/buildkite-agent/builds/gpuci-13/julialang/diffeqgpu-dot-jl/src/DiffEqGPU.jl:326
  | [30] (::DiffEqGPU.var"#9#15"{Int64, DiffEqGPU.var"#12#18", Bool, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.EnsembleProblem{SciMLBase.ODEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, SciMLBase.NullParameters, SciMLBase.ODEFunction{true, typeof(Main.##4656.f!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(Main.##4656.prob_func), typeof(Main.##4656.output_func), typeof(Main.##4656.reduction), Vector{Float64}}, OrdinaryDiffEq.Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, DiffEqGPU.EnsembleGPUArray, Int64})(i::Int64)
  | @ DiffEqGPU /var/lib/buildkite-agent/builds/gpuci-13/julialang/diffeqgpu-dot-jl/src/DiffEqGPU.jl:247
  | [31] iterate
  | @ ./generator.jl:47 [inlined]
  | [32] _collect(c::UnitRange{Int64}, itr::Base.Generator{UnitRange{Int64}, DiffEqGPU.var"#9#15"{Int64, DiffEqGPU.var"#12#18", Bool, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.EnsembleProblem{SciMLBase.ODEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, SciMLBase.NullParameters, SciMLBase.ODEFunction{true, typeof(Main.##4656.f!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(Main.##4656.prob_func), typeof(Main.##4656.output_func), typeof(Main.##4656.reduction), Vector{Float64}}, OrdinaryDiffEq.Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, DiffEqGPU.EnsembleGPUArray, Int64}}, #unused#::Base.EltypeUnknown, isz::Base.HasShape{1})
  | @ Base ./array.jl:744
  | [33] collect_similar(cont::UnitRange{Int64}, itr::Base.Generator{UnitRange{Int64}, DiffEqGPU.var"#9#15"{Int64, DiffEqGPU.var"#12#18", Bool, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.EnsembleProblem{SciMLBase.ODEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, SciMLBase.NullParameters, SciMLBase.ODEFunction{true, typeof(Main.##4656.f!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(Main.##4656.prob_func), typeof(Main.##4656.output_func), typeof(Main.##4656.reduction), Vector{Float64}}, OrdinaryDiffEq.Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, DiffEqGPU.EnsembleGPUArray, Int64}})
  | @ Base ./array.jl:653
  | [34] map(f::Function, A::UnitRange{Int64})
  | @ Base ./abstractarray.jl:2867
  | [35] macro expansion
  | @ /var/lib/buildkite-agent/builds/gpuci-13/julialang/diffeqgpu-dot-jl/src/DiffEqGPU.jl:241 [inlined]
  | [36] macro expansion
  | @ ./timing.jl:299 [inlined]
  | [37] __solve(ensembleprob::SciMLBase.EnsembleProblem{SciMLBase.ODEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, SciMLBase.NullParameters, SciMLBase.ODEFunction{true, typeof(Main.##4656.f!), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(Main.##4656.prob_func), typeof(Main.##4656.output_func), typeof(Main.##4656.reduction), Vector{Float64}}, alg::OrdinaryDiffEq.Tsit5{typeof(OrdinaryDiffEq.trivial_limiter!), typeof(OrdinaryDiffEq.trivial_limiter!), Static.False}, ensemblealg::DiffEqGPU.EnsembleGPUArray; trajectories::Int64, batch_size::Int64, unstable_check::Function, adaptive::Bool, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
  | @ DiffEqGPU /var/lib/buildkite-agent/builds/gpuci-13/julialang/diffeqgpu-dot-jl/src/DiffEqGPU.jl:240
  | [38] #solve#31
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/DiffEqBase/kFf4V/src/solve.jl:700 [inlined]
  | [39] top-level scope
  | @ ./timing.jl:220
  | [40] include(mod::Module, _path::String)
  | @ Base ./Base.jl:418
  | [41] include(x::String)
  | @ Main.##4656 ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/SafeTestsets/A83XK/src/SafeTestsets.jl:23
  | [42] macro expansion
  | @ /var/lib/buildkite-agent/builds/gpuci-13/julialang/diffeqgpu-dot-jl/test/runtests.jl:22 [inlined]
  | [43] macro expansion
  | @ ~/.cache/julia-buildkite-plugin/julia_installs/bin/linux/x64/1.7/julia-1.7-latest-linux-x86_64/share/julia/stdlib/v1.7/Test/src/Test.jl:1283 [inlined]
  | [44] top-level scope
  | @ /var/lib/buildkite-agent/builds/gpuci-13/julialang/diffeqgpu-dot-jl/test/runtests.jl:22
  | [45] eval(m::Module, e::Any)
  | @ Core ./boot.jl:373
  | [46] macro expansion
  | @ ~/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b/packages/SafeTestsets/A83XK/src/SafeTestsets.jl:23 [inlined]
  | [47] top-level scope
  | @ timing.jl:220
  | [48] include(fname::String)
  | @ Base.MainInclude ./client.jl:451
  | [49] top-level scope
  | @ none:5
  | [50] eval
  | @ ./boot.jl:373 [inlined]
  | [51] exec_options(opts::Base.JLOptions)
  | @ Base ./client.jl:268
  | in expression starting at /var/lib/buildkite-agent/builds/gpuci-13/julialang/diffeqgpu-dot-jl/test/reduction.jl:39
  | Test Summary: \| Error  Total
  | Reduction     \|     1      1
  | ERROR: LoadError: Some tests did not pass: 0 passed, 0 failed, 1 errored, 0 broken.
  | in expression starting at /var/lib/buildkite-agent/builds/gpuci-13/julialang/diffeqgpu-dot-jl/test/runtests.jl:22
  | ERROR: LoadError: failed process: Process(`/root/.cache/julia-buildkite-plugin/julia_installs/bin/linux/x64/1.7/julia-1.7-latest-linux-x86_64/bin/julia -Cnative -J/root/.cache/julia-buildkite-plugin/julia_installs/bin/linux/x64/1.7/julia-1.7-latest-linux-x86_64/lib/julia/sys.so --depwarn=yes -g1 --color=yes --startup-file=no --eval 'append!(empty!(Base.DEPOT_PATH), ["/root/.cache/julia-buildkite-plugin/depots/26e4f8df-bbdd-40a2-82e4-24a159795e4b"])
  | append!(empty!(Base.DL_LOAD_PATH), String[])

<!--EndFragment-->
</body>
</html>

@vchuravy @maleadt @utkarsh530 are you able to reproduce this locally? I can't for some reason.

Possible Memory Error?

using DiffEqGPU, CuArrays, OrdinaryDiffEq, Test, StaticArrays

function f(du,u,p,t)
  @inbounds begin
  x = SVector{7}(u[1],u[2],u[3],u[4],u[5],u[6],u[7])
  y = SVector{7}(u[8],u[9],u[10],u[11],u[12],u[13],u[14])
  du[1] = u[15]
  du[2] = u[16]
  du[3] = u[17]
  du[4] = u[18]
  du[5] = u[19]
  du[6] = u[20]
  du[7] = u[21]
  du[8] = u[22]
  du[9] = u[23]
  du[10] = u[24]
  du[11] = u[25]
  du[12] = u[26]
  du[13] = u[27]
  du[14] = u[28]
  for i in 14:28
    du[i] = 0.0f0
  end
  for i=1:7,j=1:7
    if i != j
      r = ((x[i]-x[j])*(x[i]-x[j]) + (y[i] - y[j])*(y[i] - y[j]))^(1.5f0)
      du[14+i] += j*(x[j] - x[i])/r
      du[21+i] += j*(y[j] - y[i])/r
    end
  end
  end
  nothing
end

prob = ODEProblem(f,Float32[3.0,3.0,-1.0,-3.0,2.0,-2.0,2.0,3.0,-3.0,2.0,0,0,-4.0,4.0,0,0,0,0,0,1.75,-1.5,0,0,0,-1.25,1,0,0],(0.0f0,3.0f0))
@time sol = solve(prob,Tsit5())
monteprob = EnsembleProblem(prob)
println("GPU")
@time solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_0,saveat=0.5f0)

works while

@time solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=0.5f0)

throws

InvalidIRError: compiling reduce_kernel(CuArrays.CuKernelState, typeof(==), typeof(&), Bool, CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, Val{256}, CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to GPUArrays.simple_broadcast_index)
Stacktrace:
 [1] reduce_kernel at C:\Users\accou\.julia\packages\GPUArrays\fAX0Q\src\mapreduce.jl:141
check_ir(::CUDAnative.CompilerJob, ::LLVM.Module) at validation.jl:114
macro expansion at TimerOutput.jl:216 [inlined]
#codegen#121(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Function, ::Symbol, ::CUDAnative.CompilerJob) at driver.jl:186
#codegen at none:0 [inlined]
#compile#120(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Function, ::Symbol, ::CUDAnative.CompilerJob) at driver.jl:47
#compile at none:0 [inlined]
#compile#119 at driver.jl:28 [inlined]
#compile at none:0 [inlined]
#compile at none:0 [inlined]
macro expansion at execution.jl:388 [inlined]
#cufunction#161(::Nothing, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.cufunction), ::typeof(GPUArrays.reduce_kernel), ::Type{Tuple{CuArrays.CuKernelState,typeof(==),typeof(&),Bool,CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},Val{256},CUDAnative.CuDeviceArray{Bool,1,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}}}) at execution.jl:356
cufunction(::Function, ::Type) at execution.jl:356
macro expansion at execution.jl:174 [inlined]
macro expansion at gcutils.jl:87 [inlined]
macro expansion at execution.jl:171 [inlined]
_gpu_call(::CuArrays.CuArrayBackend, ::Function, ::CuArray{Bool,1}, ::Tuple{typeof(==),typeof(&),Bool,CuArray{Float32,2},Val{256},CuArray{Bool,1},CuArray{Float32,2}}, ::Tuple{Tuple{Int64},Tuple{Int64}}) at gpuarray_interface.jl:59
gpu_call(::Function, ::CuArray{Bool,1}, ::Tuple{typeof(==),typeof(&),Bool,CuArray{Float32,2},Val{256},CuArray{Bool,1},CuArray{Float32,2}}, ::Tuple{Tuple{Int64},Tuple{Int64}}) at abstract_gpu_interface.jl:151
acc_mapreduce(::Function, ::Function, ::Bool, ::CuArray{Float32,2}, ::Tuple{CuArray{Float32,2}}) at mapreduce.jl:186
#mapreduce#50 at mapreduce.jl:87 [inlined]
#mapreduce at none:0 [inlined]
== at mapreduce.jl:13 [inlined]
ode_determine_initdt(::CuArray{Float32,2}, ::Float32, ::Float32, ::Float32, ::Float32, ::Float32, ::typeof(DiffEqBase.ODE_DEFAULT_NORM), ::ODEProblem{CuArray{Float32,2},Tuple{Float32,Float32},true,CuArray{Nothing,2},ODEFunction{true,getfield(DiffEqGPU, Symbol("##12#22")){ODEFunction{true,typeof(f),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Nothing,DiffEqBase.StandardODEProblem}, ::OrdinaryDiffEq.ODEIntegrator{Tsit5,true,CuArray{Float32,2},Float32,CuArray{Nothing,2},Float32,Float32,Float32,Array{CuArray{Float32,2},1},ODESolution{Float32,3,Array{CuArray{Float32,2},1},Nothing,Nothing,Array{Float32,1},Array{Array{CuArray{Float32,2},1},1},ODEProblem{CuArray{Float32,2},Tuple{Float32,Float32},true,CuArray{Nothing,2},ODEFunction{true,getfield(DiffEqGPU, Symbol("##12#22")){ODEFunction{true,typeof(f),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Nothing,DiffEqBase.StandardODEProblem},Tsit5,OrdinaryDiffEq.InterpolationData{ODEFunction{true,getfield(DiffEqGPU, Symbol("##12#22")){ODEFunction{true,typeof(f),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Array{CuArray{Float32,2},1},Array{Float32,1},Array{Array{CuArray{Float32,2},1},1},OrdinaryDiffEq.Tsit5Cache{CuArray{Float32,2},CuArray{Float32,2},CuArray{Float32,2},OrdinaryDiffEq.Tsit5ConstantCache{Float32,Float32}}},DiffEqBase.DEStats},ODEFunction{true,getfield(DiffEqGPU, Symbol("##12#22")){ODEFunction{true,typeof(f),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},OrdinaryDiffEq.Tsit5Cache{CuArray{Float32,2},CuArray{Float32,2},CuArray{Float32,2},OrdinaryDiffEq.Tsit5ConstantCache{Float32,Float32}},OrdinaryDiffEq.DEOptions{Float32,Float32,Float32,Float32,typeof(DiffEqBase.ODE_DEFAULT_NORM),typeof(LinearAlgebra.opnorm),CallbackSet{Tuple{},Tuple{}},typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN),typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE),typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK),DataStructures.BinaryHeap{Float32,DataStructures.LessThan},DataStructures.BinaryHeap{Float32,DataStructures.LessThan},Nothing,Nothing,Int64,Ar...

Splatting causes GPUifyLoops compilation failure

using OrdinaryDiffEq, DiffEqGPU
function lorenz(du,u,p,t)
 @inbounds begin
     a,b,c = p
     du[1] = a*(u[2]-u[1])
     du[2] = u[1]*(b-u[3]) - u[2]
     du[3] = u[1]*u[2] - c*u[3]
 end
 nothing
end

u0 = Float32[1.0;0.0;0.0]
tspan = (0.0f0,100.0f0)
p = (10.0f0,28.0f0,8/3f0)
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func)
@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)

@vchuravy do you know about this one?

Add AMDGPU support

With ROCKernels being merged into KernelAbstractions, we should soon be able to integrate AMDGPU support into this package!

@ChrisRackauckas if this is desired, how would you like me to do this? Currently DiffEqGPU depends on CUDA; I could also have it depend on AMDGPU.jl as well, and ensure I keep dependencies up-to-date so we don't run into version conflicts and unnecessary downgrades. Alternatively, I could make AMDGPU (and optionally CUDA) gated behind Requires, although it could possibly impact compile times post-precompile.

Support for delay and/or stochastic delay problems

It would be helpful to apply GPUs for ensemble studies with the delay and stochastic delay problems.

matrix differential equation support

The documentation says CuArray u0 are supported.
But only one dimensional arrays are supported because of lines like:

DiffEqGPU.jl/src/DiffEqGPU.jl

Line 345 in 8489a3e

u0 = reduce(hcat, Array(probs[i].u0) for i in 1:length(I))

To generalize this to N dimensions the u0 should concatenate across an additional dimension.
And the kernels should then index over this last dimension, instead of assuming there are only 2:

DiffEqGPU.jl/src/DiffEqGPU.jl

Line 24 in 8489a3e

@views @inbounds f(du[:, i], u[:, i], p[:, i], t)

LoadError: KernelException on README example

Hi, if I run the following:

using OrdinaryDiffEq, CUDA, LinearAlgebra
using DiffEqGPU
function lorenz(du, u, p, t)
    du[1] = p[1] * (u[2] - u[1])
    du[2] = u[1] * (p[2] - u[3]) - u[2]
    du[3] = u[1] * u[2] - p[3] * u[3]
end

u0 = Float32[1.0; 0.0; 0.0]
tspan = (0.0f0, 100.0f0)
p = [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = rand(Float32, 3) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
sol = solve(monteprob, Tsit5(), EnsembleGPUArray(), trajectories = 10, saveat = 1.0f0)

I get the following error....

ERROR: a exception was thrown during kernel execution.
Run Julia on debug level 2 for device stack traces.
ERROR: LoadError: KernelException: exception thrown during kernel execution on device Tesla V100-DGXS-16GB
Stacktrace:

this is my env: (DiffEqGPU and DiffEqGPU#master show the same error).

~/JuliaConCUDA/Project.toml`
  [621f4979] AbstractFFTs v1.1.0
  [6e4b80f9] BenchmarkTools v1.2.2
  [052768ef] CUDA v3.8.0
  [72cfdca4] CUDAKernels v0.2.1
  [3da002f7] ColorTypes v0.11.0
  [5ae59095] Colors v0.12.8
  [071ae1c0] DiffEqGPU v1.15.0 `https://github.com/SciML/DiffEqGPU.jl.git#master`
  [5789e2e9] FileIO v1.13.0
  [53c48c17] FixedPointNumbers v0.8.4
  [f332f351] ImageContrastAdjustment v0.3.10
  [a09fc81d] ImageCore v0.9.3
  [6a3955dd] ImageFiltering v0.7.1
  [6218d12a] ImageMagick v1.2.2
  [4e3cecfd] ImageShow v0.3.3
  [63c18a36] KernelAbstractions v0.6.3
  [1dea7af3] OrdinaryDiffEq v6.6.6
  [62fd8b95] TensorCore v0.1.1
  [5e47fb64] TestImages v1.6.2
  [bc48ee85] Tullio v0.3.3
  [37e2e46d] LinearAlgebra

Broadcasting in Kernels

using DiffEqGPU, OrdinaryDiffEq
pa = [1.0]
u0 = [3.0]

function f(du,u,p,t)
    du[1] = 1.01 * u[1] * p[1]
end

function f2(du,u,p,t)
    du .= 1.01 .* u .* p
end

prob = ODEProblem(f2, u0, (0.0, 1.0), pa)

function prob_func(prob, i, repeat)
  remake(prob, u0 = 0.5 .+ i/100 .* prob.u0)
end

ensemble_prob = EnsembleProblem(prob, prob_func = prob_func)
sim = solve(ensemble_prob, Tsit5(), EnsembleGPUArray(), saveat = 0.1, trajectories = 100)

CI Test Failures while locally passing

https://gitlab.com/juliadiffeq/DiffEqGPU-jl/-/jobs/302470230

ERROR: LoadError: InvalidIRError: compiling gpu_kernel(Cassette.Context{nametype(Ctx),Nothing,Nothing,getfield(GPUifyLoops, Symbol("##PassType#363")),Nothing,Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_kernel), ODEFunction{true,typeof(lorenz),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}, CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, Float32) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_f_tuple)

@vchuravy this is even after the GPUArrays.jl update I think.

Scope of improvements in EnsembleGPUKernel

#170
The latest profile, while solving from EnsembleGPUKernel, raises some questions:

Some overheads are discussed here for potential improvements EnsembleGPUKernel for Tsit5.

Converting the solution back to CuArrays.
The reason for this overhead (converting to CPU Arrays) is to provide users access to something like sol[i].u[j] where i,j are some indexes. It would cause scalar indexing on ts,us, which are CuArrays.

Possible workaround: Leave it to the user to convert to CPU Arrays if it needs to index the solution.

Ensemble problem creation for parameter parallelism
The probs creation within the DiffEqGPU seems to be necessary, but maybe it could be pulled out of DiffEqGPU? Currently, it was done to adhere to the DiffEqGPU way of handling it. This was not coming in the previous benchmarks because ps was being built separately and passed to the vectorized_solve.

Possible workaround: create ps or u0s and pass them into DiffEqGPU instead of only specifying the trajectories, and the library handles the rest.

If we don’t convert to CPU Arrays, we’ll get good performance (~2x faster) plus if we let user build ps (instead of asking the trajectories and building ourselves), we’ll probably reach the desired benchmark.

Skip parameters that lead to errors

Hello i am trying to solve an ODE system for different parameter vectors. However, some of the parameters lead to imaginary or unstable solutions. Is it possible to just skip them? Right now i get:

ERROR: LoadError: DomainError with -2.8762994816842586:
Exponentiation yielding a complex result requires a complex argument.
Replace x^y with (x+0im)^y, Complex(x)^y, or similar.
Stacktrace:
[1] throw_exp_domainerror(x::Float64)
@ Base.Math ./math.jl:37
[2] ^
@ ./math.jl:909 [inlined]
[3] Glucose_Yeast2(du::Vector{Float64}, u::Vector{Float64}, p::Vector{Float64}, t::Float64)
@ Main ~/Documents/julia_GPU_solver/Plot_GY_Test.jl:10
[4] ODEFunction
@ ~/.julia/packages/SciMLBase/UEAKN/src/scimlfunctions.jl:1613 [inlined]
[5] perform_step!(integrator::OrdinaryDiffEq.ODEIntegrator{Rodas5{7, true, GenericLUFactorization{LinearAlgebra.RowMaximum}, typeof(OrdinaryDiffEq.DEFAULT_PRECS), Val{:forward}, true, nothing}, true, Vector{Float64}, Nothing, Float64, Vector{Float64}, Float64, Float64, Float64, Float64, Vector{Vector{Float64}}, ODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, Vector{Vector{Vector{Float64}}}, ODEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, Rodas5{7, true, GenericLUFactorization{LinearAlgebra.RowMaximum}, typeof(OrdinaryDiffEq.DEFAULT_PRECS), Val{:forward}, true, nothing}, OrdinaryDiffEq.InterpolationData{ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{Vector{Float64}}, Vector{Float64}, Vector{Vector{Vector{Float64}}}, OrdinaryDiffEq.Rosenbrock5Cache{Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Matrix{Float64}, OrdinaryDiffEq.Rodas5Tableau{Float64, Float64}, SciMLBase.TimeGradientWrapper{ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{Float64}, Vector{Float64}}, SciMLBase.UJacobianWrapper{ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Float64, Vector{Float64}}, LinearSolve.LinearCache{Matrix{Float64}, Vector{Float64}, Vector{Float64}, SciMLBase.NullParameters, GenericLUFactorization{LinearAlgebra.RowMaximum}, LinearAlgebra.LU{Float64, Matrix{Float64}}, LinearSolve.InvPreconditioner{LinearAlgebra.Diagonal{Float64, Vector{Float64}}}, LinearAlgebra.Diagonal{Float64, Vector{Float64}}, Float64}, SparseDiffTools.ForwardColorJacCache{Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 7}}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 7}}, Vector{Float64}, Vector{Vector{NTuple{7, Float64}}}, UnitRange{Int64}, Nothing}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 1}}, Float64, Rodas5{7, true, GenericLUFactorization{LinearAlgebra.RowMaximum}, typeof(OrdinaryDiffEq.DEFAULT_PRECS), Val{:forward}, true, nothing}}}, DiffEqBase.DEStats}, ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, OrdinaryDiffEq.Rosenbrock5Cache{Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Matrix{Float64}, OrdinaryDiffEq.Rodas5Tableau{Float64, Float64}, SciMLBase.TimeGradientWrapper{ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{Float64}, Vector{Float64}}, SciMLBase.UJacobianWrapper{ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Float64, Vector{Float64}}, LinearSolve.LinearCache{Matrix{Float64}, Vector{Float64}, Vector{Float64}, SciMLBase.NullParameters, GenericLUFactorization{LinearAlgebra.RowMaximum}, LinearAlgebra.LU{Float64, Matrix{Float64}}, LinearSolve.InvPreconditioner{LinearAlgebra.Diagonal{Float64, Vector{Float64}}}, LinearAlgebra.Diagonal{Float64, Vector{Float64}}, Float64}, SparseDiffTools.ForwardColorJacCache{Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 7}}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 7}}, Vector{Float64}, Vector{Vector{NTuple{7, Float64}}}, UnitRange{Int64}, Nothing}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 1}}, Float64, Rodas5{7, true, GenericLUFactorization{LinearAlgebra.RowMaximum}, typeof(OrdinaryDiffEq.DEFAULT_PRECS), Val{:forward}, true, nothing}}, OrdinaryDiffEq.DEOptions{Float64, Float64, Float64, Float64, PIController{Rational{Int64}}, typeof(DiffEqBase.ODE_DEFAULT_NORM), typeof(LinearAlgebra.opnorm), Nothing, CallbackSet{Tuple{}, Tuple{}}, var"#3#5", typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK), DataStructures.BinaryHeap{Float64, DataStructures.FasterForward}, DataStructures.BinaryHeap{Float64, DataStructures.FasterForward}, Nothing, Nothing, Int64, Tuple{}, Vector{Any}, Tuple{}}, Vector{Float64}, Float64, Nothing, OrdinaryDiffEq.DefaultInit}, cache::OrdinaryDiffEq.Rosenbrock5Cache{Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Matrix{Float64}, OrdinaryDiffEq.Rodas5Tableau{Float64, Float64}, SciMLBase.TimeGradientWrapper{ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{Float64}, Vector{Float64}}, SciMLBase.UJacobianWrapper{ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Float64, Vector{Float64}}, LinearSolve.LinearCache{Matrix{Float64}, Vector{Float64}, Vector{Float64}, SciMLBase.NullParameters, GenericLUFactorization{LinearAlgebra.RowMaximum}, LinearAlgebra.LU{Float64, Matrix{Float64}}, LinearSolve.InvPreconditioner{LinearAlgebra.Diagonal{Float64, Vector{Float64}}}, LinearAlgebra.Diagonal{Float64, Vector{Float64}}, Float64}, SparseDiffTools.ForwardColorJacCache{Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 7}}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 7}}, Vector{Float64}, Vector{Vector{NTuple{7, Float64}}}, UnitRange{Int64}, Nothing}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 1}}, Float64, Rodas5{7, true, GenericLUFactorization{LinearAlgebra.RowMaximum}, typeof(OrdinaryDiffEq.DEFAULT_PRECS), Val{:forward}, true, nothing}}, repeat_step::Bool)
@ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/ZgJ9s/src/perform_step/rosenbrock_perform_step.jl:1954
[6] perform_step!
@ ~/.julia/packages/OrdinaryDiffEq/ZgJ9s/src/perform_step/rosenbrock_perform_step.jl:1711 [inlined]
[7] solve!(integrator::OrdinaryDiffEq.ODEIntegrator{Rodas5{7, true, GenericLUFactorization{LinearAlgebra.RowMaximum}, typeof(OrdinaryDiffEq.DEFAULT_PRECS), Val{:forward}, true, nothing}, true, Vector{Float64}, Nothing, Float64, Vector{Float64}, Float64, Float64, Float64, Float64, Vector{Vector{Float64}}, ODESolution{Float64, 2, Vector{Vector{Float64}}, Nothing, Nothing, Vector{Float64}, Vector{Vector{Vector{Float64}}}, ODEProblem{Vector{Float64}, Tuple{Float64, Float64}, true, Vector{Float64}, ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, Rodas5{7, true, GenericLUFactorization{LinearAlgebra.RowMaximum}, typeof(OrdinaryDiffEq.DEFAULT_PRECS), Val{:forward}, true, nothing}, OrdinaryDiffEq.InterpolationData{ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{Vector{Float64}}, Vector{Float64}, Vector{Vector{Vector{Float64}}}, OrdinaryDiffEq.Rosenbrock5Cache{Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Matrix{Float64}, OrdinaryDiffEq.Rodas5Tableau{Float64, Float64}, SciMLBase.TimeGradientWrapper{ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{Float64}, Vector{Float64}}, SciMLBase.UJacobianWrapper{ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Float64, Vector{Float64}}, LinearSolve.LinearCache{Matrix{Float64}, Vector{Float64}, Vector{Float64}, SciMLBase.NullParameters, GenericLUFactorization{LinearAlgebra.RowMaximum}, LinearAlgebra.LU{Float64, Matrix{Float64}}, LinearSolve.InvPreconditioner{LinearAlgebra.Diagonal{Float64, Vector{Float64}}}, LinearAlgebra.Diagonal{Float64, Vector{Float64}}, Float64}, SparseDiffTools.ForwardColorJacCache{Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 7}}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 7}}, Vector{Float64}, Vector{Vector{NTuple{7, Float64}}}, UnitRange{Int64}, Nothing}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 1}}, Float64, Rodas5{7, true, GenericLUFactorization{LinearAlgebra.RowMaximum}, typeof(OrdinaryDiffEq.DEFAULT_PRECS), Val{:forward}, true, nothing}}}, DiffEqBase.DEStats}, ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, OrdinaryDiffEq.Rosenbrock5Cache{Vector{Float64}, Vector{Float64}, Vector{Float64}, Matrix{Float64}, Matrix{Float64}, OrdinaryDiffEq.Rodas5Tableau{Float64, Float64}, SciMLBase.TimeGradientWrapper{ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{Float64}, Vector{Float64}}, SciMLBase.UJacobianWrapper{ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Float64, Vector{Float64}}, LinearSolve.LinearCache{Matrix{Float64}, Vector{Float64}, Vector{Float64}, SciMLBase.NullParameters, GenericLUFactorization{LinearAlgebra.RowMaximum}, LinearAlgebra.LU{Float64, Matrix{Float64}}, LinearSolve.InvPreconditioner{LinearAlgebra.Diagonal{Float64, Vector{Float64}}}, LinearAlgebra.Diagonal{Float64, Vector{Float64}}, Float64}, SparseDiffTools.ForwardColorJacCache{Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 7}}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 7}}, Vector{Float64}, Vector{Vector{NTuple{7, Float64}}}, UnitRange{Int64}, Nothing}, Vector{ForwardDiff.Dual{ForwardDiff.Tag{OrdinaryDiffEq.OrdinaryDiffEqTag, Float64}, Float64, 1}}, Float64, Rodas5{7, true, GenericLUFactorization{LinearAlgebra.RowMaximum}, typeof(OrdinaryDiffEq.DEFAULT_PRECS), Val{:forward}, true, nothing}}, OrdinaryDiffEq.DEOptions{Float64, Float64, Float64, Float64, PIController{Rational{Int64}}, typeof(DiffEqBase.ODE_DEFAULT_NORM), typeof(LinearAlgebra.opnorm), Nothing, CallbackSet{Tuple{}, Tuple{}}, var"#3#5", typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK), DataStructures.BinaryHeap{Float64, DataStructures.FasterForward}, DataStructures.BinaryHeap{Float64, DataStructures.FasterForward}, Nothing, Nothing, Int64, Tuple{}, Vector{Any}, Tuple{}}, Vector{Float64}, Float64, Nothing, OrdinaryDiffEq.DefaultInit})
@ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/ZgJ9s/src/solve.jl:477
[8] #__solve#502
@ ~/.julia/packages/OrdinaryDiffEq/ZgJ9s/src/solve.jl:5 [inlined]
[9] #solve_call#28
@ ~/.julia/packages/DiffEqBase/hZncn/src/solve.jl:428 [inlined]
[10] solve_up(prob::ODEProblem{Vector{Float64}, Tuple{Int64, Int64}, true, Vector{Float64}, ODEFunction{true, typeof(Glucose_Yeast2), LinearAlgebra.UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, sensealg::Nothing, u0::Vector{Float64}, p::Vector{Float64}, args::Rodas5{0, true, Nothing, typeof(OrdinaryDiffEq.DEFAULT_PRECS), Val{:forward}, true, nothing}; kwargs::Base.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:saveat, :isoutofdomain), Tuple{Vector{Any}, var"#3#5"}}})
@ DiffEqBase ~/.julia/packages/DiffEqBase/hZncn/src/solve.jl:726
[11] #solve#29
@ ~/.julia/packages/DiffEqBase/hZncn/src/solve.jl:710 [inlined]
[12] top-level scope
@ ~/Documents/julia_GPU_solver/Plot_GY_Test.jl:74
in expression starting at /Users/malmansto/Documents/julia_GPU_solver/Plot_GY_Test.jl:74

EnsembleGPUArray performance vs EnsembleSerial

Hi! Using the Lorenz example in the README, EnsembleGPUArray seems to be running quite a bit slower than all other methods, including EnsembleSerial. On my machine I get:

using DiffEqGPU, OrdinaryDiffEq
function lorenz(du,u,p,t)
    du[1] = p[1]*(u[2]-u[1])
    du[2] = u[1]*(p[2]-u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
end

u0 = Float32[1.0;0.0;0.0]
tspan = (0.0f0,100.0f0)
p = [10.0f0,28.0f0,8/3f0]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)

@time sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=10_000,saveat=1.0f0)
# 8.197300 seconds (21.42 M allocations: 1.551 GiB, 5.59% gc time)

@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)
# 45.863792 seconds (118.46 M allocations: 7.534 GiB, 4.07% gc time, 8.85% compilation time)

Currently on DiffEqGPU v.1.16.0 and OrdinaryDiffEq v6.6.6. GPU is an NVIDIA Quadro T2000, CUDA version 11.6.

Direct Compilation methods

Currently being prototyped in gpu_ode.jl with SimpleDiffEq.jl. Most of the issues that need to be worked out are in GPUifyLoops.jl

Support DynamicalODEFunction

See here. It is trying to get a field f from a DynamicalODEFunction (at least in my case) and this does not exist.

Looks like this particular object has f1 and f2 but not f, but I didn't dig any further than this.

Stack trace below.

julia> @btime solve(𝔢, EnsembleGPUArray(), trajectories=10^4);
ERROR: type DynamicalODEFunction has no field f
Stacktrace:
  [1] getproperty(x::Function, f::Symbol)
    @ Base ./Base.jl:33
  [2] generate_problem(prob::ODEProblem{ArrayPartition{Float32, Tuple{SVector{3, Float32}, SVector{3
, Float32}}}, Tuple{Float32, Float32}, false, SciMLBase.NullParameters, DynamicalODEFunction{false,
ODEFunction{false, DiffEqPhysics.var"#9#17"{var"#43#44", SVector{3, Float32}}, UniformScaling{Bool},
 Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing,
Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, ODEFunction{false, DiffEqPhysics.var"#11#19"{
var"#43#44", SVector{3, Float32}}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing
, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED),
 Nothing}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Noth
ing, Nothing, Nothing, Nothing, Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple
{(), Tuple{}}}, SciMLBase.StandardODEProblem}, u0::CUDA.CuArray{Float32, 2}, p::CUDA.CuArray{SciMLBa
se.NullParameters, 2}, jac_prototype::Nothing, colorvec::Nothing)
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:410
  [3] batch_solve_up(ensembleprob::EnsembleProblem{ODEProblem{ArrayPartition{Float32, Tuple{SVector{
3, Float32}, SVector{3, Float32}}}, Tuple{Float32, Float32}, false, SciMLBase.NullParameters, Dynami
calODEFunction{false, ODEFunction{false, DiffEqPhysics.var"#9#17"{var"#43#44", SVector{3, Float32}},
 UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothi
ng, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, ODEFunction{false, Diff
EqPhysics.var"#11#19"{var"#43#44", SVector{3, Float32}}, UniformScaling{Bool}, Nothing, Nothing, Not
hing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLB
ase.DEFAULT_OBSERVED), Nothing}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing,
Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, Base.Iterators.Pairs{Union{}, Union{
}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#7#9", var"#8#10", typeof(S
ciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{ArrayPartition{Float32, Tuple{SVecto
r{3, Float32}, SVector{3, Float32}}}, Tuple{Float32, Float32}, false, SciMLBase.NullParameters, Dyna
micalODEFunction{false, ODEFunction{false, DiffEqPhysics.var"#9#17"{var"#43#44", SVector{3, Float32}
}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Not
hing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, ODEFunction{false, Di
ffEqPhysics.var"#11#19"{var"#43#44", SVector{3, Float32}}, UniformScaling{Bool}, Nothing, Nothing, N
othing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciM
LBase.DEFAULT_OBSERVED), Nothing}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing
, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, Base.Iterators.Pairs{Union{}, Unio
n{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::Nothing, ensemblealg::E
nsembleGPUArray, I::UnitRange{Int64}, u0::Matrix{Float32}, p::Matrix{SciMLBase.NullParameters}; kwar
gs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,)
, Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:311
  [4] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{ArrayPartition{Float32, Tuple{SVector{3,
Float32}, SVector{3, Float32}}}, Tuple{Float32, Float32}, false, SciMLBase.NullParameters, Dynamical
ODEFunction{false, ODEFunction{false, DiffEqPhysics.var"#9#17"{var"#43#44", SVector{3, Float32}}, Un
iformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing,
 Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, ODEFunction{false, DiffEqP
hysics.var"#11#19"{var"#43#44", SVector{3, Float32}}, UniformScaling{Bool}, Nothing, Nothing, Nothin
g, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase
.DEFAULT_OBSERVED), Nothing}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Not
hing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, Base.Iterators.Pairs{Union{}, Union{},
Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#7#9", var"#8#10", typeof(SciM
LBase.DEFAULT_REDUCTION), Nothing}, alg::Nothing, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}
; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_c
heck,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:284
  [5] macro expansion
    @ ./timing.jl:287 [inlined]
  [6] __solve(ensembleprob::EnsembleProblem{ODEProblem{ArrayPartition{Float32, Tuple{SVector{3, Floa
t32}, SVector{3, Float32}}}, Tuple{Float32, Float32}, false, SciMLBase.NullParameters, DynamicalODEF
unction{false, ODEFunction{false, DiffEqPhysics.var"#9#17"{var"#43#44", SVector{3, Float32}}, Unifor
mScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Not
hing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, ODEFunction{false, DiffEqPhysi
cs.var"#11#19"{var"#43#44", SVector{3, Float32}}, UniformScaling{Bool}, Nothing, Nothing, Nothing, N
othing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEF
AULT_OBSERVED), Nothing}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing
, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tupl
e{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, var"#7#9", var"#8#10", typeof(SciMLBas
e.DEFAULT_REDUCTION), Nothing}, alg::Nothing, ensemblealg::EnsembleGPUArray; trajectories::Int64, ba
tch_size::Int64, unstable_check::Function, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, N
amedTuple{(), Tuple{}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:201
  [7] #solve#61
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:94 [inlined]
  [8] var"##core#311"()
    @ Main ~/.julia/packages/BenchmarkTools/Rqvks/src/execution.jl:479
  [9] var"##sample#312"(__params::BenchmarkTools.Parameters)
    @ Main ~/.julia/packages/BenchmarkTools/Rqvks/src/execution.jl:485
 [10] _run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; verbose::Bool, pad::String, kw
args::Base.Iterators.Pairs{Symbol, Integer, NTuple{4, Symbol}, NamedTuple{(:samples, :evals, :gctria
l, :gcsample), Tuple{Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/Rqvks/src/execution.jl:98
 [11] #invokelatest#2
    @ ./essentials.jl:710 [inlined]
 [12] #run_result#45
    @ ~/.julia/packages/BenchmarkTools/Rqvks/src/execution.jl:33 [inlined]
 [13] run(b::BenchmarkTools.Benchmark, p::BenchmarkTools.Parameters; progressid::Nothing, nleaves::F
loat64, ndone::Float64, kwargs::Base.Iterators.Pairs{Symbol, Integer, NTuple{5, Symbol}, NamedTuple{
(:verbose, :samples, :evals, :gctrial, :gcsample), Tuple{Bool, Int64, Int64, Bool, Bool}}})
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/Rqvks/src/execution.jl:116
 [14] #warmup#54
    @ ~/.julia/packages/BenchmarkTools/Rqvks/src/execution.jl:168 [inlined]
 [15] warmup(item::BenchmarkTools.Benchmark)
    @ BenchmarkTools ~/.julia/packages/BenchmarkTools/Rqvks/src/execution.jl:168
 [16] top-level scope
    @ ~/.julia/packages/BenchmarkTools/Rqvks/src/execution.jl:565

Default internalnorm

Right now the default internalnorm just does the Harier norm on the whole. However it might be better to re-weigh it by applying the Hairer norm along the individual problem slices, and then using the minimum of them. This would then make sure each trajectory is appropriately error handled instead handling the error in the average trajectory.

Constant static matrices cause compilation issues

this example should not allocate:

using LinearAlgebra, DifferentialEquations, DiffEqGPU
A  = Float32[1. 0  0 -5; 4 -2  4 -3; -4  0  0  1; 5 -2  2  3]
u0 = Array{Float32}(rand(4,2))
tspan = (0.0f0,1.0f0)
function fip(du,u,p,t) 
   mul!(du, A, u)
end
probip = ODEProblem(fip,u0,tspan)
function prob_func(prob, i, repeat)
    prob
end
ensprobip=EnsembleProblem(probip, prob_func=prob_func)
trajs = 123
simgpu = solve(ensprobip, Tsit5(), EnsembleGPUArray(), trajectories=trajs)

i ran it on a linux machine with CUDA 9.0 and several Tesla K80 GPUs with compute capability 3.7.

with EnsembleCPUArray it works fine, but it fails to compile on GPU with the stack trace:

InvalidIRError: compiling gpu_kernel(Cassette.Context{nametype(Ctx),Nothing,Nothing,getfield(GPUifyLoops, Symbol("##PassType#371")),Nothing,Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_kernel), ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}, CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{DiffEqBase.NullParameters,2,CUDAnative.AS.Global}, Float32) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] fip at In[1]:10
 [2] ODEFunction at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqBase/8uyX3/src/diffeqfunction.jl:230
 [3] gpu_kernel at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqGPU/QB1WC/src/DiffEqGPU.jl:6
 [4] overdub at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/Cassette/YCOeN/src/overdub.jl:0

Stacktrace:
 [1] check_ir(::CUDAnative.CompilerJob, ::LLVM.Module) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/CUDAnative/UWBIY/src/compiler/validation.jl:114
 [2] macro expansion at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/CUDAnative/UWBIY/src/compiler/driver.jl:188 [inlined]
 [3] macro expansion at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/TimerOutputs/7zSea/src/TimerOutput.jl:216 [inlined]
 [4] #codegen#130(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.codegen), ::Symbol, ::CUDAnative.CompilerJob) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/CUDAnative/UWBIY/src/compiler/driver.jl:186
 [5] #codegen at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/CUDAnative/UWBIY/src/compiler/driver.jl:0 [inlined]
 [6] #compile#129(::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::typeof(CUDAnative.compile), ::Symbol, ::CUDAnative.CompilerJob) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/CUDAnative/UWBIY/src/compiler/driver.jl:47
 [7] #compile at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/CUDAnative/UWBIY/src/compiler/common.jl:0 [inlined]
 [8] #compile#128 at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/CUDAnative/UWBIY/src/compiler/driver.jl:28 [inlined]
 [9] #compile at ./none:0 [inlined] (repeats 2 times)
 [10] macro expansion at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/CUDAnative/UWBIY/src/execution.jl:389 [inlined]
 [11] #cufunction#170(::String, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(CUDAnative.cufunction), ::typeof(Cassette.overdub), ::Type{Tuple{Cassette.Context{nametype(Ctx),Nothing,Nothing,getfield(GPUifyLoops, Symbol("##PassType#371")),Nothing,Cassette.DisableHooks},typeof(DiffEqGPU.gpu_kernel),ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{DiffEqBase.NullParameters,2,CUDAnative.AS.Global},Float32}}) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/CUDAnative/UWBIY/src/execution.jl:357
 [12] (::getfield(CUDAnative, Symbol("#kw##cufunction")))(::NamedTuple{(:name,),Tuple{String}}, ::typeof(CUDAnative.cufunction), ::Function, ::Type) at ./none:0
 [13] #launch#50(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(GPUifyLoops.launch), ::GPUifyLoops.CUDA, ::typeof(DiffEqGPU.gpu_kernel), ::Function, ::Vararg{Any,N} where N) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/GPUifyLoops/mjszO/src/GPUifyLoops.jl:125
 [14] launch at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/GPUifyLoops/mjszO/src/GPUifyLoops.jl:119 [inlined]
 [15] macro expansion at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/GPUifyLoops/mjszO/src/GPUifyLoops.jl:54 [inlined]
 [16] #12 at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqGPU/QB1WC/src/DiffEqGPU.jl:61 [inlined]
 [17] ODEFunction at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqBase/8uyX3/src/diffeqfunction.jl:230 [inlined]
 [18] initialize!(::OrdinaryDiffEq.ODEIntegrator{Tsit5,true,CuArrays.CuArray{Float32,2},Float32,CuArrays.CuArray{DiffEqBase.NullParameters,2},Float32,Float32,Float32,Array{CuArrays.CuArray{Float32,2},1},ODESolution{Float32,3,Array{CuArrays.CuArray{Float32,2},1},Nothing,Nothing,Array{Float32,1},Array{Array{CuArrays.CuArray{Float32,2},1},1},ODEProblem{CuArrays.CuArray{Float32,2},Tuple{Float32,Float32},true,CuArrays.CuArray{DiffEqBase.NullParameters,2},ODEFunction{true,getfield(DiffEqGPU, Symbol("##12#22")){ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}},UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},Tsit5,OrdinaryDiffEq.InterpolationData{ODEFunction{true,getfield(DiffEqGPU, Symbol("##12#22")){ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}},UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Array{CuArrays.CuArray{Float32,2},1},Array{Float32,1},Array{Array{CuArrays.CuArray{Float32,2},1},1},OrdinaryDiffEq.Tsit5Cache{CuArrays.CuArray{Float32,2},CuArrays.CuArray{Float32,2},CuArrays.CuArray{Float32,2},OrdinaryDiffEq.Tsit5ConstantCache{Float32,Float32}}},DiffEqBase.DEStats},ODEFunction{true,getfield(DiffEqGPU, Symbol("##12#22")){ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}},UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},OrdinaryDiffEq.Tsit5Cache{CuArrays.CuArray{Float32,2},CuArrays.CuArray{Float32,2},CuArrays.CuArray{Float32,2},OrdinaryDiffEq.Tsit5ConstantCache{Float32,Float32}},OrdinaryDiffEq.DEOptions{Float32,Float32,Float32,Float32,typeof(DiffEqBase.ODE_DEFAULT_NORM),typeof(opnorm),CallbackSet{Tuple{},Tuple{}},typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN),typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE),typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK),DataStructures.BinaryHeap{Float32,DataStructures.LessThan},DataStructures.BinaryHeap{Float32,DataStructures.LessThan},Nothing,Nothing,Int64,Array{Float32,1},Array{Float32,1},Array{Float32,1}},CuArrays.CuArray{Float32,2},Float32,Nothing}, ::OrdinaryDiffEq.Tsit5Cache{CuArrays.CuArray{Float32,2},CuArrays.CuArray{Float32,2},CuArrays.CuArray{Float32,2},OrdinaryDiffEq.Tsit5ConstantCache{Float32,Float32}}) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/OrdinaryDiffEq/tQd6p/src/perform_step/low_order_rk_perform_step.jl:623
 [19] #__init#335(::Array{Float32,1}, ::Array{Float32,1}, ::Array{Float32,1}, ::Nothing, ::Bool, ::Bool, ::Bool, ::Bool, ::Nothing, ::Bool, ::Bool, ::Float32, ::Float32, ::Float32, ::Bool, ::Bool, ::Rational{Int64}, ::Nothing, ::Nothing, ::Rational{Int64}, ::Int64, ::Int64, ::Int64, ::Rational{Int64}, ::Bool, ::Int64, ::Nothing, ::Nothing, ::Int64, ::typeof(DiffEqBase.ODE_DEFAULT_NORM), ::typeof(opnorm), ::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), ::typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK), ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Bool, ::Int64, ::String, ::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), ::Nothing, ::Bool, ::Bool, ::Bool, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(DiffEqBase.__init), ::ODEProblem{CuArrays.CuArray{Float32,2},Tuple{Float32,Float32},true,CuArrays.CuArray{DiffEqBase.NullParameters,2},ODEFunction{true,getfield(DiffEqGPU, Symbol("##12#22")){ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}},UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem}, ::Tsit5, ::Array{CuArrays.CuArray{Float32,2},1}, ::Array{Float32,1}, ::Array{Any,1}, ::Type{Val{true}}) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/OrdinaryDiffEq/tQd6p/src/solve.jl:352
 [20] __init(::ODEProblem{CuArrays.CuArray{Float32,2},Tuple{Float32,Float32},true,CuArrays.CuArray{DiffEqBase.NullParameters,2},ODEFunction{true,getfield(DiffEqGPU, Symbol("##12#22")){ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}},UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem}, ::Tsit5, ::Array{CuArrays.CuArray{Float32,2},1}, ::Array{Float32,1}, ::Array{Any,1}, ::Type{Val{true}}) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/OrdinaryDiffEq/tQd6p/src/solve.jl:66 (repeats 4 times)
 [21] #__solve#334 at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/OrdinaryDiffEq/tQd6p/src/solve.jl:4 [inlined]
 [22] __solve at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/OrdinaryDiffEq/tQd6p/src/solve.jl:4 [inlined]
 [23] #solve_call#425(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(DiffEqBase.solve_call), ::ODEProblem{CuArrays.CuArray{Float32,2},Tuple{Float32,Float32},true,CuArrays.CuArray{DiffEqBase.NullParameters,2},ODEFunction{true,getfield(DiffEqGPU, Symbol("##12#22")){ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}},UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem}, ::Tsit5) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqBase/8uyX3/src/solve.jl:40
 [24] solve_call at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqBase/8uyX3/src/solve.jl:37 [inlined]
 [25] #solve#426 at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqBase/8uyX3/src/solve.jl:57 [inlined]
 [26] solve(::ODEProblem{CuArrays.CuArray{Float32,2},Tuple{Float32,Float32},true,CuArrays.CuArray{DiffEqBase.NullParameters,2},ODEFunction{true,getfield(DiffEqGPU, Symbol("##12#22")){ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}},UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem}, ::Tsit5) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqBase/8uyX3/src/solve.jl:45
 [27] #batch_solve#5(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(DiffEqGPU.batch_solve), ::EnsembleProblem{ODEProblem{Array{Float32,2},Tuple{Float32,Float32},true,DiffEqBase.NullParameters,ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},typeof(prob_func),getfield(DiffEqBase, Symbol("##332#338")),getfield(DiffEqBase, Symbol("##334#340")),Array{Any,1}}, ::Tsit5, ::EnsembleGPUArray, ::UnitRange{Int64}) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqGPU/QB1WC/src/DiffEqGPU.jl:66
 [28] batch_solve at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqGPU/QB1WC/src/DiffEqGPU.jl:45 [inlined]
 [29] (::getfield(DiffEqGPU, Symbol("##3#4")){Int64,Int64,Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},EnsembleProblem{ODEProblem{Array{Float32,2},Tuple{Float32,Float32},true,DiffEqBase.NullParameters,ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},typeof(prob_func),getfield(DiffEqBase, Symbol("##332#338")),getfield(DiffEqBase, Symbol("##334#340")),Array{Any,1}},Tsit5,EnsembleGPUArray})(::Int64) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqGPU/QB1WC/src/DiffEqGPU.jl:37
 [30] iterate at ./generator.jl:47 [inlined]
 [31] _collect(::UnitRange{Int64}, ::Base.Generator{UnitRange{Int64},getfield(DiffEqGPU, Symbol("##3#4")){Int64,Int64,Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},EnsembleProblem{ODEProblem{Array{Float32,2},Tuple{Float32,Float32},true,DiffEqBase.NullParameters,ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},typeof(prob_func),getfield(DiffEqBase, Symbol("##332#338")),getfield(DiffEqBase, Symbol("##334#340")),Array{Any,1}},Tsit5,EnsembleGPUArray}}, ::Base.EltypeUnknown, ::Base.HasShape{1}) at ./array.jl:619
 [32] collect_similar(::UnitRange{Int64}, ::Base.Generator{UnitRange{Int64},getfield(DiffEqGPU, Symbol("##3#4")){Int64,Int64,Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},EnsembleProblem{ODEProblem{Array{Float32,2},Tuple{Float32,Float32},true,DiffEqBase.NullParameters,ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},typeof(prob_func),getfield(DiffEqBase, Symbol("##332#338")),getfield(DiffEqBase, Symbol("##334#340")),Array{Any,1}},Tsit5,EnsembleGPUArray}}) at ./array.jl:548
 [33] map(::Function, ::UnitRange{Int64}) at ./abstractarray.jl:2073
 [34] macro expansion at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqGPU/QB1WC/src/DiffEqGPU.jl:31 [inlined]
 [35] macro expansion at ./util.jl:213 [inlined]
 [36] #__solve#2(::Int64, ::Int64, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(DiffEqBase.__solve), ::EnsembleProblem{ODEProblem{Array{Float32,2},Tuple{Float32,Float32},true,DiffEqBase.NullParameters,ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},typeof(prob_func),getfield(DiffEqBase, Symbol("##332#338")),getfield(DiffEqBase, Symbol("##334#340")),Array{Any,1}}, ::Tsit5, ::EnsembleGPUArray) at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqGPU/QB1WC/src/DiffEqGPU.jl:30
 [37] #__solve at ./none:0 [inlined]
 [38] #solve#427 at /home/bq_nbecker/isi/julia_pkg/GPUTest/packages/DiffEqBase/8uyX3/src/solve.jl:64 [inlined]
 [39] (::getfield(DiffEqBase, Symbol("#kw##solve")))(::NamedTuple{(:trajectories,),Tuple{Int64}}, ::typeof(solve), ::EnsembleProblem{ODEProblem{Array{Float32,2},Tuple{Float32,Float32},true,DiffEqBase.NullParameters,ODEFunction{true,typeof(fip),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},typeof(prob_func),getfield(DiffEqBase, Symbol("##332#338")),getfield(DiffEqBase, Symbol("##334#340")),Array{Any,1}}, ::Tsit5, ::EnsembleGPUArray) at ./none:0
 [40] top-level scope at In[3]:15

RuntimeGeneratedFunctions are not compatible with CUDA.jl (ModelingToolkit.jl generated functions)

Hello all! This is related to this post where I was attempting to recreate the lorentz equations example in the DiffEqGPU.jl but instead of providing numerical functions I wanted to use ModelingToolkit.jl to handle the jacobian and time gradient.

Here's how to reproduce the error:

function lorenz(du,u,p,t)
    du[1] = p[1]*(u[2]-u[1])
    du[2] = u[1]*(p[2]-u[3]) - u[2]
    du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = Float32[1.0;0.0;0.0]
tspan = (0.0f0,100.0f0)
p = [10.0f0,28.0f0,8/3f0]
prob = ODEProblem(lorenz,u0,tspan,p)
sys = modelingtoolkitize(prob)
func = ODEFunction(sys,jac=true,tgrad=true)
prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p)
prob_jac = ODEProblem(func,u0,tspan,p)
monteprob_jac = EnsembleProblem(prob_jac, prob_func = prob_func)
@time solve(monteprob_jac,Rodas5(),EnsembleGPUArray(),dt=0.1,trajectories=10_000,saveat=1.0f0)

ERROR: GPU compilation of kernel #gpu_gpu_kernel(KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, ModelingToolkit.var"#f#396"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(:ˍ₋arg1, :ˍ₋arg2, :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(:ˍ₋out, :ˍ₋arg1, :ˍ₋arg2, :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr}}, CUDA.CuDeviceMatrix{Float32, 1}, CUDA.CuDeviceMatrix{Float32, 1}, CUDA.CuDeviceMatrix{Float32, 1}, Float32) failed
KernelError: passing and using non-bitstype argument

Argument 3 to your kernel function is of type ModelingToolkit.var"#f#396"{RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(:ˍ₋arg1, :ˍ₋arg2, :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr}, RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(:ˍ₋out, 
:ˍ₋arg1, :ˍ₋arg2, :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr}}, which is not isbits:
  .f_oop is of type RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(:ˍ₋arg1, :ˍ₋arg2, :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr} which is not isbits.
    .body is of type Expr which is not isbits.
      .head is of type Symbol which is not isbits.
      .args is of type Vector{Any} which is not isbits.
  .f_iip is of type RuntimeGeneratedFunctions.RuntimeGeneratedFunction{(:ˍ₋out, :ˍ₋arg1, :ˍ₋arg2, :t), ModelingToolkit.var"#_RGF_ModTag", ModelingToolkit.var"#_RGF_ModTag", Expr} which is not isbits.
    .body is of type Expr which is not isbits.
      .head is of type Symbol which is not isbits.
      .args is of type Vector{Any} which is not isbits.

As @ChrisRackauckas pointed out:

Seems like the way we automatically generate the functions doesn’t play nicely with CUDA.jl

Support terminate!

We will need to think about this one. Actually terminating will cause the others not to finish. I think that terminate should just set an array to remember the time, and then we truncate back to the terminated times after solving.

Simple SDE only works with one trajectory in ensemble

Based on the DiffEq documentation, I've written this minimal example. If I set trajectories = 1, it compiles properly and everything is fine. But when trajectories = 32, say, I get errors relating to calling overdub from within the kernel. I'm very much a Julia noob, but this seems weird to me. The CUDA debugging settings aren't really helping shed any light on the situation. I've been working on this all day and can't seem to make forward progress. Hoping to get some help here!

Example code:

using CUDA
using Plots
using DifferentialEquations
using DiffEqGPU

n = 16

function μ_cox(du, u, p, t)
    du .= -1.0f0 * u
    return nothing
end

function σ_cox(du, u, p, t)
    du .= 0.1f0 * u .^ 2;
    return nothing
end

params = CUDA.zeros(Float32, 1)
initial = CUDA.ones(Float32, n)
prob = SDEProblem(μ_cox, σ_cox, initial, (0.0f0, 1.0f1), p = params)

ensembleprob = EnsembleProblem(prob, safetycopy = false)
sol = solve(ensembleprob, Tsit5(), EnsembleGPUArray(), trajectories = 32)

Example error:

InvalidIRError: compiling kernel gpu_gpu_kernel(Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel), typeof(μ_cox), CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{SciMLBase.NullParameters, 1}, Float32) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_f_tuple)
Stacktrace:
 [1] overdub
   @ ~/.julia/packages/Cassette/r4kKQ/src/overdub.jl:635
 [2] throw_inexacterror(::Symbol, ::Type{UInt64}, ::Int64)
   @ ./boot.jl:602
 [3] overdub
   @ ./boot.jl:602
 [4] overdub
   @ ~/.julia/packages/Cassette/r4kKQ/src/overdub.jl:0
 [5] multiple call sites
   @ unknown:0

Unitful support

using OrdinaryDiffEq, DiffEqGPU, Test, Unitful

function lorenz(du,u,p,t)
 @inbounds begin
     du[1] = p[1]*(u[2]-u[1])
     du[2] = u[1]*(p[2]-u[3]) - u[2]
     du[3] = u[1]*u[2] - p[3]*u[3]
 end
 nothing
end

u0 = [1f0u"m";0u"m";0u"m"]
tspan = (0.0f0u"s",100.0f0u"s")
p = (10.0f0,28.0f0,8/3f0)
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func)
@test_broken sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0u"s")

Scalar indexing in continuous callbacks with EnsembleGPUArray

29d90b2

is marks continuous callback usage broken because of scalar indexing in https://github.com/JuliaDiffEq/DiffEqBase.jl/blob/master/src/init.jl#L150 . @maleadt I thought this was fixed on CuArrays.jl?

EnsembleGPUArray() doesn't work for EnsembleProblem() over different initial conditions

This is the code to define an ensemble problem over different initial conditions:

using Pkg
Pkg.activate(".")
using DiffEqGPU, DifferentialEquations
function lorenz(du,u,p,t)
 @inbounds begin
     du[1] = p[1]*(u[2]-u[1])
     du[2] = u[1]*(p[2]-u[3]) - u[2]
     du[3] = u[1]*u[2] - p[3]*u[3]
 end
 nothing
end

u0 = Float32[1.0;0.0;0.0]
tspan = (0.0f0,100.0f0)
p = (10.0f0,28.0f0,8/3f0)
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,u0=u0 .+ 0.1f0.*rand(Float32,3))
monteprob = EnsembleProblem(prob, prob_func = prob_func)
@time sol_cpu = solve(monteprob,Tsit5(),EnsembleThreads(),trajectories=10_000,saveat=1.0f0)
@time sol_gpu = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)

which works with EnsembleThreads() but not with EnsembleGPUArray. The error is:

Activating environment at `~/Documents/Work/NormalFormAE/Project.toml`
 26.791530 seconds (91.47 M allocations: 4.315 GiB, 4.67% gc time)
ERROR: LoadError: InvalidIRError: compiling gpu_kernel(Cassette.Context{nametype(Ctx),Nothing,Nothing,GPUifyLoops.var"##PassType#253",Nothing,Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_kernel), typeof(lorenz), CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Tuple{Float32,Float32,Float32},2,CUDAnative.AS.Global}, Float32) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to *)
Stacktrace:
 [1] call at /home/manu/.julia/packages/Cassette/158rp/src/context.jl:456
 [2] fallback at /home/manu/.julia/packages/Cassette/158rp/src/context.jl:454
 [3] _overdub_fallback at /home/manu/.julia/packages/Cassette/158rp/src/overdub.jl:536
 [4] lorenz at /home/manu/Documents/Work/NormalFormAE/test/DiffEqGPU_test.jl:6
 [5] gpu_kernel at /home/manu/.julia/packages/DiffEqGPU/senL8/src/DiffEqGPU.jl:9
 [6] overdub at /home/manu/.julia/packages/Cassette/158rp/src/overdub.jl:0
Stacktrace:
 [1] check_ir(::CUDAnative.CompilerJob, ::LLVM.Module) at /home/manu/.julia/packages/CUDAnative/ierw8/src/compiler/validation.jl:116
 [2] macro expansion at /home/manu/.julia/packages/CUDAnative/ierw8/src/compiler/driver.jl:186 [inlined]
 [3] macro expansion at /home/manu/.julia/packages/TimerOutputs/kJdxU/src/TimerOutput.jl:245 [inlined]
 [4] codegen(::Symbol, ::CUDAnative.CompilerJob; libraries::Bool, dynamic_parallelism::Bool, optimize::Bool, strip::Bool, strict::Bool) at /home/manu/.julia/packages/CUDAnative/ierw8/src/compiler/driver.jl:184
 [5] compile(::Symbol, ::CUDAnative.CompilerJob; libraries::Bool, dynamic_parallelism::Bool, optimize::Bool, strip::Bool, strict::Bool) at /home/manu/.julia/packages/CUDAnative/ierw8/src/compiler/driver.jl:45
 [6] #compile#171 at /home/manu/.julia/packages/CUDAnative/ierw8/src/compiler/driver.jl:33 [inlined]
 [7] cufunction_slow(::Function, ::Type{T} where T, ::Int64; name::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/manu/.julia/packages/CUDAnative/ierw8/src/execution.jl:326
 [8] #219 at /home/manu/.julia/packages/CUDAnative/ierw8/src/execution.jl:393 [inlined]
 [9] get!(::CUDAnative.var"#219#220"{String,Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},typeof(Cassette.overdub),DataType,Int64}, ::Dict{UInt64,CUDAnative.HostKernel}, ::UInt64) at ./dict.jl:452
 [10] macro expansion at /home/manu/.julia/packages/CUDAnative/ierw8/src/execution.jl:392 [inlined]
 [11] macro expansion at ./lock.jl:183 [inlined]
 [12] cufunction_fast(::Function, ::Type{T} where T, ::Int64; name::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/manu/.julia/packages/CUDAnative/ierw8/src/execution.jl:391
 [13] getproperty at ./Base.jl:33 [inlined]
 [14] merge at ./namedtuple.jl:235 [inlined]
 [15] cufunction(::typeof(Cassette.overdub), ::Type{Tuple{Cassette.Context{nametype(Ctx),Nothing,Nothing,GPUifyLoops.var"##PassType#253",Nothing,Cassette.DisableHooks},typeof(DiffEqGPU.gpu_kernel),typeof(lorenz),CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Float32,2,CUDAnative.AS.Global},CUDAnative.CuDeviceArray{Tuple{Float32,Float32,Float32},2,CUDAnative.AS.Global},Float32}}; kwargs::Base.Iterators.Pairs{Symbol,String,Tuple{Symbol},NamedTuple{(:name,),Tuple{String}}}) at /home/manu/.julia/packages/CUDAnative/ierw8/src/execution.jl:0
 [16] launch(::GPUifyLoops.CUDA, ::typeof(DiffEqGPU.gpu_kernel), ::Function, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/manu/.julia/packages/GPUifyLoops/cxUhR/src/GPUifyLoops.jl:126
 [17] launch at /home/manu/.julia/packages/GPUifyLoops/cxUhR/src/GPUifyLoops.jl:120 [inlined]
 [18] macro expansion at /home/manu/.julia/packages/GPUifyLoops/cxUhR/src/GPUifyLoops.jl:54 [inlined]
 [19] #28 at /home/manu/.julia/packages/DiffEqGPU/senL8/src/DiffEqGPU.jl:219 [inlined]
 [20] ODEFunction at /home/manu/.julia/packages/DiffEqBase/H1xRL/src/diffeqfunction.jl:248 [inlined]
 [21] initialize!(::OrdinaryDiffEq.ODEIntegrator{Tsit5,true,CuArrays.CuArray{Float32,2,Nothing},Nothing,Float32,CuArrays.CuArray{Tuple{Float32,Float32,Float32},2,Nothing},Float32,Float32,Float32,Array{CuArrays.CuArray{Float32,2,Nothing},1},ODESolution{Float32,3,Array{CuArrays.CuArray{Float32,2,Nothing},1},Nothing,Nothing,Array{Float32,1},Array{Array{CuArrays.CuArray{Float32,2,Nothing},1},1},ODEProblem{CuArrays.CuArray{Float32,2,Nothing},Tuple{Float32,Float32},true,CuArrays.CuArray{Tuple{Float32,Float32,Float32},2,Nothing},ODEFunction{true,DiffEqGPU.var"#28#32"{typeof(lorenz)},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},Tsit5,OrdinaryDiffEq.InterpolationData{ODEFunction{true,DiffEqGPU.var"#28#32"{typeof(lorenz)},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Array{CuArrays.CuArray{Float32,2,Nothing},1},Array{Float32,1},Array{Array{CuArrays.CuArray{Float32,2,Nothing},1},1},OrdinaryDiffEq.Tsit5Cache{CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,2,Nothing},OrdinaryDiffEq.Tsit5ConstantCache{Float32,Float32}}},DiffEqBase.DEStats},ODEFunction{true,DiffEqGPU.var"#28#32"{typeof(lorenz)},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},OrdinaryDiffEq.Tsit5Cache{CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,2,Nothing},OrdinaryDiffEq.Tsit5ConstantCache{Float32,Float32}},OrdinaryDiffEq.DEOptions{Float32,Float32,Float32,Float32,typeof(DiffEqBase.ODE_DEFAULT_NORM),typeof(LinearAlgebra.opnorm),CallbackSet{Tuple{},Tuple{}},typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN),typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE),typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK),DataStructures.BinaryHeap{Float32,DataStructures.LessThan},DataStructures.BinaryHeap{Float32,DataStructures.LessThan},Nothing,Nothing,Int64,Array{Float32,1},Float32,Array{Float32,1}},CuArrays.CuArray{Float32,2,Nothing},Float32,Nothing,OrdinaryDiffEq.DefaultInit}, ::OrdinaryDiffEq.Tsit5Cache{CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,2,Nothing},OrdinaryDiffEq.Tsit5ConstantCache{Float32,Float32}}) at /home/manu/.julia/packages/OrdinaryDiffEq/z9kDO/src/perform_step/low_order_rk_perform_step.jl:623
 [22] __init(::ODEProblem{CuArrays.CuArray{Float32,2,Nothing},Tuple{Float32,Float32},true,CuArrays.CuArray{Tuple{Float32,Float32,Float32},2,Nothing},ODEFunction{true,DiffEqGPU.var"#28#32"{typeof(lorenz)},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem}, ::Tsit5, ::Array{CuArrays.CuArray{Float32,2,Nothing},1}, ::Array{Float32,1}, ::Array{Any,1}, ::Type{Val{true}}; saveat::Float32, tstops::Array{Float32,1}, d_discontinuities::Array{Float32,1}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Bool, callback::Nothing, dense::Bool, calck::Bool, dt::Float32, dtmin::Nothing, dtmax::Float32, force_dtmin::Bool, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmin::Rational{Int64}, qmax::Int64, qsteady_min::Int64, qsteady_max::Int64, qoldinit::Rational{Int64}, fullnormalize::Bool, failfactor::Int64, beta1::Nothing, beta2::Nothing, maxiters::Int64, internalnorm::typeof(DiffEqBase.ODE_DEFAULT_NORM), internalopnorm::typeof(LinearAlgebra.opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK), verbose::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias_u0::Bool, alias_du0::Bool, initializealg::OrdinaryDiffEq.DefaultInit, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/manu/.julia/packages/OrdinaryDiffEq/z9kDO/src/solve.jl:402
 [23] #__solve#352 at /home/manu/.julia/packages/OrdinaryDiffEq/z9kDO/src/solve.jl:4 [inlined]
 [24] (::DiffEqBase.var"#466#467"{ODEProblem{CuArrays.CuArray{Float32,2,Nothing},Tuple{Float32,Float32},true,CuArrays.CuArray{Tuple{Float32,Float32,Float32},2,Nothing},ODEFunction{true,DiffEqGPU.var"#28#32"{typeof(lorenz)},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},Tuple{Tsit5}})() at /home/manu/.julia/packages/DiffEqBase/H1xRL/src/solve.jl:49
 [25] maybe_with_logger at /home/manu/.julia/packages/DiffEqBase/H1xRL/src/utils.jl:259 [inlined]
 [26] solve_call(::ODEProblem{CuArrays.CuArray{Float32,2,Nothing},Tuple{Float32,Float32},true,CuArrays.CuArray{Tuple{Float32,Float32,Float32},2,Nothing},ODEFunction{true,DiffEqGPU.var"#28#32"{typeof(lorenz)},LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem}, ::Tsit5; merge_callbacks::Bool, kwargs::Base.Iterators.Pairs{Symbol,Union{Nothing, Float32},Tuple{Symbol,Symbol},NamedTuple{(:callback, :saveat),Tuple{Nothing,Float32}}}) at /home/manu/.julia/packages/DiffEqBase/H1xRL/src/solve.jl:48
 [27] #solve#468 at /home/manu/.julia/packages/DiffEqBase/H1xRL/src/solve.jl:66 [inlined]
 [28] batch_solve(::EnsembleProblem{ODEProblem{Array{Float32,1},Tuple{Float32,Float32},true,Tuple{Float32,Float32,Float32},ODEFunction{true,typeof(lorenz),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},var"#3#4",DiffEqBase.var"#342#348",DiffEqBase.var"#344#350",Array{Any,1}}, ::Tsit5, ::EnsembleGPUArray, ::UnitRange{Int64}; kwargs::Base.Iterators.Pairs{Symbol,Float32,Tuple{Symbol},NamedTuple{(:saveat,),Tuple{Float32}}}) at /home/manu/.julia/packages/DiffEqGPU/senL8/src/DiffEqGPU.jl:206
 [29] (::DiffEqGPU.var"#4#6"{Int64,Int64,Base.Iterators.Pairs{Symbol,Float32,Tuple{Symbol},NamedTuple{(:saveat,),Tuple{Float32}}},EnsembleProblem{ODEProblem{Array{Float32,1},Tuple{Float32,Float32},true,Tuple{Float32,Float32,Float32},ODEFunction{true,typeof(lorenz),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},var"#3#4",DiffEqBase.var"#342#348",DiffEqBase.var"#344#350",Array{Any,1}},Tsit5,EnsembleGPUArray})(::Int64) at /home/manu/.julia/packages/DiffEqGPU/senL8/src/DiffEqGPU.jl:149
 [30] iterate at ./generator.jl:47 [inlined]
 [31] _collect(::UnitRange{Int64}, ::Base.Generator{UnitRange{Int64},DiffEqGPU.var"#4#6"{Int64,Int64,Base.Iterators.Pairs{Symbol,Float32,Tuple{Symbol},NamedTuple{(:saveat,),Tuple{Float32}}},EnsembleProblem{ODEProblem{Array{Float32,1},Tuple{Float32,Float32},true,Tuple{Float32,Float32,Float32},ODEFunction{true,typeof(lorenz),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},var"#3#4",DiffEqBase.var"#342#348",DiffEqBase.var"#344#350",Array{Any,1}},Tsit5,EnsembleGPUArray}}, ::Base.EltypeUnknown, ::Base.HasShape{1}) at ./array.jl:678
 [32] collect_similar(::UnitRange{Int64}, ::Base.Generator{UnitRange{Int64},DiffEqGPU.var"#4#6"{Int64,Int64,Base.Iterators.Pairs{Symbol,Float32,Tuple{Symbol},NamedTuple{(:saveat,),Tuple{Float32}}},EnsembleProblem{ODEProblem{Array{Float32,1},Tuple{Float32,Float32},true,Tuple{Float32,Float32,Float32},ODEFunction{true,typeof(lorenz),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},var"#3#4",DiffEqBase.var"#342#348",DiffEqBase.var"#344#350",Array{Any,1}},Tsit5,EnsembleGPUArray}}) at ./array.jl:607
 [33] map(::Function, ::UnitRange{Int64}) at ./abstractarray.jl:2072
 [34] macro expansion at /home/manu/.julia/packages/DiffEqGPU/senL8/src/DiffEqGPU.jl:143 [inlined]
 [35] macro expansion at ./util.jl:234 [inlined]
 [36] __solve(::EnsembleProblem{ODEProblem{Array{Float32,1},Tuple{Float32,Float32},true,Tuple{Float32,Float32,Float32},ODEFunction{true,typeof(lorenz),LinearAlgebra.UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},DiffEqBase.StandardODEProblem},var"#3#4",DiffEqBase.var"#342#348",DiffEqBase.var"#344#350",Array{Any,1}}, ::Tsit5, ::EnsembleGPUArray; trajectories::Int64, batch_size::Int64, kwargs::Base.Iterators.Pairs{Symbol,Float32,Tuple{Symbol},NamedTuple{(:saveat,),Tuple{Float32}}}) at /home/manu/.julia/packages/DiffEqGPU/senL8/src/DiffEqGPU.jl:142
 [37] #solve#469 at /home/manu/.julia/packages/DiffEqBase/H1xRL/src/solve.jl:80 [inlined]
 [38] top-level scope at ./util.jl:175
 [39] include(::Module, ::String) at ./Base.jl:377
 [40] exec_options(::Base.JLOptions) at ./client.jl:288
 [41] _start() at ./client.jl:484
in expression starting at /home/manu/Documents/Work/NormalFormAE/test/DiffEqGPU_test.jl:21

My packages are:

  [c7e460c6] ArgParse v1.1.0
  [fbb218c0] BSON v0.2.6
  [6e4b80f9] BenchmarkTools v0.5.0
  [3895d2a7] CUDAapi v4.0.0
  [c5f51814] CUDAdrv v6.2.2
  [be33ccc6] CUDAnative v3.0.4
  [3a865a2d] CuArrays v2.0.1
  [071ae1c0] DiffEqGPU v1.3.0
  [0c46a032] DifferentialEquations v6.13.0
  [31c24e10] Distributions v0.23.2
  [5789e2e9] FileIO v1.2.4
  [587475ba] Flux v0.10.4
  [0c68f7d7] GPUArrays v3.1.0
  [033835bb] JLD2 v0.1.13
  [429524aa] Optim v0.20.6
  [91a5bcdd] Plots v1.0.11
  [8d666b04] PolyChaos v0.2.1
  [295af30f] Revise v2.6.1
  [9f7883ad] Tracker v0.2.6
  [e88e6eb3] Zygote v0.4.15
  [9a3f8284] Random

LinSolveGPUSplitFactorize factorization holder?

https://github.com/JuliaDiffEq/DiffEqGPU.jl/blob/master/src/DiffEqGPU.jl#L190

That array is causing issues in the kernels trying to be called because it exists on the CPU, but you cannot do an Array{CuArrays.CUSOLVER.CuQR{T,CuArray{T,2}}} on the GPU. @vchuravy is there a workaround for this with GPUifyLoops?

Implicit ODE solver speed

using DiffEqGPU, CuArrays, OrdinaryDiffEq, Test

function rober_f(internal_var___du,internal_var___u,internal_var___p,t)
    @inbounds begin
        internal_var___du[1] = -(internal_var___p[1]) * internal_var___u[1] + internal_var___p[3] * internal_var___u[2] * internal_var___u[3]
        internal_var___du[2] = (internal_var___p[1] * internal_var___u[1] - internal_var___p[2] * internal_var___u[2] ^ 2) - internal_var___p[3] * internal_var___u[2] * internal_var___u[3]
        internal_var___du[3] = internal_var___p[2] * internal_var___u[2] ^ 2
    end
    nothing
end

function rober_jac(internal_var___J,internal_var___u,internal_var___p,t)
    @inbounds begin
        internal_var___J[1, 1] = -(internal_var___p[1])
        internal_var___J[1, 2] = internal_var___p[3] * internal_var___u[3]
        internal_var___J[1, 3] = internal_var___p[3] * internal_var___u[2]
        internal_var___J[2, 1] = internal_var___p[1] * 1
        internal_var___J[2, 2] = -2 * internal_var___p[2] * internal_var___u[2] - internal_var___p[3] * internal_var___u[3]
        internal_var___J[2, 3] = -(internal_var___p[3]) * internal_var___u[2]
        internal_var___J[3, 1] = 0 * 1
        internal_var___J[3, 2] = 2 * internal_var___p[2] * internal_var___u[2]
        internal_var___J[3, 3] = 0 * 1
    end
    nothing
end

CuArrays.allowscalar(false)
rober_prob = ODEProblem(ODEFunction(rober_f,jac=rober_jac),Float32[1.0,0.0,0.0],(0.0,1f5),(4f-2,3f7,1f4))
sol = solve(rober_prob,Rodas5(),abstol=1f-8,reltol=1f-8)
const pre_p = [rand(Float32,3) for i in 1:100_000]
prob_func = (prob,i,repeat) -> remake(prob,p=pre_p[i].*p)

rober_monteprob = EnsembleProblem(rober_prob, prob_func = prob_func)

@time sol = solve(rober_monteprob,TRBDF2(linsolve=LinSolveGPUSplitFactorize()),
                  EnsembleGPUArray(),trajectories=100,saveat=1.0f0,abstol=1f-8,reltol=1f-8)
@time sol = solve(rober_monteprob,TRBDF2(linsolve=LinSolveGPUSplitFactorize()),
                  EnsembleCPUArray(),trajectories=100,saveat=1.0f0,abstol=1f-8,reltol=1f-8)
@time sol = solve(monteprob,TRBDF2(),EnsembleThreads(),trajectories=100,
                  abstol=1e-8,reltol=1e-8,saveat=1.0f0)

Currently on stiff ODEs the GPU version is slower.

Compilation of affect! closures for EnsembleGPUArray

EnsembleGPUArrays do not compile when using a closure for the affect! function.
The example does work with EnsembleThreads.

using DifferentialEquations
function f!(du,u,p,t)
    du[1] = -u[1] + p[1]
    nothing
end
function affect!_generator(p_parametrized,tstops)
    function affect!(integrator)
        k = findfirst(isequal(integrator.t),tstops)
        integrator.p[:] .= p_parametrized[k]
    end
end
N = 10
u0 = [1.0]
tspan = (0.0,50.0)
p_parametrized = rand(5,1)
p = [p_parametrized[1]]
prob = ODEProblem(f!,u0,tspan,p)
tstops = 0.0:10.0:40.0
affect! = affect!_generator(p_parametrized,tstops)
cb = PresetTimeCallback(tstops,affect!;save_positions = (false,false))

const pre_u0 = [rand(1) for i in 1:N]
prob_func = (prob,i,repeat) -> remake(prob,u0=pre_u0[i].*u0)
monteprob = EnsembleProblem(prob, prob_func = prob_func)
sol = solve(monteprob,Tsit5(),EnsembleThreads(),callback = cb,trajectories=N)
using Plots
plot(sol)

using CuArrays, DiffEqGPU 
CuArrays.allowscalar(false)
sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),callback = cb,trajectories=N)

Minimal working EnsembleGPUArray() example

Hello, can not get the EnsembleGPUArray() to work on any of my machines. Simple solve() function on CUDA array works fine, however, when asked to do ensemble on gpu, it gives error "unsupported call to the Julia runtime (call to jl_f_tuple)".

Code:

using DifferentialEquations, CUDA, LinearAlgebra, DiffEqGPU
u0 = cu(rand(1000))
A  = cu(randn(1000,1000))
function f(du,u,p,t) 
    mul!(du,A,u)
    return nothing
end
prob = ODEProblem(f,u0,(0.0f0,1.0f0))
sol = solve(prob, Tsit5()) # this works
monteprob = EnsembleProblem(prob, safetycopy=false)
sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10000,saveat=1.0f0)

Output:

RROR: LoadError: InvalidIRError: compiling kernel gpu_gpu_kernel(Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel), typeof(f), CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{SciMLBase.NullParameters, 1}, Float32) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_f_tuple)
...
Stacktrace:
  [1] ntuple(::Adapt.var"
    @ ntuple.jl:19
  [2] overdub
    @ ntuple.jl:19
  [3] overdub
    @ ~/.julia/packages/Adapt/orWD2/src/base.jl:16
  [4] adapt(::KernelAbstractions.ConstAdaptor, ::typeof(f))
    @ ~/.julia/packages/Adapt/orWD2/src/Adapt.jl:40
  [5] overdub
    @ ~/.julia/packages/Adapt/orWD2/src/Adapt.jl:40
  [6] constify(::typeof(f))
    @ ~/.julia/packages/KernelAbstractions/fGTLM/src/KernelAbstractions.jl:313
  [7] overdub
    @ ~/.julia/packages/KernelAbstractions/fGTLM/src/KernelAbstractions.jl:313
  [8] gpu_gpu_kernel(::typeof(f), ::CuDeviceMatrix{Float32, 1}, ::CuDeviceMatrix{Float32, 1}, ::CuDeviceMatrix{SciMLBase.NullParameters, 1}, ::Float32)
    @ none:0
  [9] overdub
    @ none:0
 [10] overdub
    @ ~/.julia/packages/Cassette/jxIEh/src/overdub.jl:0
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Cassette.overdub), Tuple{Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel), typeof(f), CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{SciMLBase.NullParameters, 1}, Float32}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XwWPj/src/validation.jl:123
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/XwWPj/src/driver.jl:288 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/4QAIk/src/TimerOutput.jl:206 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/XwWPj/src/driver.jl:286 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module, kernel::LLVM.Function; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XwWPj/src/utils.jl:62
  [6] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/qEV3Y/src/compiler/execution.jl:306
  [7] check_cache
    @ ~/.julia/packages/GPUCompiler/XwWPj/src/cache.jl:44 [inlined]
  [8] cached_compilation
    @ ./none:0 [inlined]
  [9] cufunction(f::typeof(Cassette.overdub), tt::Type{Tuple{Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel), typeof(f), CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{SciMLBase.NullParameters, 1}, Float32}}; name::String, kwargs::Base.Iterators.Pairs{Symbol, Nothing, Tuple{Symbol}, NamedTuple{(:maxthreads,), Tuple{Nothing}}})
    @ CUDA ~/.julia/packages/CUDA/qEV3Y/src/compiler/execution.jl:294
 [10] macro expansion
    @ ~/.julia/packages/CUDA/qEV3Y/src/compiler/execution.jl:102 [inlined]
 [11] (::KernelAbstractions.Kernel{KernelAbstractions.CUDADevice, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.gpu_gpu_kernel)})(::Function, ::Vararg{Any, N} where N; ndrange::Int64, dependencies::KernelAbstractions.CudaEvent, workgroupsize::Int64, progress::Function)
    @ KernelAbstractions ~/.julia/packages/KernelAbstractions/fGTLM/src/backends/cuda.jl:187
 [12] (::DiffEqGPU.var"#55#59"{typeof(f)})(du::CuArray{Float32, 2}, u::CuArray{Float32, 2}, p::CuArray{SciMLBase.NullParameters, 2}, t::Float32)
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/eOLep/src/DiffEqGPU.jl:356
 [13] ODEFunction
    @ ~/.julia/packages/SciMLBase/zGIUN/src/scimlfunctions.jl:334 [inlined]
 [14] initialize!(integrator::OrdinaryDiffEq.ODEIntegrator{Tsit5, true, CuArray{Float32, 2}, Nothing, Float32, CuArray{SciMLBase.NullParameters, 2}, Float32, Float32, Float32, Vector{CuArray{Float32, 2}}, ODESolution{Float32, 3, Vector{CuArray{Float32, 2}}, Nothing, Nothing, Vector{Float32}, Vector{Vector{CuArray{Float32, 2}}}, ODEProblem{CuArray{Float32, 2}, Tuple{Float32, Float32}, true, CuArray{SciMLBase.NullParameters, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(f)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, Tsit5, OrdinaryDiffEq.InterpolationData{ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(f)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{CuArray{Float32, 2}}, Vector{Float32}, Vector{Vector{CuArray{Float32, 2}}}, OrdinaryDiffEq.Tsit5Cache{CuArray{Float32, 2}, CuArray{Float32, 2}, CuArray{Float32, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float32, Float32}}}, DiffEqBase.DEStats}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(f)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, OrdinaryDiffEq.Tsit5Cache{CuArray{Float32, 2}, CuArray{Float32, 2}, CuArray{Float32, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float32, Float32}}, OrdinaryDiffEq.DEOptions{Float32, Float32, Float32, Float32, typeof(DiffEqGPU.diffeqgpunorm), typeof(opnorm), Nothing, CallbackSet{Tuple{}, Tuple{}}, typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), DiffEqGPU.var"#10#16", DataStructures.BinaryMinHeap{Float32}, DataStructures.BinaryMinHeap{Float32}, Nothing, Nothing, Int64, Tuple{}, Float32, Tuple{}}, CuArray{Float32, 2}, Float32, Nothing, OrdinaryDiffEq.DefaultInit}, cache::OrdinaryDiffEq.Tsit5Cache{CuArray{Float32, 2}, CuArray{Float32, 2}, CuArray{Float32, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float32, Float32}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/5egkj/src/perform_step/low_order_rk_perform_step.jl:623
 [15] __init(prob::ODEProblem{CuArray{Float32, 2}, Tuple{Float32, Float32}, true, CuArray{SciMLBase.NullParameters, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(f)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::Tsit5, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}, recompile::Type{Val{true}}; saveat::Float32, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Nothing, callback::Nothing, dense::Bool, calck::Bool, dt::Float32, dtmin::Nothing, dtmax::Float32, force_dtmin::Bool, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmin::Rational{Int64}, qmax::Int64, qsteady_min::Int64, qsteady_max::Int64, qoldinit::Rational{Int64}, fullnormalize::Bool, failfactor::Int64, beta1::Nothing, beta2::Nothing, maxiters::Int64, internalnorm::typeof(DiffEqGPU.diffeqgpunorm), internalopnorm::typeof(opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::DiffEqGPU.var"#10#16", verbose::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias_u0::Bool, alias_du0::Bool, initializealg::OrdinaryDiffEq.DefaultInit, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/5egkj/src/solve.jl:433
 [16] #__solve#404
    @ ~/.julia/packages/OrdinaryDiffEq/5egkj/src/solve.jl:4 [inlined]
 [17] #solve_call#56
    @ ~/.julia/packages/DiffEqBase/rN9Px/src/solve.jl:61 [inlined]
 [18] solve_up(prob::ODEProblem{CuArray{Float32, 2}, Tuple{Float32, Float32}, true, CuArray{SciMLBase.NullParameters, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(f)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, sensealg::Nothing, u0::CuArray{Float32, 2}, p::CuArray{SciMLBase.NullParameters, 2}, args::Tsit5; kwargs::Base.Iterators.Pairs{Symbol, Any, NTuple{5, Symbol}, NamedTuple{(:unstable_check, :saveat, :callback, :merge_callbacks, :internalnorm), Tuple{DiffEqGPU.var"#10#16", Float32, Nothing, Bool, typeof(DiffEqGPU.diffeqgpunorm)}}})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/rN9Px/src/solve.jl:82
 [19] #solve#57
    @ ~/.julia/packages/DiffEqBase/rN9Px/src/solve.jl:70 [inlined]
 [20] batch_solve_up(ensembleprob::EnsembleProblem{ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, SciMLBase.NullParameters, ODEFunction{true, typeof(f), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(SciMLBase.DEFAULT_PROB_FUNC), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, SciMLBase.NullParameters, ODEFunction{true, typeof(f), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::Tsit5, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}, u0::Matrix{Float32}, p::Matrix{SciMLBase.NullParameters}; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#10#16", Float32}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/eOLep/src/DiffEqGPU.jl:261
 [21] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, SciMLBase.NullParameters, ODEFunction{true, typeof(f), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(SciMLBase.DEFAULT_PROB_FUNC), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}; kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:unstable_check, :saveat), Tuple{DiffEqGPU.var"#10#16", Float32}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/eOLep/src/DiffEqGPU.jl:226
 [22] macro expansion
    @ ./timing.jl:287 [inlined]
 [23] __solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{Float32}, Tuple{Float32, Float32}, true, SciMLBase.NullParameters, ODEFunction{true, typeof(f), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(SciMLBase.DEFAULT_PROB_FUNC), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleGPUArray; trajectories::Int64, batch_size::Int64, unstable_check::Function, kwargs::Base.Iterators.Pairs{Symbol, Float32, Tuple{Symbol}, NamedTuple{(:saveat,), Tuple{Float32}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/eOLep/src/DiffEqGPU.jl:143
 [24] #solve#59
    @ ~/.julia/packages/DiffEqBase/rN9Px/src/solve.jl:96 [inlined]
 [25] top-level scope
    @ ./timing.jl:210
in expression starting at /home/mantas/mantas/d2julia/gpu_test.jl:10

Sizable performance regression from KernelAbstractions update

using DiffEqGPU, OrdinaryDiffEq
function lorenz(du,u,p,t)
 @inbounds begin
     du[1] = p[1]*(u[2]-u[1])
     du[2] = u[1]*(p[2]-u[3]) - u[2]
     du[3] = u[1]*u[2] - p[3]*u[3]
 end
 nothing
end

u0 = Float32[1.0;0.0;0.0]
tspan = (0.0f0,100.0f0)
p = [10.0f0,28.0f0,8/3f0]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func)

using BenchmarkTools
@btime sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10_000,saveat=1.0f0)

# Before
1.875 s (3299926 allocations: 170.77 MiB)

# After
2.218 s (3388938 allocations: 166.80 MiB)

@vchuravy is this known?

Ensemble{C,G}PUArray and EnsembleThreads differ in sharing of parameter values

i generated a set of ODEs for a reaction network using DiffEqBiological. when building a parameter ensemble in which rate constants are randomly varied, i noticed that the DiffEqGPU ensemble algorithms always produced the same output, not respecting the modifications to the parameter vector prob.p of the ODEProblem made by my prob_func. EnsembleThreads did perform the modifications, as expected.

below is a runnable example. the final plots should show that EnsembleThreads really generates an ensemble whereas Ensemble{C,G}PU generate repeated copies of identical runs.

am i doing it wrong?

using DifferentialEquations, DiffEqBiological, DiffEqGPU, CuArrays, Plots
CuArrays.allowscalar(false)

# build reaction network step by step
rn = @empty_reaction_network
addparam!(rn, :a)
addparam!(rn, :b)
addspecies!(rn, :g0)
addspecies!(rn, :g1)
addreaction!(rn, :a, :(g0 --> g1))
addreaction!(rn, :a, :(g1 --> 2g0))
addodes!(rn)
# have a look
species(rn)
odeexprs(rn)

# initial cond
n = length(species(rn))
u0 = zeros(Float32, n)
u0[1] = 1
# time span
tspan = (0.f0, 80.f0)
# initial parameter vector
pars = Float32[0.1, 0.2]
# build the problem
prob = ODEProblem(rn, u0, tspan, pars)
prob.p
# solve one instance for testing
sol = solve(prob, Tsit5());
plot(sol)
# prepare random modifications of the parameters
using Random, Distributions
rng = Random.seed!(124)
lnd = LogNormal(0., .2)
rand(rng, lnd)
# implement random parameter modification
function prob_func(prob, i, repeat)
    pp = copy(prob.p)
    pp .*= rand(rng, lnd, 2)
    remake(prob; p=pp)
end
# test copying
pp = prob_func(prob, 0, 0)
(prob.p, pp.p)
# finally, make an ensemble problem
ens_prob = EnsembleProblem(prob, prob_func=prob_func)
# solutions with different ensemble algorithms
sim_t = solve(ens_prob, Tsit5(), EnsembleThreads(), trajectories=10)
sim_ca = solve(ens_prob, Tsit5(), EnsembleCPUArray(), trajectories=10)
sim_ga = solve(ens_prob, Tsit5(), EnsembleGPUArray(), trajectories=10);
# test
plot(sim_t)
plot(sim_ca)
plot(sim_ga)

Iteration doesn't work in RHS function

Full MWE

using OrdinaryDiffEq
using DifferentialEquations.EnsembleAnalysis
using DiffEqGPU
using CuArrays

Base.@propagate_inbounds function lotka_volterra2(du,u,p,t)
  x,y = u
  du[1] =  p[1]*x - p[2]*x*y
  du[2] = -p[3]*y + p[4]*x*y
return nothing
end
u0 = [1.0,1.0]
tspan = (0.0,10.0)
p = [1.0,1.0,1.0,1.0]

prob = ODEProblem(lotka_volterra2,u0,tspan,p)

ensemble_problem = EnsembleProblem(prob)
ensemble_solution = solve(ensemble_problem,Tsit5(),EnsembleGPUArray(),trajectories = 10)

Segfault when using ModelingToolkit

Is MTK compatible with DiffEqGPU?

using ModelingToolkit
using OrdinaryDiffEq
using DiffEqGPU

@parameters t σ ρ β
@variables x(t) y(t) z(t)
@derivatives D'~t

eqs = [D(x) ~ σ*(y-x),
       D(y) ~ x*(ρ-z)-y,
       D(z) ~ x*y - β*z]

sys = ODESystem(eqs)

u0 = [x => 1.0f0
      y => 0.0f0
      z => 0.0f0]

p  = [σ => 10.0f0
      ρ => 28.0f0
      β => 8f0/3f0]
tspan = (0.0f0,100.0f0)
prob = ODEProblem(sys,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,p=rand(Float32,3).*prob.p)
monteprob = EnsembleProblem(prob, prob_func = prob_func)
sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=10,saveat=1.0f0)