
CUDArt.jl's People

Contributors

adambrewster, bpiwowar, denizyuret, emreyolcu, femtocleaner[bot], jobjob, kristofferc, lucasb-eyer, maleadt, malmaud, mikeinnes, moon6pence, musically-ut, musm, mweastwood, simondanisch, timholy, vchuravy


CUDArt.jl's Issues

No method matching reset(::CUDAdrv.CuPrimaryContext)

julia> using CUDArt

julia> result = devices(dev->true) do devlist
           # Code that does GPU computations
       end
WARNING: destroy(ctx::CuContext) is deprecated, use destroy!(ctx) instead.
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
 [2] destroy(::CUDAdrv.CuContext) at ./deprecated.jl:57
 [3] device_reset(::Int64) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:42
 [4] close(::Array{Int64,1}) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:186
 [5] devices(::##1#3, ::Array{Int64,1}) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:84
 [6] devices(::Function, ::Function) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:74
 [7] eval(::Module, ::Any) at ./boot.jl:235
 [8] eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:66
 [9] macro expansion at ./REPL.jl:97 [inlined]
 [10] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73
while loading no file, in expression starting on line 0
ERROR: MethodError: no method matching reset(::CUDAdrv.CuPrimaryContext)
Closest candidates are:
  reset(::Base.LibuvStream) at stream.jl:1129
  reset(::T<:IO) where T<:IO at io.jl:622
Stacktrace:
 [1] device_reset(::Int64) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:44
 [2] close(::Array{Int64,1}) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:186
 [3] devices(::##1#3, ::Array{Int64,1}) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:84
 [4] devices(::Function, ::Function) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:74

julia>

This seems like a mismatch between CUDArt and the CUDAdrv API. I get this on both Julia 0.5.2 and 0.6.

Precompile Error

This error is baffling me, and I am not even sure it is related to CUDArt, but I get the following on Julia 0.6 (compiled myself):

julia> using CUDArt
INFO: Precompiling module CUDArt.
WARNING: `@windows` is deprecated, use `@static is_windows()` instead
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:64
 [2] @windows(::ANY, ::ANY) at ./deprecated.jl:446
 [3] include_from_node1(::String) at ./loading.jl:539
 [4] include(::String) at ./sysimg.jl:14
 [5] include_from_node1(::String) at ./loading.jl:539
 [6] include(::String) at ./sysimg.jl:14
 [7] anonymous at ./<missing>:2
 [8] eval(::Module, ::Any) at ./boot.jl:236
 [9] process_options(::Base.JLOptions) at ./client.jl:279
 [10] _start() at ./client.jl:368
while loading /home/ju17693/.julia/v0.6/CUDArt/src/libcudart-6.5.jl, in expression starting on line 23
WARNING: Base.WORD_SIZE is deprecated.
  likely near /home/ju17693/.julia/v0.6/CUDArt/src/libcudart-6.5.jl:36
ERROR: LoadError: LoadError: LoadError: UndefVarError: textureReference not defined
Stacktrace:
 [1] include_from_node1(::String) at ./loading.jl:539
 [2] include(::String) at ./sysimg.jl:14
 [3] include_from_node1(::String) at ./loading.jl:539
 [4] include(::String) at ./sysimg.jl:14
 [5] include_from_node1(::String) at ./loading.jl:539
 [6] include(::String) at ./sysimg.jl:14
 [7] anonymous at ./<missing>:2
while loading /home/ju17693/.julia/v0.6/CUDArt/src/../gen-6.5/gen_libcudart.jl, in expression starting on line 515
while loading /home/ju17693/.julia/v0.6/CUDArt/src/libcudart-6.5.jl, in expression starting on line 44
while loading /home/ju17693/.julia/v0.6/CUDArt/src/CUDArt.jl, in expression starting on line 27
ERROR: Failed to precompile CUDArt to /home/ju17693/.julia/lib/v0.6/CUDArt.ji.
Stacktrace:
 [1] compilecache(::String) at ./loading.jl:673
 [2] require(::Symbol) at ./loading.jl:460

Here is my Julia versioninfo:

julia> versioninfo()
Julia Version 0.6.0-dev.2360
Commit 8c2d9db* (2017-01-25 20:04 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Support Windows

Hi, all.

I'm trying to run CUDArt.jl on Windows. I'm comfortable using Julia on Linux systems, but supporting many platforms is always a good thing. Furthermore, my customers in the optics lab are Muggles - they are not familiar with programming or Linux systems.

OK, idle talk aside: it should be simple to make this run on Windows - mainly adding the correct paths and names of the CUDA runtime and driver DLLs.

For CUDA 5.0 64bit

  • CUDA driver API : C:\Windows\system32\nvcuda.dll
  • CUDA runtime API : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\bin\cudart64_50_35.dll

The driver API is simple, but the DLL name of the runtime API is more complicated; I think it has to be checked for several CUDA releases.
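
A hedged sketch of what the lookup could look like, assuming the toolkit's bin directory is pointed to by the CUDA_PATH environment variable (which the CUDA installer normally sets); every DLL name here other than cudart64_50_35 is a guess that would need to be checked against the corresponding release:

const cuda_bin = joinpath(get(ENV, "CUDA_PATH", ""), "bin")
const libcudart = Libdl.find_library(
    ["cudart64_50_35", "cudart64_55", "cudart64_60", "cudart64_65"],
    [cuda_bin])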

I checked that both the driver and runtime APIs load properly; the remaining TODOs are:

  • Check the library name of the runtime API in other CUDA releases (maybe 4.x, 5.0, 5.5, 6.0 will be enough)
  • Build wrapcuda.dll in a MinGW + CUDA environment
  • Confirm all test cases pass.

I will log progress here, and I hope to send my first PR to the repository soon. Thanks!

Passing arbitrary struct arguments by value to kernels with StrPack?

If I understand correctly, I can use StrPack to ensure that a Julia value of a composite type can be converted to/from a string of bytes consistent with the binary representation expected in C code. I'm not too sure, but IIUC just passing a pointer to a Julia object is not (?) necessarily safe.

But I can't figure out how to launch a kernel using the binary representation of a Julia composite type. In other words, something like the following (which is clearly wrong, since it can't be distinguished from passing a host pointer to a kernel, and I also get a completely different error, shown below):

@struct type A; x :: Cint; end
iostr = IOBuffer(); pack(iostr, A(1))
...
launch(..., (iostr.data,))

where in C code:

struct A { int x; }
__global__ void kernel_fun(A a);

I looked in execute.jl, and it seems this is not implemented: https://github.com/JuliaGPU/CUDArt.jl/blob/master/src/execute.jl#L2. I get an error that rawpointer is undefined for Array{UInt8,1}, which is what the IOBuffer's data field holds.

I think whenever a value of a composite type is passed to a kernel, it might make sense to pass a pointer to its binary representation to cuLaunchKernel, so that the argument gets passed by value to the kernel.

[I should add] that calling cuLaunchKernel with ccall and constructing kernel arguments myself seems to work fine so far (although I haven't tested everything very much yet).
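For reference, a minimal sketch of that manual approach, under several assumptions: fun_handle stands in for a CUfunction handle obtained from the driver API, libcuda for the loaded driver library, and argument size/alignment handling is glossed over. cuLaunchKernel takes a void** array of per-argument pointers, so handing it a pointer to the packed bytes is what passes the struct by value.

using StrPack

@struct type A
    x::Cint
end

iostr = IOBuffer()
pack(iostr, A(1))
argbytes = takebuf_array(iostr)               # byte representation of A(1) (Julia 0.4/0.5)

# fun_handle (a CUfunction, Ptr{Void}) and libcuda are placeholders, not real names from CUDArt
kernelparams = Ptr{Void}[pointer(argbytes)]   # one entry per kernel argument
ccall((:cuLaunchKernel, libcuda), Cint,
      (Ptr{Void}, Cuint, Cuint, Cuint, Cuint, Cuint, Cuint, Cuint,
       Ptr{Void}, Ptr{Ptr{Void}}, Ptr{Ptr{Void}}),
      fun_handle, 1, 1, 1, 1, 1, 1, 0, C_NULL, kernelparams, C_NULL)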

Intermittent GC-related test failure (`isempty(cuda_ptrs)`)

@maleadt noticed that every so often the CI would fail due to a test error in gc.jl.
The specific test that fails is: https://github.com/JuliaGPU/CUDArt.jl/blob/7c7157019fa69539b2547b5317d9770c89d7d462/test/gc.jl#L57

g = CUDArt.CudaArray(a)
# In case the next gc() test yields an error, the next two lines let
# us do some archaeology
dictcopy = deepcopy(CUDArt.cuda_ptrs)
gptrcopy = copy(pointer(g))
gc()   # Check that this doesn't delete the new g
@test !isempty(CUDArt.cuda_ptrs)
h_g = CUDArt.to_host(g)
CUDArt.free(g)

With a little bit of debug information added and TRACE=1 I get:
while TRACE=1 julia --compilecache=no -e 'Pkg.test("CUDArt")'; do :; done

TRACE: Malloc CudaPtr Ptr{Void} @0x0000002303ee0000
TRACE: Finalizing CuContext at Ptr{Void} @0x00007f06843d85d0
TRACE: Not destroying context CUDAdrv.CuContext(Ptr{Void} @0x0000000003f03500,false,false) because we don't own it
TRACE: Freeing CudaPtr CUDArt.CudaPtr{Float64}(Ptr{Float64} @0x0000002303ee0000,CUDAdrv.CuContext(Ptr{Void} @0x0000000003f03500,false,false))

I am currently travelling but should get to this before long.

warning in julia v0.4

I get this with the latest Julia and don't quite know how to fix it as I think gen_libcudart_h.jl is an automatically generated file.

WARNING: uint32(x) is deprecated, use UInt32(x) instead.
while loading /home/nlg-05/dy_052/kuparser/profile/v0.4/CUDArt/src/../gen-6.5/gen_libcudart_h.jl, in expression starting on line 41

[Question] GPU->CPU Copy Speed?

I had a question about using CUDArt.jl in conjunction with CUBLAS.jl. I'm hoping to find out what I can do to improve the speed of copying memory from the GPU device back to CPU-accessible memory. I'm not sure whether there is something I'm not implementing properly or something intrinsic that I'm not properly understanding. I've conducted a series of tests using the BLAS Level-3 function gemm, which I detail below (exported from my IJulia notebook to Markdown). I should note that I'm using Julia v0.4-rc2 for these experiments.

Specifically, I've observed that for the implementation I detail below, the GPU->CPU copy! is about two orders of magnitude slower than the CPU->GPU copy. Surely I've done something wrong, right? If anyone could enlighten me about what I'm doing wrong, I'd be elated.

Thanks!

Testing CuBLAS and CUDArt for Julia

After finally getting NVCC to work on OSX, we can start using the CUDA-themed BLAS packages written for Julia. In this notebook we will document how to utilize the necessary datatypes and show comparisons between the CPU and GPU implementations of common BLAS functions.

I. Calling and using the Libraries

Let's first make sure that we have updated and built the libraries. Because of the recent changes in Julia between v0.3 and v0.4, we expect quite a number of warnings, and even errors, to pop up during the testing phase. However, the core functionality of the packages should be there.

# Update and build
Pkg.update()
Pkg.build("CUDArt")
Pkg.build("CUBLAS")
using CUDArt
using CUBLAS
using Base.LinAlg.BLAS

II. Experiment Parameters

We will focus our comparisons on the BLAS function gemm which computes
$$ \mathbf{C} \leftarrow \alpha \mathbf{A}\mathbf{B} + \beta \mathbf{C}.$$
We will assume that all of these matrices are dense and real. For our experiments we will set
$\mathbf{A}: (n \times m)$, $\mathbf{B}: (m \times k)$, $\mathbf{C}: (n \times k)$, and
$\alpha = \beta = 1.0$.

# Dimensions
n = 256
m = 784
k = 100
# Scalings
a = 1.0
b = 1.0
# Initialization
A = randn(n,m);
B = randn(m,k);
C = randn(n,k);
whos()
                             A   1568 KB     256x784 Array{Float64,2} : [1.7596…
                             B    612 KB     784x100 Array{Float64,2} : [-1.596…
                          Base  26665 KB     Module : Base
                             C    200 KB     256x100 Array{Float64,2} : [-0.344…
                        CUBLAS    545 KB     Module : CUBLAS
                        CUDArt    573 KB     Module : CUDArt
                        Compat     58 KB     Module : Compat
                          Core   3218 KB     Module : Core
                DataStructures    337 KB     Module : DataStructures
                        IJulia    368 KB     Module : IJulia
                IPythonDisplay     26 KB     Module : IPythonDisplay
                          JSON    195 KB     Module : JSON
                          Main  33724 KB     Module : Main
                       MyGemm!    966 bytes  Function : MyGemm!
                  MyTimedGemm!     15 KB     Function : MyTimedGemm!
                        Nettle    187 KB     Module : Nettle
                           ZMQ     80 KB     Module : ZMQ
                             a      8 bytes  Float64 : 1.0
                             b      8 bytes  Float64 : 1.0
                           d_A     40 bytes  CUDArt.CudaArray{Float64,2}(CUDArt…
                           d_B     40 bytes  CUDArt.CudaArray{Float64,2}(CUDArt…
                           d_C     40 bytes  CUDArt.CudaArray{Float64,2}(CUDArt…
                           dev      8 bytes  Int64 : 0
                             k      8 bytes  Int64 : 100
                             m      8 bytes  Int64 : 784
                             n      8 bytes  Int64 : 256
                        result      0 bytes  Void : nothing

III. Baseline Performance

We will now look at the timing of the base OpenBLAS implementation of gemm, which runs on the CPU, alone.

# Warmup
gemm!('N','N',a,A,B,b,C);
gemm!('N','N',a,A,B,b,C);
# Time: 5 runs
@time gemm!('N','N',a,A,B,b,C);
@time gemm!('N','N',a,A,B,b,C);
@time gemm!('N','N',a,A,B,b,C);
@time gemm!('N','N',a,A,B,b,C);
@time gemm!('N','N',a,A,B,b,C);
  0.000769 seconds (4 allocations: 160 bytes)
  0.000797 seconds (4 allocations: 160 bytes)
  0.000810 seconds (4 allocations: 160 bytes)
  0.000917 seconds (4 allocations: 160 bytes)
  0.001528 seconds (4 allocations: 160 bytes)

IV. CUDArt Datatypes

Our first step in being able to use CuBLAS is to initialize our GPU device and make on-device copies of the data structures we're interested in. Below we detail how to fence off the GPU code and ensure that proper garbage collection is performed on the device via CUDArt.

# Assign Device
device(0)
device_reset(0) 
device(0)
# Create and Copy "A"
d_A = CudaArray(A)
copy!(d_A,A)
# Create and Copy "B"
d_B = CudaArray(B)
copy!(d_B,B)
# Create and Copy "C"
d_C = CudaArray(C)
copy!(d_C,C)
# Show 
println("CUDArt Data Pointer Descriptions")
println(d_A)
println(d_B)
println(d_C)
CUDArt Data Pointer Descriptions
CUDArt.CudaArray{Float64,2}(CUDArt.CudaPtr{Float64}(Ptr{Float64} @0x0000000d00a80000),(256,784),0)
CUDArt.CudaArray{Float64,2}(CUDArt.CudaPtr{Float64}(Ptr{Float64} @0x0000000d00c20000),(784,100),0)
CUDArt.CudaArray{Float64,2}(CUDArt.CudaPtr{Float64}(Ptr{Float64} @0x0000000d00d20000),(256,100),0)

V. CuBLAS Timings

Now, let's look at the time requirements for just running gemm. Note that this does not include the time spent copying memory to and from the device. For now, let's limit ourselves to a direct comparison of the BLAS function implementations alone.

# Warmup
CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
# Time: 5 runs
@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
  0.000033 seconds (24 allocations: 1.016 KB)
  0.000053 seconds (24 allocations: 1.016 KB)
  0.000053 seconds (24 allocations: 1.016 KB)
  0.000045 seconds (24 allocations: 1.016 KB)
  0.000037 seconds (24 allocations: 1.016 KB)

So, we can see from the above that we are potentially looking at an order-of-magnitude improvement in computation time.

# End Session
device_reset(0)

VI. CuBLAS Timings: With Memory Copying

We will now look at the situation where we declare a local function that performs all of the host-to-device and device-to-host memory copying required for the GPU implementation. Our goal is to see exactly how much advantage we retain in a realistic comparison.

function MyTimedGemm!(tA,tB,a,A,d_A,B,d_B,b,C,d_C)
    # Copy to device
    @printf "(A->d_A)       " 
        @time copy!(d_A,A)
    @printf "(B->d_B)       " 
        @time copy!(d_B,B)
    @printf "(C->d_C)       " 
        @time copy!(d_C,C)
    # Run device-level BLAS
    @printf "(CUBLAS.gemm!) "
        @time CUBLAS.gemm!(tA,tB,a,d_A,d_B,b,d_C)
    # Gather result
    @printf "(d_C->C)       "
        @time copy!(C,d_C)
end

device(0)
device_reset(0)
device(0)

# These pointers can be pre-allocated
d_A = CudaArray(A)
d_B = CudaArray(B)
d_C = CudaArray(C)

# Warmup
println("Warmups============")
MyTimedGemm!('N','N',a,A,d_A,B,d_B,b,C,d_C);
MyTimedGemm!('N','N',a,A,d_A,B,d_B,b,C,d_C);
println("Actual=============")
@time MyTimedGemm!('N','N',a,A,d_A,B,d_B,b,C,d_C);
Warmups============
(A->d_A)         0.000204 seconds
(B->d_B)         0.000317 seconds
(C->d_C)         0.000075 seconds
(CUBLAS.gemm!)   0.078434 seconds (20 allocations: 880 bytes)
(d_C->C)         0.006428 seconds
(A->d_A)         0.000442 seconds
(B->d_B)         0.000361 seconds
(C->d_C)         0.000063 seconds
(CUBLAS.gemm!)   0.000043 seconds (20 allocations: 880 bytes)
(d_C->C)         0.006849 seconds
Actual=============
(A->d_A)         0.000214 seconds
(B->d_B)         0.000307 seconds
(C->d_C)         0.000076 seconds
(CUBLAS.gemm!)   0.000038 seconds (20 allocations: 880 bytes)
(d_C->C)         0.007070 seconds
  0.008016 seconds (199 allocations: 7.813 KB)

We can see that the act of reading the matrix $\mathbf{C}$ back from the device to the CPU actually incurs a huge cost. In fact, the cost is so high as to entirely remove any time advantage we obtain from the CuBLAS implementation of gemm.
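
A hedged aside on the likely mechanism: cuBLAS launches are asynchronous, so the d_C->C copy! is the first call that has to wait for the kernel to finish, and its timing absorbs the gemm! time. A sketch of separating the two, assuming CUDArt exposes device_synchronize() (a wrapper around cudaDeviceSynchronize):

CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C)
@time begin
    CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C)
    CUDArt.device_synchronize()   # stop the clock only after the kernel has completed
end
@time copy!(C,d_C)                # now this measures only the transfer itself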

Spurious failures in cudacopy! with "invalid argument" error

This is probably related to #17 and how finalizers work.

The following function:

function test1()
  devices(dev->true, nmax=1) do devlist
    dev = devlist[1]
    device(dev)

    sz = (801, 802)
    x = CudaPitchedArray[]
    for i=1:10
      push!(x, CudaPitchedArray(Float32, sz))
    end
    for i=1:10
      @time for j=1:length(x)
        to_host(x[j])
      end
      println("Finished iteration $i.")
    end
  end
end

produces the following error the second time it is run. If I run gc() between the two runs, there is no error.

julia> versioninfo()
Julia Version 0.4.0-dev+2876
Commit f164ac1 (2015-01-22 22:58 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> Test.test1()
elapsed time: 0.014716371 seconds (25716080 bytes allocated)
Finished iteration 1.
elapsed time: 0.071074413 seconds (25716080 bytes allocated, 74.17% gc time)
Finished iteration 2.
elapsed time: 0.010117429 seconds (25716080 bytes allocated)
Finished iteration 3.
elapsed time: 0.063922907 seconds (25716080 bytes allocated, 81.96% gc time)
Finished iteration 4.
elapsed time: 0.010150908 seconds (25716080 bytes allocated)
Finished iteration 5.
elapsed time: 0.06263613 seconds (25716080 bytes allocated, 83.62% gc time)
Finished iteration 6.
elapsed time: 0.010153057 seconds (25716080 bytes allocated)
Finished iteration 7.
elapsed time: 0.062723099 seconds (25716080 bytes allocated, 83.70% gc time)
Finished iteration 8.
elapsed time: 0.06263111 seconds (25716080 bytes allocated, 83.61% gc time)
Finished iteration 9.
elapsed time: 0.01013381 seconds (25716080 bytes allocated)
Finished iteration 10.

julia> Test.test1()
WARNING: CUDA error triggered from:

 in checkerror at /***/.julia/v0.4/CUDArt/src/libcudart-6.5.jl:15
 in cudacopy! at /***/.julia/v0.4/CUDArt/src/arrays.jl:313
 in cudacopy! at /***/.julia/v0.4/CUDArt/src/arrays.jl:288
 in copy! at /***/.julia/v0.4/CUDArt/src/arrays.jl:282
 in to_host at /***/.julia/v0.4/CUDArt/src/arrays.jl:87
 in anonymous at /***/test.jl:17
 in devices at /***/.julia/v0.4/CUDArt/src/device.jl:57
 in devices at /***/.julia/v0.4/CUDArt/src/device.jl:49
 in test1 at /***/test.jl:6
ERROR: "invalid argument"
 in checkerror at /***/.julia/v0.4/CUDArt/src/libcudart-6.5.jl:16
 in cudacopy! at /***/.julia/v0.4/CUDArt/src/arrays.jl:313
 in cudacopy! at /***/.julia/v0.4/CUDArt/src/arrays.jl:288
 in copy! at /***/.julia/v0.4/CUDArt/src/arrays.jl:282
 in to_host at /***/.julia/v0.4/CUDArt/src/arrays.jl:87
 in anonymous at /***/test.jl:17
 in devices at /***/.julia/v0.4/CUDArt/src/device.jl:57
 in devices at /***/.julia/v0.4/CUDArt/src/device.jl:49
 in test1 at /***/test.jl:6

This is with the git head version of CUDArt, and probably has something to do with a garbage collection pass trying to collect a Cuda pointer that came from a previous device context (before device_reset called cudaDeviceReset), so that the pointer is invalid in the new device context.

This is very irritating when testing CUDA code in the REPL, where the same function is run over and over again (sometimes not even correctly), so resetting everything reliably is a must.

CUDArt assumptions not robust

There is some path handling in the following file that produces errors for standard 64-bit CUDA installations:
/root/.julia/v0.5/CUDArt/src/CUDArt.jl

Symptom: Pkg.add("CUDArt") fails

In particular the line:
const libcuda = Libdl.find_library(["libcuda"], ["/usr/lib/", "/usr/local/cuda/lib"])

wants to be
const libcuda = Libdl.find_library(["libcudart"], ["/usr/lib/", "/usr/local/cuda/lib64"])

Three issues:

  • Since I don't have a 32-bit CUDA install, I can't make a more robust suggestion than to say that the CUDA_HOME environment variable should be checked (a sketch follows the list)
  • I suspect the find_library call can take additional paths, so cuda/lib64 could be put ahead of cuda/lib
  • As of at least CUDA 8, it's now libcudart instead of libcuda
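
A minimal sketch along those lines, assuming CUDA_HOME as the environment variable and keeping the existing fallback directories; the names and search order would still need refining:

const cuda_home = get(ENV, "CUDA_HOME", "/usr/local/cuda")
const libcudart = Libdl.find_library(
    ["libcudart"],
    [joinpath(cuda_home, "lib64"), joinpath(cuda_home, "lib"), "/usr/lib/"])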

julia> versioninfo()
Julia Version 0.5.1-pre+31
Commit 6a1e339 (2016-11-17 17:50 UTC)
Platform Info:
System: Linux (powerpc64le-linux-gnu)
CPU: unknown
WORD_SIZE: 64
BLAS: libopenblas (NO_AFFINITY POWER8)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.9.0 (ORCJIT, pwr8)

Can CUDArt be loaded with VS2015?

I have the following problem when I try to add CUDArt.

I have already installed VS2015, but it seems that I need VS2013, 2012, or 2010. Is it necessary for me to install another version of Visual Studio? If not, what should I do?

Pkg.add("CUDArt")
INFO: Cloning cache of CUDArt from git://github.com/JuliaGPU/CUDArt.jl.git
INFO: Installing CUDArt v0.2.3
INFO: Building CUDArt
===============================[ ERROR: CUDArt ]================================

LoadError: Cannot find proper Visual Studio installation. VS 2013, 2012, or 2010 is required.
while loading C:\Users\Miller.julia\v0.4\CUDArt\deps\build.jl, in expression starting on line 10

================================[ BUILD ERRORS ]================================

WARNING: CUDArt had build errors.

  • packages with build errors remain installed in C:\Users\Miller.julia\v0.4
  • build the package(s) and all dependencies with Pkg.build("CUDArt")
  • build a single package by running its deps/build.jl script

INFO: Package database updated

Updated build script for Visual Studio 2017 but getting compile errors

I updated the build script to target the changed build-directory layout and so forth for Visual Studio 2017:

using Compat

if is_windows()

    vswhere = download("https://github.com/Microsoft/vswhere/releases/download/1.0.58/vswhere.exe")

    vs_install_path = chomp(readstring(`$vswhere  -latest -property installationPath`))
    if !isdir(vs_install_path)
        error("Cannot find a proper Visual Studio installation. Make sure Visual Studio is installed.")
    end

    vs_version_major = parse(split(chomp(readstring(`$vswhere  -latest -property installationVersion`)), '.')[1])
    if vs_version_major >= 15
        vs_cmd_prompt = joinpath(vs_install_path, "VC", "Auxiliary", "Build", "vcvarsall.bat")
    else
        vs_cmd_prompt = joinpath(vs_install_path, "VC", "vcvarsall.bat")
    end

    # check whether 32 or 64 bit archtecture
    # NOTE: Actually, nvcc in x86 visual studio command prompt doesn't make 32-bit binary
    #       It depends on whether CUDA toolkit is 32bit or 64bit
    if Int == Int64
        arch = "amd64"
    else
        arch = "x86"
    end

    # Run nmake -f Windows.mk under visual studio command prompt
    cd(@__DIR__) do
        run(`cmd /C "$vs_cmd_prompt" $arch \& nmake -f Windows.mk clean`)
        run(`cmd /C "$vs_cmd_prompt" $arch \& nmake -f Windows.mk`)
    end

    cd(joinpath(@__DIR__, "..", "test")) do
        run(`cmd /C "$vs_cmd_prompt" $arch \& nmake -f Windows.mk clean`)
        run(`cmd /C "$vs_cmd_prompt" $arch \& nmake -f Windows.mk`)
    end
else # for linux or mac
    cd(@__DIR__) do
        run(`make clean`)
        run(`make`)
    end
    cd(joinpath(@__DIR__, "..", "test")) do
        run(`make clean`)
        run(`make`)
    end
end

This, however, gives the following problem:

julia> include("C:\\Users\\Mus\\.julia\\v0.5\\CUDArt\\deps\\build.jl")
**********************************************************************
** Visual Studio 2017 Developer Command Prompt v15.0.26228.12
** Copyright (c) 2017 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'

Microsoft (R) Program Maintenance Utility Version 14.10.25017.0
Copyright (C) Microsoft Corporation.  All rights reserved.

        del /Q libwrapcuda.dll libwrapcuda.lib libwrapcuda.exp
Could Not Find C:\Users\Mus\.julia\v0.5\CUDArt\deps\libwrapcuda.dll
        del /Q utils.ptx
Could Not Find C:\Users\Mus\.julia\v0.5\CUDArt\deps\utils.ptx
**********************************************************************
** Visual Studio 2017 Developer Command Prompt v15.0.26228.12
** Copyright (c) 2017 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'

Microsoft (R) Program Maintenance Utility Version 14.10.25017.0
Copyright (C) Microsoft Corporation.  All rights reserved.

        nvcc --shared --compiler-options="/wd4819" --linker-options= wrapcuda.c -o libwrapcuda.dll
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Microsoft (R) C/C++ Optimizing Compiler Version 19.10.25017 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

tmpxft_00000a30_00000000-1.cpp
nvcc fatal   : Host compiler targets unsupported OS.
NMAKE : fatal error U1077: '"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.EXE"' : return code '0x1'
Stop.
ERROR: LoadError: failed process: Process(`cmd /C 'C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvarsall.bat' amd64 & nmake -f Windows.mk`, ProcessExited(2)) [2]
 in pipeline_error(::Base.Process) at .\process.jl:616
 in run(::Cmd) at .\process.jl:592
 in cd(::##1#4, ::String) at .\file.jl:48
 in include_from_node1(::String) at .\loading.jl:488
while loading C:\Users\Mus\.julia\v0.5\CUDArt\deps\build.jl, in expression starting on line 29

Types with CudaArray elements cannot be saved to JLD because of pointer exception

I'd like to fix this so people can save and load machine learning models that use CudaArrays without having to explicitly copy everything to the CPU. Is the right way to overload serialize/deserialize? Or is it to introduce a new CPU array type that CudaArrays know how to convert themselves to and from during load/save? Or is there some other way?
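
One hedged sketch of the second option: give JLD a plain host-side surrogate type and hook into its custom-serialization mechanism, assuming the writeas/readas hooks that JLD.jl provides (untested):

using JLD, CUDArt

immutable CudaArraySurrogate{T,N}
    data::Array{T,N}
end

JLD.writeas{T,N}(a::CUDArt.CudaArray{T,N}) = CudaArraySurrogate(CUDArt.to_host(a))
JLD.readas(s::CudaArraySurrogate) = CUDArt.CudaArray(s.data)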

Failed to install CUDArt on Windows 7

I have Julia 0.4.1 and CUDA 6.5 installed on my PC.
When I tried to install CUDArt.jl, I got the following errors:

    nvcc --shared --compiler-options="/wd4819" --linker-options= wrapcuda.c -o libwrapcuda.dll
Internal error
NMAKE : fatal error U1077: '"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\nvcc.EXE"' : return code '0xc0000005'
Stop.
===============================[ ERROR: CUDArt ]================================

LoadError: failed process: Process(`cmd /C 'C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\Tools\..\..\VC\vcvarsall.bat' amd64 & nmake -f Windows.mk`, ProcessExited(2)) [2]
while loading C:\Users\liu.julia\v0.4\CUDArt\deps\build.jl, in expression starting on line 23

Has anyone had the same problem? Any help would be appreciated.
Thank you so much.

Gengdai

clone into "CUDArt" folder instead of "CUDArt.jl"

Currently the package gets cloned into a "CUDArt.jl" directory and fails to build. It turns out the build code looks for a "CUDArt" folder. Renaming the folder after cloning is enough to allow me to build successfully.

Tests fail on Windows with 0.6

Package builds without complaints but tests fail:

<a few warnings suppressed here>

ERROR: LoadError: LoadError: AssertionError: !(isactive(pctx))
Stacktrace:
 [1] unsafe_reset!(::CUDAdrv.CuPrimaryContext, ::Bool) at C:\Users\amellnik\.julia\v0.6\CUDAdrv\src\context\primary.jl:104
 [2] reset(::CUDAdrv.CuPrimaryContext) at .\deprecated.jl:59
 [3] device_reset(::Int64) at C:\Users\amellnik\.julia\v0.6\CUDArt\src\device.jl:44
 [4] macro expansion at C:\Users\amellnik\.julia\v0.6\CUDArt\test\gc.jl:22 [inlined]
 [5] anonymous at .\<missing>:?
 [6] include_from_node1(::String) at .\loading.jl:569
 [7] include(::String) at .\sysimg.jl:14
 [8] include_from_node1(::String) at .\loading.jl:569
 [9] include(::String) at .\sysimg.jl:14
 [10] process_options(::Base.JLOptions) at .\client.jl:305
 [11] _start() at .\client.jl:371
while loading C:\Users\amellnik\.julia\v0.6\CUDArt\test\gc.jl, in expression starting on line 8
while loading C:\Users\amellnik\.julia\v0.6\CUDArt\test\runtests.jl, in expression starting on line 1
===============================[ ERROR: CUDArt ]================================

failed process: Process(`'C:\Users\amellnik\AppData\Local\Julia-0.6.0\bin\julia.exe' -Cx86-64 '-JC:\Users\amellnik\AppData\Local\Julia-0.6.0\lib\julia\sys.dll' --compile=yes --depwarn=yes --check-bounds=yes --code-coverage=none --color=yes --compilecache=yes 'C:\Users\amellnik\.julia\v0.6\CUDArt\test\runtests.jl'`, ProcessExited(1)) [1]

================================================================================
CUDArt had test errors

Stacktrace:
 [1] #test#62(::Bool, ::Function, ::Array{AbstractString,1}) at .\pkg\entry.jl:757
 [2] (::Base.Pkg.Entry.#kw##test)(::Array{Any,1}, ::Base.Pkg.Entry.#test, ::Array{AbstractString,1}) at .\<missing>:0
 [3] (::Base.Pkg.Dir.##4#7{Array{Any,1},Base.Pkg.Entry.#test,Tuple{Array{AbstractString,1}}})() at .\pkg\dir.jl:36
 [4] cd(::Base.Pkg.Dir.##4#7{Array{Any,1},Base.Pkg.Entry.#test,Tuple{Array{AbstractString,1}}}, ::String) at .\file.jl:59
 [5] #cd#1(::Array{Any,1}, ::Function, ::Function, ::Array{AbstractString,1}, ::Vararg{Array{AbstractString,1},N} where N) at .\pkg\dir.jl:36
 [6] (::Base.Pkg.Dir.#kw##cd)(::Array{Any,1}, ::Base.Pkg.Dir.#cd, ::Function, ::Array{AbstractString,1}, ::Vararg{Array{AbstractString,1},N} where N) at .\<missing>:0
 [7] #test#3(::Bool, ::Function, ::String, ::Vararg{String,N} where N) at .\pkg\pkg.jl:276
 [8] test(::String, ::Vararg{String,N} where N) at .\pkg\pkg.jl:276
 [9] include_string(::String, ::String) at .\loading.jl:515

This stacktrace is a bit cryptic, but it appears to stem from the next-to-last line in this basic test:

devlist = CUDArt.devices(dev->true)
for dev in devlist
    CUDArt.device(dev)
    p = CUDArt.malloc(UInt8, 1)
    p2 = CUDArt.malloc(UInt16, 100)
    CUDArt.free(p)
    CUDArt.free(p2)
    a = rand(5,3)
    g = CUDArt.CudaArray(a)
    gp = CUDArt.CudaPitchedArray(a)
    CUDArt.free(g)
    CUDArt.free(gp)
    # Also test finalizer calls
    g = CUDArt.CudaArray(a)
    g = CUDArt.CudaPitchedArray(a)
    CUDArt.device_reset(dev)
end

Unified Memory support

The following code reproduces the Unified Memory example from NVIDIA in Julia:
https://gist.github.com/barche/9cc583ad85dd2d02782642af04f44dd7#file-add_cudart-jl

Kernel run time is the same as with the .cu compiled with nvcc, the nvprof output I get is this:

Time(%)      Time     Calls       Avg       Min       Max  Name
 61.18%  871.81us        11  79.255us  78.689us  79.872us  julia_kernel_add_61609
 38.82%  553.09us        11  50.280us  48.832us  53.344us  julia_kernel_init_61427

I decided to attempt to make the interface a little nicer, by creating a UnifiedArray type modeled after CuDeviceArray, represented in this file together with the test:
https://gist.github.com/barche/9cc583ad85dd2d02782642af04f44dd7#file-unifiedarray-jl

Unfortunately, this runs significantly slower:

Time(%)      Time     Calls       Avg       Min       Max  Name
 56.90%  1.0317ms        11  93.792us  91.520us  100.48us  julia_kernel_add_61608
 41.03%  743.85us        11  67.622us  54.369us  77.472us  julia_kernel_init_61428
  2.07%  37.536us        55     682ns     640ns  1.1520us  [CUDA memcpy HtoD]

Comparing the @code_llvm output for the init kernel after the if shows, for the first version:

  %16 = getelementptr float, float* %1, i64 %15, !dbg !21
  %17 = getelementptr float, float* %0, i64 %15, !dbg !20
  store float 1.000000e+00, float* %17, align 8, !dbg !20, !tbaa !22
  store float 2.000000e+00, float* %16, align 8, !dbg !21, !tbaa !22
  br label %L47, !dbg !21

and for the UnifiedArray version:

  %16 = getelementptr inbounds %UnifiedArray.4, %UnifiedArray.4* %0, i64 0, i32 0, !dbg !23
  %17 = add i64 %12, -1, !dbg !23
  %18 = load float*, float** %16, align 8, !dbg !23, !tbaa !20
  %19 = getelementptr float, float* %18, i64 %17, !dbg !23
  store float 1.000000e+00, float* %19, align 8, !dbg !23, !tbaa !24
  %20 = getelementptr inbounds %UnifiedArray.4, %UnifiedArray.4* %1, i64 0, i32 0, !dbg !26
  %21 = load float*, float** %20, align 8, !dbg !26, !tbaa !20
  %22 = getelementptr float, float* %21, i64 %17, !dbg !26
  store float 2.000000e+00, float* %22, align 8, !dbg !26, !tbaa !24
  br label %L47, !dbg !26

So now for the questions:

  • Where does this difference in performance come from, and is it possible to keep the array abstraction and have it perform as well as the pointer version?
  • Are there any plans to add an array based on the Unified Memory model?
  • Are there any plans to wrap the CUDA8 functions, such as cudaMemPrefetchAsync?

p.s. great job on all these CUDA packages, this was a lot easier to set up than I had anticipated :)

CUDArt fails to build when no CUDA device is present

julia> Pkg.build("CUDArt")
INFO: Building CUDAdrv
INFO: Found libcuda at /usr/bin/../lib/libcuda.so
INFO: CUDAdrv.jl has already been built for this CUDA library, no need to rebuild
INFO: Building CUDArt
NVIDIA: no NVIDIA devices found
=================================================[ ERROR: CUDArt ]=================================================

LoadError: CUDA error 30 calling cudaRuntimeGetVersion
while loading /home/wallnuss/.julia/v0.5/CUDArt/deps/build.jl, in expression starting on line 380

Support various versions of CUDA Toolkit

CUDArt.jl currently uses API bindings generated from CUDA Toolkit 6.5 (libcudart-6.5.jl and the files in gen-6.5/),
and the CUDA 5.0 files seem to be unused (libcudart-5.0.jl and the files in gen/).

I simply tested the current 6.5 API binding with CUDA 5.0/5.5/6.0/6.5 and even the 7.0 RC.
(Yeah, I wanted to spend a boring afternoon just clicking install buttons. This is easier to test on Windows than on Linux.)
The result: the current 6.5 binding works well on all those versions, because we don't use any functions that are new in recent CUDA runtimes.

So my suggestion is:

  • Remove api bindings from CUDA 5.0 and keep only latest api binding.
  • Use find_library for libcudart with version 5.0 - 7.0 and we can say CUDArt.jl supports all these versions.
  • If there's new functionality that uses api from certain version of cuda runtime, we have to check the version of installed cuda toolkit.

What do you think?

gcc5.4.0 support

Hello, I am using Ubuntu 16.04 and the gcc version is 5.4.0. I find that CUDA v8.0.61 has removed the restriction that gcc be no newer than 5.3, so I changed build.jl line 154 to (v"8.0", v"5.4.0"). I think this makes sense so Ubuntu users can make use of your masterpiece. Thanks a lot!

`devcount()` return zero when none available?

Currently, devcount() fails with an error if there's no device available, because in that case it returns the cudaErrorNoDevice error code. It makes sense for most functions to fail when there's no device, but I'd like devcount() to return zero so I can use it to detect the presence of a CUDA-capable GPU.

Or is there another way to determine the availability of a device? (My goal is to not run the GPU-related tests of my package when there's no GPU.)

If you agree with this suggestion, I can write a PR for it; I hope it's as easy as adding that function to this list.
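
In the meantime, a hedged user-side sketch for the "should I run GPU tests?" case, which simply treats any error from devcount() as "no usable device":

function has_cuda_device()
    try
        return CUDArt.devcount() > 0
    catch
        return false   # covers cudaErrorNoDevice as well as a missing driver
    end
end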

CUDArt should not rely on `nvidia-smi` or `nvml` on Mac OSX

This is on master for CUDArt:

julia> versioninfo()
Julia Version 0.6.0-pre.beta.437
Commit 552626cc97 (2017-04-30 05:06 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.5.0)
  CPU: Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libgfortblas
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)


julia> Pkg.build("CUDArt")
INFO: Building CUDAdrv
INFO: CUDAdrv.jl has already been built for this CUDA library, no need to rebuild.
INFO: Building CUDArt
WARNING: NVML not found, resorting to nvidia-smi
===============================[ ERROR: CUDArt ]================================

LoadError: could not spawn `nvidia-smi`: no such file or directory (ENOENT)
while loading /Users/solver/.julia/v0.6/CUDArt/deps/build.jl, in expression starting on line 366

================================================================================

================================[ BUILD ERRORS ]================================

WARNING: CUDArt had build errors.

 - packages with build errors remain installed in /Users/solver/.julia/v0.6
 - build the package(s) and all dependencies with `Pkg.build("CUDArt")`
 - build a single package by running its `deps/build.jl` script

================================================================================

GCC Version On CUDA 8.0

Hi!
So the maximum gcc version that this package's build script will allow with CUDA 8.0 is 5.3.1. However, even Xenial has moved gcc-5 on to at least 5.4.0 (https://packages.ubuntu.com/xenial/gcc-5). This makes it impossible to build on anything newer than a slightly out-of-date 16.04 machine without hacking build.jl. I use 17.04, so for me it's just flat out ridiculous to try to get anything older than 5.4.1 (which works fine with CUDA).

I propose bumping the build.jl requirement from 5.3.1 to 5.4.1. Thanks!

device_reset can throw errors and not reset the device

As implemented now, all wrappers of CUDA functions check the return value for cudaSuccess and throw an error otherwise. This means that if, for some reason, a function called in device_reset (https://github.com/JuliaGPU/CUDArt.jl/blob/master/src/device.jl#L10) returns an error (e.g., a previous kernel launch by the user failed, or, as happens to me irreproducibly and so probably not a bug, one of the cuda_ptrs is "an invalid device pointer"), the cleanup code throws an error and never executes cudaDeviceReset().

Also, I believe that cleaning up pointers before cudaDeviceReset() is unnecessary, because that function's documentation says it releases all resources associated with the current device and process. So it isn't really necessary to free the pointers at all; they should be cleaned up by cudaDeviceReset.

This also means that in devices: https://github.com/JuliaGPU/CUDArt.jl/blob/master/src/device.jl#L59 the finally-clause can throw errors, and the first error interrupts the whole finally-clause, preventing the devices from being reset correctly.

The bug is that code in a finally clause should never throw errors in a way that prevents resources/devices from being cleaned up.
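
A minimal sketch of the behavior argued for above; rt.cudaDeviceReset and the module-level cuda_ptrs dict are assumptions about CUDArt's internals, and the point is only that nothing before the reset is allowed to throw:

function reset_device_robustly(dev::Integer)
    device(dev)
    empty!(CUDArt.cuda_ptrs)       # drop tracking; cudaDeviceReset releases the memory itself
    CUDArt.rt.cudaDeviceReset()    # always reached, since nothing above performs a checked CUDA call
end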

Runtime kernel compilation in CUDA 7

Reference: http://www.soa-world.de/echelon/2015/01/cuda-7-runtime-compilation.html

CUDA 7 is in release-candidate state now, and it has a very interesting feature: runtime kernel compilation.
It works just like OpenCL: we can pass kernel source as a string and get a CuFunction object back.

I think this is very good news for the CUDArt package; we could write CUDA kernels much more easily, rather than using the external nvcc compiler to produce PTX files.

Another use is metaprogramming for generating kernels.
For example, @kk49 presented an interesting concept for writing GPU code for arithmetic operations: https://github.com/kk49/julia-delayed-matrix

julia-delayed-matrix generates PTX code directly, but we could do better by generating .cu code as a string.

I'm an NVIDIA registered developer and have the fresh CUDA 7.0 RC. The first task has to be finding out how the runtime kernel compilation API works. We would also need the corresponding gen-7.0 bindings.

Gsoc proposal

Hello,

I intend to apply to Google Summer of Code with the idea Writing high-performance, multithreaded kernels for image processing, and my proposal will use CUDArt.jl to improve parallel image processing performance.

I'm writing here to ask for some feedback: what should I know about CUDArt.jl before starting work on my project, and could you help point out how I can improve the idea?

This is my project proposal; feel free to have a look and comment on the doc or here in the thread.

Thanks.

Naelson Douglas

void * type cuda malloc does not work

Allocating memory on the GPU works super well for most types, e.g.

julia> CUDArt.malloc( Cint, 10 )
CUDArt.CudaPtr{Int32}(Ptr{Int32} @0x00000013047a0000)

However, when trying to allocate memory for a void pointer (as e.g., the CUB library requires for some functions), the returned pointer is always 0:

julia> CUDArt.malloc( Void, 10 )
CUDArt.CudaPtr{Void}(Ptr{Void} @0x0000000000000000)
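
A hedged workaround sketch: the null result is plausibly because sizeof(Void) == 0, so a zero-byte allocation is requested; asking for the scratch space in bytes avoids that, and the raw pointer can be converted to Ptr{Void} at the ccall boundary (the `ptr` field name is an assumption about CudaPtr's layout):

buf = CUDArt.malloc(UInt8, 10)            # 10 bytes, non-null
voidptr = convert(Ptr{Void}, buf.ptr)     # pass this where the C API wants void*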

Cannot find library libwrapcuda

julia> using CUDArt                                                                       
INFO: Precompiling module CUDArt...                                                       
ERROR: LoadError: LoadError: Cannot find libwrapcuda                                      
 in error at ./error.jl:21                                                                
 in include at ./boot.jl:261                                                              
 in include_from_node1 at ./loading.jl:320                                                
 in include at ./boot.jl:261                                                              
 in include_from_node1 at ./loading.jl:320                                                
 [inlined code] from none:2                                                               
 in anonymous at no file:0                                                                
 in process_options at ./client.jl:257                                                    
 in _start at ./client.jl:378                                                             
while loading /home/rluser/.julia/v0.4/CUDArt/src/libcudart-6.5.jl, in expression starting
 on line 53                                                                               
while loading /home/rluser/.julia/v0.4/CUDArt/src/CUDArt.jl, in expression starting on lin
e 27                                                                                      
ERROR: Failed to precompile CUDArt to /home/rluser/.julia/lib/v0.4/CUDArt.ji              
 in error at ./error.jl:21                                                                
 in compilecache at loading.jl:400                                                        
 in require at ./loading.jl:266                                                           

julia> Libdl.find_library(["libwrapcuda"],["/home/rluser/.julia/v0.4/CUDArt/deps/"])      
""     

julia> libwrapcuda = Libdl.find_library(["libwrapcuda"],[joinpath(Pkg.dir(), "CUDArt", "de
ps")])                                                                                    
""    

julia> cd(joinpath(Pkg.dir(), "CUDArt", "deps"))

shell> ls
build.jl  libwrapcuda.so  Makefile  utils.cu  utils.ptx  Windows.mk  wrapcuda.c
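
A hedged diagnostic sketch: find_library returns "" when it cannot dlopen any candidate, which can happen even though the .so exists, e.g. when one of its own dependencies (such as libcudart) is not on the loader path. Calling dlopen with the absolute path surfaces the underlying error:

Libdl.dlopen(joinpath(Pkg.dir("CUDArt"), "deps", "libwrapcuda.so"))
# if this throws, run `ldd libwrapcuda.so` in the deps directory to see which
# dependency is missing, and add its directory to LD_LIBRARY_PATH before starting Julia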

Makefile needs to select correct gcc compiler

Plainly invoking nvcc isn't guaranteed to work:

nvcc -ptx -gencode=arch=compute_20,code=sm_20 utils.cu
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
In file included from /opt/cuda/bin/..//include/cuda_runtime.h:78:0,
                 from <command-line>:0:
/opt/cuda/bin/..//include/host_config.h:119:2: error: #error -- unsupported GNU version! gcc versions later than 5 are not supported!
 #error -- unsupported GNU version! gcc versions later than 5 are not supported!
  ^~~~~
Makefile:15: recipe for target 'utils.ptx' failed
make: *** [utils.ptx] Error 1

Luckily, I've already implemented such a mechanism as part of CUDAdrv. Maybe we should move this functionality, and the entire compilation example, over to CUDArt (since CUDAdrv doesn't require nvcc to be installed except for that example)?
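
In the meantime, a hedged sketch of the usual workaround: point nvcc at a supported host compiler explicitly with its -ccbin flag (gcc-5 here is an assumption about what happens to be installed):

run(`nvcc -ccbin=gcc-5 -ptx -gencode=arch=compute_20,code=sm_20 utils.cu`)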

Compilation error

Hello

I'm trying to get CUDArt going on an HPC platform (meaning some libraries are in non-standard places). Could anyone shed light on the error below?

julia> using CUDArt
INFO: Precompiling module CUDArt...
ERROR: LoadError: LoadError: Cannot find libwrapcuda
in error at ./error.jl:21
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
[inlined code] from none:2
in anonymous at no file:0
in process_options at ./client.jl:252
in _start at ./client.jl:375
while loading /home/mcp50/.julia/v0.5/CUDArt/src/libcudart-6.5.jl, in expression starting on line 53
while loading /home/mcp50/.julia/v0.5/CUDArt/src/CUDArt.jl, in expression starting on line 27
ERROR: Failed to precompile CUDArt to /home/mcp50/.julia/lib/v0.5/CUDArt.ji
in error at ./error.jl:21
in compilecache at loading.jl:383
in require at ./loading.jl:250

broken in Julia v0.3

When I start with a clean environment and try to add and use CUDArt on Julia 0.3, I get the following error:

julia> using CUDArt
ERROR: Ref not defined
 in include at ./boot.jl:245
 in include_from_node1 at ./loading.jl:128
 in include at ./boot.jl:245
 in include_from_node1 at ./loading.jl:128
 in reload_path at loading.jl:152
 in _require at loading.jl:67
 in require at loading.jl:51
while loading /home/nlg-05/dy_052/julia-pkgdir/v0.3/CUDArt/src/stream.jl, in expression starting on line 46
while loading /home/nlg-05/dy_052/julia-pkgdir/v0.3/CUDArt/src/CUDArt.jl, in expression starting on line 53

Can we maybe tell Julia 0.3 to use an older version of the package by default? Or use more Compat? What is the right way to keep a package compatible with multiple Julia versions? (I am asking because I need to keep a couple of packages functional on Julia 0.3 and 0.4 for a while... :)

display CU_PARAM_TR_DEFAULT in checkdrv and what is it?

When I try this function on a server with a Tesla GPU (device initialization and other setup happen outside the function), cuModuleLoad returns -1 in the console. But I didn't find this return value in cuModuleLoad's documentation (all the documented return values are non-negative).

WARNING: /home/quaninfo/rogerluo/.julia/v0.4/Quantize/src/utils/cuda/cuMatrix.ptx
ERROR: LoadError: KeyError: -1 not found
 in checkdrv at /home/quaninfo/rogerluo/.julia/v0.4/CUDArt/src/module.jl:14
 in call at /home/quaninfo/rogerluo/.julia/v0.4/CUDArt/src/module.jl:24
 in diagexp at /home/quaninfo/rogerluo/.julia/v0.4/Quantize/src/utils/cuda/cuMatrix.jl:19
 in diagexp at /home/quaninfo/rogerluo/.julia/v0.4/Quantize/src/utils/cuda/cuMatrix.jl:29
 in realtimeop! at /home/quaninfo/rogerluo/.julia/v0.4/Quantize/src/Adiabatic/timeop.jl:5
 in next_timestep! at /home/quaninfo/rogerluo/.julia/v0.4/Quantize/src/Adiabatic/timeop.jl:20
 [inlined code] from util.jl:155
 in adia at /home/quaninfo/rogerluo/cooling-12.jl:8
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:320
 in process_options at ./client.jl:280
 in _start at ./client.jl:378
while loading /home/quaninfo/rogerluo/cooling-12.jl, in expression starting on line 30
function diagexp(A::CudaArray{Complex64})
    md = CuModule("$dir/src/utils/cuda/cuMatrix.ptx", false)
    diagexp = CuFunction(md, "diagexp_cf")
    nsm = attribute(device(), rt.cudaDevAttrMultiProcessorCount)
    mul = min(32, ceil(Int, length(A)/(256*nsm)))
    expH = CudaArray(Complex64,size(A)...)
    launch(diagexp, mul*nsm, 256, (A,expH,length(A)))
    return expH
end

The CUDA documentation describes this as:

For texture references loaded into the module, use default texunit from texture reference.

But it works fine on my own laptop with a GT730M GPU.

As a newcomer to CUDA, I'm not familiar with this error/warning; does anyone know how to solve it?

Unstable performance

Disclaimer: I used matrix multiplication from CUBLAS.jl as an example operation, since CUDArt.jl doesn't provide anything like that, so the results may be biased because of it. Anyway, I'll be glad to get any pointers.

With random CudaArray and identity matrix like this:

const A = rand(1024, 256)
const Im = eye(256, 256)
const d_A = CudaArray(A)
const d_Im = CudaArray(Im)

I do several performance tests like this:

CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A)

If you are not familiar with BLAS (or just don't like cryptic names), this code multiplies d_A by the identity matrix d_Im and puts the result back into d_A. When I run the same test on the CPU, I always get very similar, consistent results. But on the GPU, the benchmarks give totally different results:

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   0.018428 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   4.238584 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   2.931953 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   3.775450 seconds (25.00 k allocations: 937.500 KB)

# after some time
 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   0.020394 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   4.812287 seconds (25.00 k allocations: 937.500 KB)

So the first call is really fast, but all subsequent calls take ~200x longer. After you wait some time (say, 10 seconds), multiplication becomes fast again, but only for a single test, and then it slows down again.

Is this expected behavior? Am I using CudaArrays correctly at all?
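
A hedged guess at the mechanism: cuBLAS launches are asynchronous, so a fast timing mostly measures queueing the launches, while a slow one stalls on work still in flight from an earlier run. Timing with an explicit synchronization (assuming CUDArt exposes device_synchronize, a cudaDeviceSynchronize wrapper) should give stable numbers either way:

@time begin
    for i = 1:1000
        CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A)
    end
    CUDArt.device_synchronize()   # wait for all queued work before stopping the clock
end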

cannot initialize CudaArray with Int32 size

This took me a while to figure out today, while debugging the new CUSPARSE package:

ERROR: LoadError: MethodError: `convert` has no method matching convert(::Type{CUDArt.CudaArray{T,N}}, ::Type{Int32}, ::Tuple{Int32})
This may have arisen from a call to the constructor CUDArt.CudaArray{T,N}(...),
since type constructors fall back to convert methods.
Closest candidates are:
  CUDArt.CudaArray(::Type{T}, !Matched::Integer...)
  CUDArt.CudaArray(::Type{T}, !Matched::Tuple{Vararg{Int64}})
  call{T}(::Type{T}, ::Any)
  ...

The relevant part from arrays.jl:

CudaArray(T::Type, dims::Integer...) = CudaArray(T, dims)

function CudaArray(T::Type, dims::Dims)

i.e. CudaArray(T,Int32) is happy, but CudaArray(T,(Int32,)) is not. Dims is defined strictly as Int64:

Dims => Tuple{Vararg{Int64}}

Should Dims also be Tuple{Vararg{Integer}} ? Maybe this is a Base problem...
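
A hedged user-side workaround until this is settled: widen the dimensions to Int before constructing the array, e.g.

dims32 = (Int32(256), Int32(100))           # dimensions arriving as Cint, e.g. from CUSPARSE
A = CudaArray(Float32, map(Int, dims32))    # map over the tuple yields a Tuple{Vararg{Int}}, i.e. a valid Dims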

device_reset not exported

Was this an intentional omission? You can't follow along with the README unless you explicitly import device_reset.
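
Until it is exported, the workaround is to qualify the call or import the name explicitly:

using CUDArt
CUDArt.device_reset(0)
# or: import CUDArt: device_reset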

CudaArray doesn't have a finalizer, CudaPitchedArray does

I was looking at how long a pointer created by CudaArray is guaranteed to stay alive, and I noticed that CudaArray doesn't call finalizer(..., free) on itself in its constructor, so its pointer stays alive until the array is freed explicitly. CudaPitchedArray behaves very differently: losing a reference to a CudaPitchedArray seems to deallocate the underlying pointer.

Is there a reason for implementing such different behaviours? It doesn't get described in the README, and I think it would be less confusing to have the two array types behave similarly (one way or the other). I don't know what was intended there, but I was expecting CudaArray to be finalized as well.

Another confusing thing is that malloc adds its own finalizer (https://github.com/JuliaGPU/CUDArt.jl/blob/master/src/pointer.jl#L38), but if the reference to a CudaPitchedArray gets lost, doesn't the pointer get freed regardless of whether there are still references to the pointer itself?

I'm not really saying any of this is strictly wrong, just that the rules governing malloc, free, and finalizers should be as clear as possible.
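
For what it's worth, a hedged sketch of giving a CudaArray the same lifetime behavior CudaPitchedArray reportedly has, by registering a free finalizer on it manually (using the finalizer(obj, f) form of Julia 0.4-0.6):

a = rand(5, 3)
g = CUDArt.CudaArray(a)
finalizer(g, CUDArt.free)   # free the device memory once g becomes unreachable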

error could not load library "libnvidia-ml"

julia> versioninfo()
Julia Version 0.6.0-pre.alpha.325
Commit 980119a* (2017-03-30 16:30 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> Pkg.test("CUDArt")
INFO: Testing CUDArt
ERROR: LoadError: LoadError: error compiling #devices#1: error compiling filter_free: could not load library "libnvidia-ml"
The specified module could not be found.

Stacktrace:
 [1] devices(::Function) at C:\Users\Mus\.julia\v0.6\CUDArt\src\device.jl:52
 [2] include_from_node1(::String) at .\loading.jl:539
 [3] include(::String) at .\sysimg.jl:14
 [4] include_from_node1(::String) at .\loading.jl:539
 [5] include(::String) at .\sysimg.jl:14
 [6] process_options(::Base.JLOptions) at .\client.jl:305
 [7] _start() at .\client.jl:371
while loading C:\Users\Mus\.julia\v0.6\CUDArt\test\gc.jl, in expression starting on line 7
while loading C:\Users\Mus\.julia\v0.6\CUDArt\test\runtests.jl, in expression starting on line 1
================================================================================[ ERROR: CUDArt ]=================================================================================


failed process: Process(`'C:\Julia\Julia-0.6-latest\bin\julia' -Cnative '-JC:\Julia\Julia-0.6-latest\lib\julia\sys.dll' --compile=yes --depwarn=yes --check-bounds=yes --code-coverage=none --color=yes --compilecache=yes 'C:\Users\Mus\.julia\v0.6\CUDArt\test\runtests.jl'`, ProcessExited(1)) [1]

==================================================================================================================================================================================

ERROR: CUDArt had test errors

Cannot find CUDA runtime API in Ubuntu 14.04, CUDA 6.5

Before sending a PR, I tested the code on my Linux system.
But even on the master branch, I got an error that the CUDA runtime API cannot be found.

I had to update the location of libcudart as in the commit below: /usr/local/cuda -> /usr/local/cuda/lib64

moon6pence@923e05c

libcuda has no problem loading, but I updated its location to where it really is.

OOB during package build

Building CUDArt fails. After solving the rest of the problems I encountered along the way, I now get this error, which I cannot solve:

LoadError: BoundsError: attempt to access 0-element Array{String,1} at index [1]
...

I'm not an expert on this, so I can give more details if you want, but I preferred to keep the thread as simple as possible.

Conversion error in drawing the Julia set

I'm trying to use the following code to draw the Julia set (using Julia 0.3.5):

julia.jl

using CUDArt
using PyPlot

w = 2048 * 2
h = 2048 * 2
q = [complex64(r, i) for i = 1 : -(2.0 / w) : -1, r = -1.5 : (3.0 / h) : 1.5]

julia(q :: Array{Complex64}, maxiter :: Integer) = begin
    res = CUDArt.devices(dev -> CUDArt.capability(dev)[1] >= 2, nmax = 1) do devlist
        CUDArt.device(devlist[1])

        prg = CUDArt.CuModule("julia.ptx") do md
            juliafn = CUDArt.CuFunction(md, "julia")
            out = CUDArt.CudaArray(Uint16, size(q))
            CUDArt.launch(juliafn, length(q), 1, (q, out, uint16(maxiter)))
            CUDArt.to_host(out)
        end
    end

    res
end

j = julia(q, 200)
imshow(j, extent = [-1.5, 1.5, -1.0, 1.0])

julia.cu (compiled to julia.ptx via nvcc -ptx julia.cu)

extern "C"
{
    __global__ void julia(float2* q, unsigned short* out, const unsigned short maxiter)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        float nreal = 0.0f;
        float real = q[idx].x;
        float imag = q[idx].y;

        out[idx] = 0;

        for (int i = 0; i < maxiter; i++)
        {
            if (real * real + imag * imag > 4.0f)
            {
                out[idx] = i;
            }

            nreal = real * real - imag * imag + (-0.5f);
            imag = 2 * real * imag + 0.75f;
            real = nreal;
        }
    }
}

When I try to run the code, it exits with the error 'rawpointer' has no method matching rawpointer(::Array{Complex{Float32},2}). Is this a missing convert for the type, or is there something wrong with my code? Sorry for not posting this on StackOverflow or /r/julia, but I expect people here are better informed about the peculiarities of CUDArt itself.
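
A hedged guess at the cause and a fix sketch: launch needs device-resident array arguments, and q is still a host Array, which is why rawpointer has no method for it. Copying q to the device first (inside the CuModule do-block) would look something like this:

d_q = CUDArt.CudaArray(q)                   # host -> device copy of the input grid
out = CUDArt.CudaArray(Uint16, size(q))
CUDArt.launch(juliafn, length(q), 1, (d_q, out, uint16(maxiter)))
res = CUDArt.to_host(out)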

triggering gc based on gpu memory

Is there any progress on making gc sensitive to remaining gpu memory? The following example still fails with an out-of-memory error. It works if you uncomment the manual gc() line.

using CUDArt
for i=1:1000
    a=CudaArray(Float64,1000000)
    # i%100==0 && gc()                                                                                                     
end
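
Not a real fix, but a hedged sketch of a user-side shim while the GC remains unaware of device memory: retry a failed allocation once after forcing a collection, so finalizers get a chance to release stale device buffers.

function gc_malloc(T, dims)
    try
        return CudaArray(T, dims)
    catch
        gc()                        # let finalizers run and release device memory
        return CudaArray(T, dims)
    end
end

for i = 1:1000
    a = gc_malloc(Float64, 1000000)
end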

Support for ptx modules with external functions

I'm not certain whether this issue is most appropriately situated in the CUDArt or CUDAdrv repositories or both. I'm posting in both, but will remove it from one or the other if advised so.

I am interested in the ability to compile PTX modules that include external functions and then import those as functions to use/launch from within Julia. The particular example I was recently working with involved CUBLAS functions, but the principle is far wider. I inquired about the issue on Stack Overflow here. I had thought it would be relatively manageable, but from the answer I got, it actually sounds like it is quite complex and involved. On the plus side, it does appear that there are precedents for establishing this kind of capability, e.g. with the JCUDA framework for Java.

I could potentially assist with such an implementation, but I doubt I'd be well positioned to take it on all myself.

Thoughts?

strange splatting bug

I am not sure whether this bug belongs here or in Base, or whether I should get my head checked, but this got me stumped. It does not happen when I replace CudaArrays with Arrays, so I am posting it here:

using CUDArt

function bmultest5(a::CudaArray, d::Dims)
    @show typeof(size(a))
    @show size(a)
    b = reinterpret(eltype(a), a, d)
    @show typeof(size(b))
    @show size(b)
    @show Cint[size(b)...]
end

bmultest5(CudaArray(zeros(2)), (2,1,1,1))

gives

typeof(size(a)) = Tuple{Int64}
size(a) = (2,)
typeof(size(b)) = Tuple{Int64,Int64,Int64,Int64}
size(b) = (2,1,1,1)
Cint[size(b)...] = Int32[2,1]

i.e. if size(b) has more than 2 elements, only the first two show up in the result of Cint[size(b)...]!
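
A hedged workaround until the cause is understood: build the Vector without splatting the tuple at all, which sidesteps the problematic code path.

dims = convert(Vector{Cint}, collect(size(b)))   # collect gives a Vector{Int}; convert produces the Vector{Cint}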

error in running finalizer: ErrorException("auto_unbox: unable to determine argument type")

Continuing from: https://groups.google.com/d/topic/julia-dev/NqPz4f_0VLg/discussion

I get this error intermittently on exit from Julia. I was able to trace it to the finalizer of CudaPtr (pointer.jl:44), in particular to the statement haskey(cuda_ptrs, p). Here is what I know so far:

  • If I comment out the CudaPtr finalizer (pointer.jl:38) the error disappears.
  • If I comment out haskey(cuda_ptrs, p) (pointer.jl:46) the error disappears.
  • The error does not consistently appear: Sometimes if a gc or a user-free occurs before Julia exit and calls the CudaPtr finalizer the error disappears. I am still trying to pinpoint the exact condition.
  • The debugger backtrace looks like this is happening during compilation maybe? However I am having trouble accessing the C variables in gdb so I am not sure: https://gist.github.com/denizyuret/161cf7e8b79266809a27
