
CUDArt.jl's People

Contributors

adambrewster, bpiwowar, denizyuret, emreyolcu, femtocleaner[bot], jobjob, kristofferc, lucasb-eyer, maleadt, malmaud, mikeinnes, moon6pence, musically-ut, musm, mweastwood, simondanisch, timholy, vchuravy


CUDArt.jl's Issues

No method matching reset(::CUDAdrv.CuPrimaryContext)

julia> using CUDArt

julia> result = devices(dev->true) do devlist
           # Code that does GPU computations
       end
WARNING: destroy(ctx::CuContext) is deprecated, use destroy!(ctx) instead.
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:70
 [2] destroy(::CUDAdrv.CuContext) at ./deprecated.jl:57
 [3] device_reset(::Int64) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:42
 [4] close(::Array{Int64,1}) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:186
 [5] devices(::##1#3, ::Array{Int64,1}) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:84
 [6] devices(::Function, ::Function) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:74
 [7] eval(::Module, ::Any) at ./boot.jl:235
 [8] eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:66
 [9] macro expansion at ./REPL.jl:97 [inlined]
 [10] (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:73
while loading no file, in expression starting on line 0
ERROR: MethodError: no method matching reset(::CUDAdrv.CuPrimaryContext)
Closest candidates are:
  reset(::Base.LibuvStream) at stream.jl:1129
  reset(::T<:IO) where T<:IO at io.jl:622
Stacktrace:
 [1] device_reset(::Int64) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:44
 [2] close(::Array{Int64,1}) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:186
 [3] devices(::##1#3, ::Array{Int64,1}) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:84
 [4] devices(::Function, ::Function) at /home/julieta/.julia/v0.6/CUDArt/src/device.jl:74

julia>

This seems like a mismatch between CUDArt and the CUDAdrv API. I get this on both Julia 0.5.2 and 0.6.

Precompile Error

This error is baffling me, and I am not even sure it is related to CUDArt, but I get the following on Julia 0.6 (compiled myself):

julia> using CUDArt
INFO: Precompiling module CUDArt.
WARNING: `@windows` is deprecated, use `@static is_windows()` instead
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:64
 [2] @windows(::ANY, ::ANY) at ./deprecated.jl:446
 [3] include_from_node1(::String) at ./loading.jl:539
 [4] include(::String) at ./sysimg.jl:14
 [5] include_from_node1(::String) at ./loading.jl:539
 [6] include(::String) at ./sysimg.jl:14
 [7] anonymous at ./<missing>:2
 [8] eval(::Module, ::Any) at ./boot.jl:236
 [9] process_options(::Base.JLOptions) at ./client.jl:279
 [10] _start() at ./client.jl:368
while loading /home/ju17693/.julia/v0.6/CUDArt/src/libcudart-6.5.jl, in expression starting on line 23
WARNING: Base.WORD_SIZE is deprecated.
  likely near /home/ju17693/.julia/v0.6/CUDArt/src/libcudart-6.5.jl:36
ERROR: LoadError: LoadError: LoadError: UndefVarError: textureReference not defined
Stacktrace:
 [1] include_from_node1(::String) at ./loading.jl:539
 [2] include(::String) at ./sysimg.jl:14
 [3] include_from_node1(::String) at ./loading.jl:539
 [4] include(::String) at ./sysimg.jl:14
 [5] include_from_node1(::String) at ./loading.jl:539
 [6] include(::String) at ./sysimg.jl:14
 [7] anonymous at ./<missing>:2
while loading /home/ju17693/.julia/v0.6/CUDArt/src/../gen-6.5/gen_libcudart.jl, in expression starting on line 515
while loading /home/ju17693/.julia/v0.6/CUDArt/src/libcudart-6.5.jl, in expression starting on line 44
while loading /home/ju17693/.julia/v0.6/CUDArt/src/CUDArt.jl, in expression starting on line 27
ERROR: Failed to precompile CUDArt to /home/ju17693/.julia/lib/v0.6/CUDArt.ji.
Stacktrace:
 [1] compilecache(::String) at ./loading.jl:673
 [2] require(::Symbol) at ./loading.jl:460

Here is my Julia versioninfo:

julia> versioninfo()
Julia Version 0.6.0-dev.2360
Commit 8c2d9db* (2017-01-25 20:04 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Support Windows

Hi, all.

I'm trying to run CUDArt.jl on Windows. I'm comfortable using Julia on Linux systems, but supporting many platforms is always a good thing. Furthermore, my customers in the optics lab are Muggles - they are not familiar with programming or Linux systems.

OK, idle talk aside: it should be simple to make this run on Windows - mainly adding the correct paths and names of the CUDA runtime and driver DLLs.

For CUDA 5.0 64bit

  • CUDA driver API : C:\Windows\system32\nvcuda.dll
  • CUDA runtime API : C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\bin\cudart64_50_35.dll

The driver API is simple, but the DLL name of the runtime API is more complicated; I think it has to be checked for several CUDA releases.
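
A hedged sketch of what the lookup could look like, assuming the toolkit's bin directory is pointed to by the CUDA_PATH environment variable (which the CUDA installer normally sets); every DLL name here other than cudart64_50_35 is a guess that would need to be checked against the corresponding release:

const cuda_bin = joinpath(get(ENV, "CUDA_PATH", ""), "bin")
const libcudart = Libdl.find_library(
    ["cudart64_50_35", "cudart64_55", "cudart64_60", "cudart64_65"],
    [cuda_bin])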

I checked that both the driver and runtime APIs load properly; the remaining TODOs are:

  • Check the library name of the runtime API in other CUDA releases (maybe 4.x, 5.0, 5.5, 6.0 will be enough)
  • Build wrapcuda.dll in a MinGW + CUDA environment
  • Confirm all test cases pass.

I will log progress here, and I hope to send my first PR to the repository soon. Thanks!

Passing arbitrary struct arguments by value to kernels with StrPack?

If I understand correctly, I can use StrPack to ensure that a Julia value of a composite type can be converted to/from a string of bytes consistent with the binary representation expected in C code. I'm not too sure, but IIUC just passing a pointer to a Julia object is not (?) necessarily safe.

But I can't figure out how to launch a kernel using the binary representation of a Julia composite type. In other words, something like the following (which is clearly wrong, since it can't be distinguished from passing a host pointer to a kernel, and I also get a completely different error, shown below):

@struct type A; x :: Cint; end
iostr = IOBuffer(); pack(iostr, A(1))
...
launch(..., (iostr.data,))

where in C code:

struct A { int x; }
__global__ void kernel_fun(A a);

I looked in execute.jl, and it seems this is not implemented: https://github.com/JuliaGPU/CUDArt.jl/blob/master/src/execute.jl#L2. I get an error that rawpointer is undefined for Array{UInt8,1}, which is what the IOBuffer's data field holds.

I think whenever a value of a composite type is passed to a kernel, it might make sense to pass a pointer to its binary representation to cuLaunchKernel, so that the argument gets passed by value to the kernel.

[I should add] that calling cuLaunchKernel with ccall and constructing kernel arguments myself seems to work fine so far (although I haven't tested everything very much yet).
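For reference, a minimal sketch of that manual approach, under several assumptions: fun_handle stands in for a CUfunction handle obtained from the driver API, libcuda for the loaded driver library, and argument size/alignment handling is glossed over. cuLaunchKernel takes a void** array of per-argument pointers, so handing it a pointer to the packed bytes is what passes the struct by value.

using StrPack

@struct type A
    x::Cint
end

iostr = IOBuffer()
pack(iostr, A(1))
argbytes = takebuf_array(iostr)               # byte representation of A(1) (Julia 0.4/0.5)

# fun_handle (a CUfunction, Ptr{Void}) and libcuda are placeholders, not real names from CUDArt
kernelparams = Ptr{Void}[pointer(argbytes)]   # one entry per kernel argument
ccall((:cuLaunchKernel, libcuda), Cint,
      (Ptr{Void}, Cuint, Cuint, Cuint, Cuint, Cuint, Cuint, Cuint,
       Ptr{Void}, Ptr{Ptr{Void}}, Ptr{Ptr{Void}}),
      fun_handle, 1, 1, 1, 1, 1, 1, 0, C_NULL, kernelparams, C_NULL)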

Intermittent GC-related test failure (`isempty(cuda_ptrs)`)

@maleadt noticed that every so often the CI would fail due to a test error in gc.jl.
The specific test that fails is: https://github.com/JuliaGPU/CUDArt.jl/blob/7c7157019fa69539b2547b5317d9770c89d7d462/test/gc.jl#L57

g = CUDArt.CudaArray(a)
# In case the next gc() test yields an error, the next two lines let
# us do some archaeology
dictcopy = deepcopy(CUDArt.cuda_ptrs)
gptrcopy = copy(pointer(g))
gc()   # Check that this doesn't delete the new g
@test !isempty(CUDArt.cuda_ptrs)
h_g = CUDArt.to_host(g)
CUDArt.free(g)

With a little bit of debug information added and TRACE=1 I get:
while TRACE=1 julia --compilecache=no -e 'Pkg.test("CUDArt")'; do :; done

TRACE: Malloc CudaPtr Ptr{Void} @0x0000002303ee0000
TRACE: Finalizing CuContext at Ptr{Void} @0x00007f06843d85d0
TRACE: Not destroying context CUDAdrv.CuContext(Ptr{Void} @0x0000000003f03500,false,false) because we don't own it
TRACE: Freeing CudaPtr CUDArt.CudaPtr{Float64}(Ptr{Float64} @0x0000002303ee0000,CUDAdrv.CuContext(Ptr{Void} @0x0000000003f03500,false,false))

I am currently travelling but should get to this before long.

warning in julia v0.4

I get this with the latest Julia and don't quite know how to fix it as I think gen_libcudart_h.jl is an automatically generated file.

WARNING: uint32(x) is deprecated, use UInt32(x) instead.
while loading /home/nlg-05/dy_052/kuparser/profile/v0.4/CUDArt/src/../gen-6.5/gen_libcudart_h.jl, in expression starting on line 41

[Question] GPU->CPU Copy Speed?

I had a question about using CUDArt.jl in conjunction with CUBLAS.jl. I'm hoping to find out what I can do to improve the speed of copying memory from the GPU device back to CPU-accessible memory. I'm not sure whether there is something I'm not implementing properly or something intrinsic that I'm not properly understanding. I've conducted a series of tests using the BLAS Level-3 function gemm, which I detail below (exported from my IJulia notebook to Markdown). I should note that I'm using Julia v0.4-rc2 for these experiments.

Specifically, I've observed that for the implementation I detail below, the GPU->CPU copy! is about two orders of magnitude slower than the CPU->GPU copy. Surely I've done something wrong, right? If anyone could enlighten me about what I'm doing wrong, I'd be elated.

Thanks!

Testing CuBLAS and CUDArt for Julia

After finally getting NVCC to work on OSX, we can start using the CUDA-themed BLAS packages written for Julia. In this notebook we will document how to utilize the necessary datatypes and show comparisons between the CPU and GPU implementations of common BLAS functions.

I. Calling and using the Libraries

Let's first make sure that we have updated and built the libraries. Because of the recent changes in Julia between v0.3 and v0.4, we expect quite a number of warnings, and even errors, to pop up during the testing phase. However, the core functionality of the packages should be there.

# Update and build
Pkg.update()
Pkg.build("CUDArt")
Pkg.build("CUBLAS")
using CUDArt
using CUBLAS
using Base.LinAlg.BLAS

II. Experiment Parameters

We will focus our comparisons on the BLAS function gemm which computes
$$ \mathbf{C} \leftarrow \alpha \mathbf{A}\mathbf{B} + \beta \mathbf{C}.$$
We will assume that all of these matrices are dense and real. For our experiments we will set
$\mathbf{A}: (n \times m)$, $\mathbf{B}: (m \times k)$, $\mathbf{C}: (n \times k)$, and
$\alpha = \beta = 1.0$.

# Dimensions
n = 256
m = 784
k = 100
# Scalings
a = 1.0
b = 1.0
# Initialization
A = randn(n,m);
B = randn(m,k);
C = randn(n,k);
whos()
                             A   1568 KB     256x784 Array{Float64,2} : [1.7596…
                             B    612 KB     784x100 Array{Float64,2} : [-1.596…
                          Base  26665 KB     Module : Base
                             C    200 KB     256x100 Array{Float64,2} : [-0.344…
                        CUBLAS    545 KB     Module : CUBLAS
                        CUDArt    573 KB     Module : CUDArt
                        Compat     58 KB     Module : Compat
                          Core   3218 KB     Module : Core
                DataStructures    337 KB     Module : DataStructures
                        IJulia    368 KB     Module : IJulia
                IPythonDisplay     26 KB     Module : IPythonDisplay
                          JSON    195 KB     Module : JSON
                          Main  33724 KB     Module : Main
                       MyGemm!    966 bytes  Function : MyGemm!
                  MyTimedGemm!     15 KB     Function : MyTimedGemm!
                        Nettle    187 KB     Module : Nettle
                           ZMQ     80 KB     Module : ZMQ
                             a      8 bytes  Float64 : 1.0
                             b      8 bytes  Float64 : 1.0
                           d_A     40 bytes  CUDArt.CudaArray{Float64,2}(CUDArt…
                           d_B     40 bytes  CUDArt.CudaArray{Float64,2}(CUDArt…
                           d_C     40 bytes  CUDArt.CudaArray{Float64,2}(CUDArt…
                           dev      8 bytes  Int64 : 0
                             k      8 bytes  Int64 : 100
                             m      8 bytes  Int64 : 784
                             n      8 bytes  Int64 : 256
                        result      0 bytes  Void : nothing

III. Baseline Performance

We will now look at the timing of the base OpenBLAS implementation of gemm, which runs on the CPU, alone.

# Warmup
gemm!('N','N',a,A,B,b,C);
gemm!('N','N',a,A,B,b,C);
# Time: 5 runs
@time gemm!('N','N',a,A,B,b,C);
@time gemm!('N','N',a,A,B,b,C);
@time gemm!('N','N',a,A,B,b,C);
@time gemm!('N','N',a,A,B,b,C);
@time gemm!('N','N',a,A,B,b,C);
  0.000769 seconds (4 allocations: 160 bytes)
  0.000797 seconds (4 allocations: 160 bytes)
  0.000810 seconds (4 allocations: 160 bytes)
  0.000917 seconds (4 allocations: 160 bytes)
  0.001528 seconds (4 allocations: 160 bytes)

IV. CUDArt Datatypes

Our first step in being able to use CuBLAS is to initialize our GPU device and make on-device copies of the data structures we're interested in. Below we detail how to fence off the GPU code and ensure that proper garbage collection is performed on the device via CUDArt.

# Assign Device
device(0)
device_reset(0) 
device(0)
# Create and Copy "A"
d_A = CudaArray(A)
copy!(d_A,A)
# Create and Copy "B"
d_B = CudaArray(B)
copy!(d_B,B)
# Create and Copy "C"
d_C = CudaArray(C)
copy!(d_C,C)
# Show 
println("CUDArt Data Pointer Descriptions")
println(d_A)
println(d_B)
println(d_C)
CUDArt Data Pointer Descriptions
CUDArt.CudaArray{Float64,2}(CUDArt.CudaPtr{Float64}(Ptr{Float64} @0x0000000d00a80000),(256,784),0)
CUDArt.CudaArray{Float64,2}(CUDArt.CudaPtr{Float64}(Ptr{Float64} @0x0000000d00c20000),(784,100),0)
CUDArt.CudaArray{Float64,2}(CUDArt.CudaPtr{Float64}(Ptr{Float64} @0x0000000d00d20000),(256,100),0)

V. CuBLAS Timings

Now, let's look at the time requirements for just running gemm. Note that this does not include the time spent copying memory to and from the device. For now, let's limit ourselves to a direct comparison of the BLAS function implementations alone.

# Warmup
CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
# Time: 5 runs
@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
@time CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C);
  0.000033 seconds (24 allocations: 1.016 KB)
  0.000053 seconds (24 allocations: 1.016 KB)
  0.000053 seconds (24 allocations: 1.016 KB)
  0.000045 seconds (24 allocations: 1.016 KB)
  0.000037 seconds (24 allocations: 1.016 KB)

So, we can see from the above that we are potentially looking at an order-of-magnitude improvement in computation time.

# End Session
device_reset(0)

VI. CuBLAS Timings: With Memory Copying

We will now look at the situation where we declare a local function that performs all of the host-to-device and device-to-host memory copying required for the GPU implementation. Our goal is to see exactly how much advantage we retain in a realistic comparison.

function MyTimedGemm!(tA,tB,a,A,d_A,B,d_B,b,C,d_C)
    # Copy to device
    @printf "(A->d_A)       " 
        @time copy!(d_A,A)
    @printf "(B->d_B)       " 
        @time copy!(d_B,B)
    @printf "(C->d_C)       " 
        @time copy!(d_C,C)
    # Run device-level BLAS
    @printf "(CUBLAS.gemm!) "
        @time CUBLAS.gemm!(tA,tB,a,d_A,d_B,b,d_C)
    # Gather result
    @printf "(d_C->C)       "
        @time copy!(C,d_C)
end

device(0)
device_reset(0)
device(0)

# These pointers can be pre-allocated
d_A = CudaArray(A)
d_B = CudaArray(B)
d_C = CudaArray(C)

# Warmup
println("Warmups============")
MyTimedGemm!('N','N',a,A,d_A,B,d_B,b,C,d_C);
MyTimedGemm!('N','N',a,A,d_A,B,d_B,b,C,d_C);
println("Actual=============")
@time MyTimedGemm!('N','N',a,A,d_A,B,d_B,b,C,d_C);
Warmups============
(A->d_A)         0.000204 seconds
(B->d_B)         0.000317 seconds
(C->d_C)         0.000075 seconds
(CUBLAS.gemm!)   0.078434 seconds (20 allocations: 880 bytes)
(d_C->C)         0.006428 seconds
(A->d_A)         0.000442 seconds
(B->d_B)         0.000361 seconds
(C->d_C)         0.000063 seconds
(CUBLAS.gemm!)   0.000043 seconds (20 allocations: 880 bytes)
(d_C->C)         0.006849 seconds
Actual=============
(A->d_A)         0.000214 seconds
(B->d_B)         0.000307 seconds
(C->d_C)         0.000076 seconds
(CUBLAS.gemm!)   0.000038 seconds (20 allocations: 880 bytes)
(d_C->C)         0.007070 seconds
  0.008016 seconds (199 allocations: 7.813 KB)

We can see that the act of reading the matrix $\mathbf{C}$ back from the device to the CPU actually incurs a huge cost. In fact, the cost is so high as to entirely remove any time advantage we obtain from the CuBLAS implementation of gemm.
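
A hedged aside on the likely mechanism: cuBLAS launches are asynchronous, so the d_C->C copy! is the first call that has to wait for the kernel to finish, and its timing absorbs the gemm! time. A sketch of separating the two, assuming CUDArt exposes device_synchronize() (a wrapper around cudaDeviceSynchronize):

CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C)
@time begin
    CUBLAS.gemm!('N','N',a,d_A,d_B,b,d_C)
    CUDArt.device_synchronize()   # stop the clock only after the kernel has completed
end
@time copy!(C,d_C)                # now this measures only the transfer itself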

Spurious failures in cudacopy! with "invalid argument" error

This is probably related to #17 and how finalizers work.

The following function:

function test1()
  devices(dev->true, nmax=1) do devlist
    dev = devlist[1]
    device(dev)

    sz = (801, 802)
    x = CudaPitchedArray[]
    for i=1:10
      push!(x, CudaPitchedArray(Float32, sz))
    end
    for i=1:10
      @time for j=1:length(x)
        to_host(x[j])
      end
      println("Finished iteration $i.")
    end
  end
end

produces the following error the second time it is run. If I run gc() between the two runs, there is no error.

julia> versioninfo()
Julia Version 0.4.0-dev+2876
Commit f164ac1 (2015-01-22 22:58 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> Test.test1()
elapsed time: 0.014716371 seconds (25716080 bytes allocated)
Finished iteration 1.
elapsed time: 0.071074413 seconds (25716080 bytes allocated, 74.17% gc time)
Finished iteration 2.
elapsed time: 0.010117429 seconds (25716080 bytes allocated)
Finished iteration 3.
elapsed time: 0.063922907 seconds (25716080 bytes allocated, 81.96% gc time)
Finished iteration 4.
elapsed time: 0.010150908 seconds (25716080 bytes allocated)
Finished iteration 5.
elapsed time: 0.06263613 seconds (25716080 bytes allocated, 83.62% gc time)
Finished iteration 6.
elapsed time: 0.010153057 seconds (25716080 bytes allocated)
Finished iteration 7.
elapsed time: 0.062723099 seconds (25716080 bytes allocated, 83.70% gc time)
Finished iteration 8.
elapsed time: 0.06263111 seconds (25716080 bytes allocated, 83.61% gc time)
Finished iteration 9.
elapsed time: 0.01013381 seconds (25716080 bytes allocated)
Finished iteration 10.

julia> Test.test1()
WARNING: CUDA error triggered from:

 in checkerror at /***/.julia/v0.4/CUDArt/src/libcudart-6.5.jl:15
 in cudacopy! at /***/.julia/v0.4/CUDArt/src/arrays.jl:313
 in cudacopy! at /***/.julia/v0.4/CUDArt/src/arrays.jl:288
 in copy! at /***/.julia/v0.4/CUDArt/src/arrays.jl:282
 in to_host at /***/.julia/v0.4/CUDArt/src/arrays.jl:87
 in anonymous at /***/test.jl:17
 in devices at /***/.julia/v0.4/CUDArt/src/device.jl:57
 in devices at /***/.julia/v0.4/CUDArt/src/device.jl:49
 in test1 at /***/test.jl:6
ERROR: "invalid argument"
 in checkerror at /***/.julia/v0.4/CUDArt/src/libcudart-6.5.jl:16
 in cudacopy! at /***/.julia/v0.4/CUDArt/src/arrays.jl:313
 in cudacopy! at /***/.julia/v0.4/CUDArt/src/arrays.jl:288
 in copy! at /***/.julia/v0.4/CUDArt/src/arrays.jl:282
 in to_host at /***/.julia/v0.4/CUDArt/src/arrays.jl:87
 in anonymous at /***/test.jl:17
 in devices at /***/.julia/v0.4/CUDArt/src/device.jl:57
 in devices at /***/.julia/v0.4/CUDArt/src/device.jl:49
 in test1 at /***/test.jl:6

This is with the git head version of CUDArt, and probably has something to do with a garbage collection pass trying to collect a Cuda pointer that came from a previous device context (before device_reset called cudaDeviceReset), so that the pointer is invalid in the new device context.

This is very irritating when testing CUDA code in the REPL, where the same function is run over and over again (sometimes not even correctly), so resetting everything reliably is a must.

CUDArt assumptions not robust

There is some path handling in the following file that produces errors for standard 64-bit CUDA installations:
/root/.julia/v0.5/CUDArt/src/CUDArt.jl

Symptom: Pkg.add("CUDArt") fails

In particular the line:
const libcuda = Libdl.find_library(["libcuda"], ["/usr/lib/", "/usr/local/cuda/lib"])

wants to be
const libcuda = Libdl.find_library(["libcudart"], ["/usr/lib/", "/usr/local/cuda/lib64"])

Three issues:

  • Since I don't have a 32-bit CUDA install, I can't make a more robust suggestion than to say that the CUDA_HOME environment variable should be checked (a sketch follows the list)
  • I suspect the find_library call can take additional paths, so cuda/lib64 could be put ahead of cuda/lib
  • As of at least CUDA 8, it's now libcudart instead of libcuda
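
A minimal sketch along those lines, assuming CUDA_HOME as the environment variable and keeping the existing fallback directories; the names and search order would still need refining:

const cuda_home = get(ENV, "CUDA_HOME", "/usr/local/cuda")
const libcudart = Libdl.find_library(
    ["libcudart"],
    [joinpath(cuda_home, "lib64"), joinpath(cuda_home, "lib"), "/usr/lib/"])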

julia> versioninfo()
Julia Version 0.5.1-pre+31
Commit 6a1e339 (2016-11-17 17:50 UTC)
Platform Info:
System: Linux (powerpc64le-linux-gnu)
CPU: unknown
WORD_SIZE: 64
BLAS: libopenblas (NO_AFFINITY POWER8)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.9.0 (ORCJIT, pwr8)

Can CUDArt be loaded with VS2015?

I have the following problem when I try to add CUDArt.

I have already installed VS2015, but it seems that I need VS2013, 2012, or 2010. Is it necessary for me to install another version of Visual Studio? If not, what should I do?

Pkg.add("CUDArt")
INFO: Cloning cache of CUDArt from git://github.com/JuliaGPU/CUDArt.jl.git
INFO: Installing CUDArt v0.2.3
INFO: Building CUDArt
===============================[ ERROR: CUDArt ]================================

LoadError: Cannot find proper Visual Studio installation. VS 2013, 2012, or 2010 is required.
while loading C:\Users\Miller.julia\v0.4\CUDArt\deps\build.jl, in expression starting on line 10

================================[ BUILD ERRORS ]================================

WARNING: CUDArt had build errors.

  • packages with build errors remain installed in C:\Users\Miller.julia\v0.4
  • build the package(s) and all dependencies with Pkg.build("CUDArt")
  • build a single package by running its deps/build.jl script

INFO: Package database updated

Updated build script for Visual Studio 2017 but getting compile errors

I updated the build script to target the changed build-directory layout and so forth for Visual Studio 2017:

using Compat

if is_windows()

    vswhere = download("https://github.com/Microsoft/vswhere/releases/download/1.0.58/vswhere.exe")

    vs_install_path = chomp(readstring(`$vswhere  -latest -property installationPath`))
    if !isdir(vs_install_path)
        error("Cannot find a proper Visual Studio installation. Make sure Visual Studio is installed.")
    end

    vs_version_major = parse(split(chomp(readstring(`$vswhere  -latest -property installationVersion`)), '.')[1])
    if vs_version_major >= 15
        vs_cmd_prompt = joinpath(vs_install_path, "VC", "Auxiliary", "Build", "vcvarsall.bat")
    else
        vs_cmd_prompt = joinpath(vs_install_path, "VC", "vcvarsall.bat")
    end

    # check whether 32 or 64 bit archtecture
    # NOTE: Actually, nvcc in x86 visual studio command prompt doesn't make 32-bit binary
    #       It depends on whether CUDA toolkit is 32bit or 64bit
    if Int == Int64
        arch = "amd64"
    else
        arch = "x86"
    end

    # Run nmake -f Windows.mk under visual studio command prompt
    cd(@__DIR__) do
        run(`cmd /C "$vs_cmd_prompt" $arch \& nmake -f Windows.mk clean`)
        run(`cmd /C "$vs_cmd_prompt" $arch \& nmake -f Windows.mk`)
    end

    cd(joinpath(@__DIR__, "..", "test")) do
        run(`cmd /C "$vs_cmd_prompt" $arch \& nmake -f Windows.mk clean`)
        run(`cmd /C "$vs_cmd_prompt" $arch \& nmake -f Windows.mk`)
    end
else # for linux or mac
    cd(@__DIR__) do
        run(`make clean`)
        run(`make`)
    end
    cd(joinpath(@__DIR__, "..", "test")) do
        run(`make clean`)
        run(`make`)
    end
end

This, however, gives the following problem:

julia> include("C:\\Users\\Mus\\.julia\\v0.5\\CUDArt\\deps\\build.jl")
**********************************************************************
** Visual Studio 2017 Developer Command Prompt v15.0.26228.12
** Copyright (c) 2017 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'

Microsoft (R) Program Maintenance Utility Version 14.10.25017.0
Copyright (C) Microsoft Corporation.  All rights reserved.

        del /Q libwrapcuda.dll libwrapcuda.lib libwrapcuda.exp
Could Not Find C:\Users\Mus\.julia\v0.5\CUDArt\deps\libwrapcuda.dll
        del /Q utils.ptx
Could Not Find C:\Users\Mus\.julia\v0.5\CUDArt\deps\utils.ptx
**********************************************************************
** Visual Studio 2017 Developer Command Prompt v15.0.26228.12
** Copyright (c) 2017 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'

Microsoft (R) Program Maintenance Utility Version 14.10.25017.0
Copyright (C) Microsoft Corporation.  All rights reserved.

        nvcc --shared --compiler-options="/wd4819" --linker-options= wrapcuda.c -o libwrapcuda.dll
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Microsoft (R) C/C++ Optimizing Compiler Version 19.10.25017 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

tmpxft_00000a30_00000000-1.cpp
nvcc fatal   : Host compiler targets unsupported OS.
NMAKE : fatal error U1077: '"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.EXE"' : return code '0x1'
Stop.
ERROR: LoadError: failed process: Process(`cmd /C 'C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvarsall.bat' amd64 & nmake -f Windows.mk`, ProcessExited(2)) [2]
 in pipeline_error(::Base.Process) at .\process.jl:616
 in run(::Cmd) at .\process.jl:592
 in cd(::##1#4, ::String) at .\file.jl:48
 in include_from_node1(::String) at .\loading.jl:488
while loading C:\Users\Mus\.julia\v0.5\CUDArt\deps\build.jl, in expression starting on line 29

Types with CudaArray elements cannot be saved to JLD because of pointer exception

I'd like to fix this so people can save and load machine learning models that use CudaArrays without having to explicitly copy everything to the CPU. Is the right way to overload serialize/deserialize? Or is it to introduce a new CPU array type that CudaArrays know how to convert themselves to and from during load/save? Or is there some other way?
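
One hedged sketch of the second option: give JLD a plain host-side surrogate type and hook into its custom-serialization mechanism, assuming the writeas/readas hooks that JLD.jl provides (untested):

using JLD, CUDArt

immutable CudaArraySurrogate{T,N}
    data::Array{T,N}
end

JLD.writeas{T,N}(a::CUDArt.CudaArray{T,N}) = CudaArraySurrogate(CUDArt.to_host(a))
JLD.readas(s::CudaArraySurrogate) = CUDArt.CudaArray(s.data)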

Failed to install CUDArt on Windows 7

I have Julia 0.4.1 and CUDA 6.5 installed on my PC.
When I tried to install CUDArt.jl, I got the following errors:

    nvcc --shared --compiler-options="/wd4819" --linker-options= wrapcuda.c -o libwrapcuda.dll
Internal error
NMAKE : fatal error U1077: '"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\bin\nvcc.EXE"' : return code '0xc0000005'
Stop.
===============================[ ERROR: CUDArt ]================================

LoadError: failed process: Process(`cmd /C 'C:\Program Files (x86)\Microsoft Visual Studio 12.0\Common7\Tools\..\..\VC\vcvarsall.bat' amd64 & nmake -f Windows.mk`, ProcessExited(2)) [2]
while loading C:\Users\liu.julia\v0.4\CUDArt\deps\build.jl, in expression starting on line 23

Has anyone had the same problem? Any help would be appreciated.
Thank you so much.

Gengdai

clone into "CUDArt" folder instead of "CUDArt.jl"

Currently the package gets cloned into a "CUDArt.jl" directory and fails to build. It turns out the build code looks for a "CUDArt" folder. Renaming the folder after cloning is enough to allow me to build successfully.

Tests fail on Windows with 0.6

Package builds without complaints but tests fail:

<a few warnings suppressed here>

ERROR: LoadError: LoadError: AssertionError: !(isactive(pctx))
Stacktrace:
 [1] unsafe_reset!(::CUDAdrv.CuPrimaryContext, ::Bool) at C:\Users\amellnik\.julia\v0.6\CUDAdrv\src\context\primary.jl:104
 [2] reset(::CUDAdrv.CuPrimaryContext) at .\deprecated.jl:59
 [3] device_reset(::Int64) at C:\Users\amellnik\.julia\v0.6\CUDArt\src\device.jl:44
 [4] macro expansion at C:\Users\amellnik\.julia\v0.6\CUDArt\test\gc.jl:22 [inlined]
 [5] anonymous at .\<missing>:?
 [6] include_from_node1(::String) at .\loading.jl:569
 [7] include(::String) at .\sysimg.jl:14
 [8] include_from_node1(::String) at .\loading.jl:569
 [9] include(::String) at .\sysimg.jl:14
 [10] process_options(::Base.JLOptions) at .\client.jl:305
 [11] _start() at .\client.jl:371
while loading C:\Users\amellnik\.julia\v0.6\CUDArt\test\gc.jl, in expression starting on line 8
while loading C:\Users\amellnik\.julia\v0.6\CUDArt\test\runtests.jl, in expression starting on line 1
===============================[ ERROR: CUDArt ]================================

failed process: Process(`'C:\Users\amellnik\AppData\Local\Julia-0.6.0\bin\julia.exe' -Cx86-64 '-JC:\Users\amellnik\AppData\Local\Julia-0.6.0\lib\julia\sys.dll' --compile=yes --depwarn=yes --check-bounds=yes --code-coverage=none --color=yes --compilecache=yes 'C:\Users\amellnik\.julia\v0.6\CUDArt\test\runtests.jl'`, ProcessExited(1)) [1]

================================================================================
CUDArt had test errors

Stacktrace:
 [1] #test#62(::Bool, ::Function, ::Array{AbstractString,1}) at .\pkg\entry.jl:757
 [2] (::Base.Pkg.Entry.#kw##test)(::Array{Any,1}, ::Base.Pkg.Entry.#test, ::Array{AbstractString,1}) at .\<missing>:0
 [3] (::Base.Pkg.Dir.##4#7{Array{Any,1},Base.Pkg.Entry.#test,Tuple{Array{AbstractString,1}}})() at .\pkg\dir.jl:36
 [4] cd(::Base.Pkg.Dir.##4#7{Array{Any,1},Base.Pkg.Entry.#test,Tuple{Array{AbstractString,1}}}, ::String) at .\file.jl:59
 [5] #cd#1(::Array{Any,1}, ::Function, ::Function, ::Array{AbstractString,1}, ::Vararg{Array{AbstractString,1},N} where N) at .\pkg\dir.jl:36
 [6] (::Base.Pkg.Dir.#kw##cd)(::Array{Any,1}, ::Base.Pkg.Dir.#cd, ::Function, ::Array{AbstractString,1}, ::Vararg{Array{AbstractString,1},N} where N) at .\<missing>:0
 [7] #test#3(::Bool, ::Function, ::String, ::Vararg{String,N} where N) at .\pkg\pkg.jl:276
 [8] test(::String, ::Vararg{String,N} where N) at .\pkg\pkg.jl:276
 [9] include_string(::String, ::String) at .\loading.jl:515

This stacktrace is a bit cryptic, but it appears to stem from the next-to-last line in this basic test:

devlist = CUDArt.devices(dev->true)
for dev in devlist
    CUDArt.device(dev)
    p = CUDArt.malloc(UInt8, 1)
    p2 = CUDArt.malloc(UInt16, 100)
    CUDArt.free(p)
    CUDArt.free(p2)
    a = rand(5,3)
    g = CUDArt.CudaArray(a)
    gp = CUDArt.CudaPitchedArray(a)
    CUDArt.free(g)
    CUDArt.free(gp)
    # Also test finalizer calls
    g = CUDArt.CudaArray(a)
    g = CUDArt.CudaPitchedArray(a)
    CUDArt.device_reset(dev)
end

Unified Memory support

The following code reproduces the Unified Memory example from NVIDIA in Julia:
https://gist.github.com/barche/9cc583ad85dd2d02782642af04f44dd7#file-add_cudart-jl

Kernel run time is the same as with the .cu compiled with nvcc, the nvprof output I get is this:

Time(%)      Time     Calls       Avg       Min       Max  Name
 61.18%  871.81us        11  79.255us  78.689us  79.872us  julia_kernel_add_61609
 38.82%  553.09us        11  50.280us  48.832us  53.344us  julia_kernel_init_61427

I decided to attempt to make the interface a little nicer, by creating a UnifiedArray type modeled after CuDeviceArray, represented in this file together with the test:
https://gist.github.com/barche/9cc583ad85dd2d02782642af04f44dd7#file-unifiedarray-jl

Unfortunately, this runs significantly slower:

Time(%)      Time     Calls       Avg       Min       Max  Name
 56.90%  1.0317ms        11  93.792us  91.520us  100.48us  julia_kernel_add_61608
 41.03%  743.85us        11  67.622us  54.369us  77.472us  julia_kernel_init_61428
  2.07%  37.536us        55     682ns     640ns  1.1520us  [CUDA memcpy HtoD]

Comparing the @code_llvm output for the init kernel after the if shows, for the first version:

  %16 = getelementptr float, float* %1, i64 %15, !dbg !21
  %17 = getelementptr float, float* %0, i64 %15, !dbg !20
  store float 1.000000e+00, float* %17, align 8, !dbg !20, !tbaa !22
  store float 2.000000e+00, float* %16, align 8, !dbg !21, !tbaa !22
  br label %L47, !dbg !21

and for the UnifiedArray version:

  %16 = getelementptr inbounds %UnifiedArray.4, %UnifiedArray.4* %0, i64 0, i32 0, !dbg !23
  %17 = add i64 %12, -1, !dbg !23
  %18 = load float*, float** %16, align 8, !dbg !23, !tbaa !20
  %19 = getelementptr float, float* %18, i64 %17, !dbg !23
  store float 1.000000e+00, float* %19, align 8, !dbg !23, !tbaa !24
  %20 = getelementptr inbounds %UnifiedArray.4, %UnifiedArray.4* %1, i64 0, i32 0, !dbg !26
  %21 = load float*, float** %20, align 8, !dbg !26, !tbaa !20
  %22 = getelementptr float, float* %21, i64 %17, !dbg !26
  store float 2.000000e+00, float* %22, align 8, !dbg !26, !tbaa !24
  br label %L47, !dbg !26

So now for the questions:

  • Where does this difference in performance come from, and is it possible to keep the array abstraction and have it perform as well as the pointer version?
  • Are there any plans to add an array based on the Unified Memory model?
  • Are there any plans to wrap the CUDA8 functions, such as cudaMemPrefetchAsync?

p.s. great job on all these CUDA packages, this was a lot easier to set up than I had anticipated :)

CUDArt fails to build when no CUDA device is present

julia> Pkg.build("CUDArt")
INFO: Building CUDAdrv
INFO: Found libcuda at /usr/bin/../lib/libcuda.so
INFO: CUDAdrv.jl has already been built for this CUDA library, no need to rebuild
INFO: Building CUDArt
NVIDIA: no NVIDIA devices found
=================================================[ ERROR: CUDArt ]=================================================

LoadError: CUDA error 30 calling cudaRuntimeGetVersion
while loading /home/wallnuss/.julia/v0.5/CUDArt/deps/build.jl, in expression starting on line 380

Support various versions of CUDA Toolkit

CUDArt.jl currently uses API bindings generated from CUDA Toolkit 6.5 (libcudart-6.5.jl and the files in gen-6.5/),
and the CUDA 5.0 files seem to be unused (libcudart-5.0.jl and the files in gen/).

I simply tested the current 6.5 API binding with CUDA 5.0/5.5/6.0/6.5 and even the 7.0 RC.
(Yeah, I wanted to spend a boring afternoon just clicking install buttons. This is easier to test on Windows than on Linux.)
The result: the current 6.5 binding works well on all those versions, because we don't use any functions that are new in recent CUDA runtimes.

So my suggestion is:

  • Remove api bindings from CUDA 5.0 and keep only latest api binding.
  • Use find_library for libcudart with version 5.0 - 7.0 and we can say CUDArt.jl supports all these versions.
  • If there's new functionality that uses api from certain version of cuda runtime, we have to check the version of installed cuda toolkit.

What do you think?

gcc5.4.0 support

Hello, I am using Ubuntu 16.04 and the gcc version is 5.4.0. I find that CUDA v8.0.61 has removed the restriction that gcc be no newer than 5.3, so I changed build.jl line 154 to (v"8.0", v"5.4.0"). I think this makes sense so Ubuntu users can make use of your masterpiece. Thanks a lot!

`devcount()` return zero when none available?

Currently, devcount() fails with an error if there's no device available, because in that case it returns the cudaErrorNoDevice error code. It makes sense for most functions to fail when there's no device, but I'd like devcount() to return zero so I can use it to detect the presence of a CUDA-capable GPU.

Or is there another way to determine the availability of a device? (My goal is to not run the GPU-related tests of my package when there's no GPU.)

If you agree with this suggestion, I can write a PR for it; I hope it's as easy as adding that function to this list.
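
In the meantime, a hedged user-side sketch for the "should I run GPU tests?" case, which simply treats any error from devcount() as "no usable device":

function has_cuda_device()
    try
        return CUDArt.devcount() > 0
    catch
        return false   # covers cudaErrorNoDevice as well as a missing driver
    end
end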

CUDArt should not rely on `nvidia-smi` or `nvml` on Mac OSX

This is on master for CUDArt:

julia> versioninfo()
Julia Version 0.6.0-pre.beta.437
Commit 552626cc97 (2017-04-30 05:06 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.5.0)
  CPU: Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libgfortblas
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)


julia> Pkg.build("CUDArt")
INFO: Building CUDAdrv
INFO: CUDAdrv.jl has already been built for this CUDA library, no need to rebuild.
INFO: Building CUDArt
WARNING: NVML not found, resorting to nvidia-smi
===============================[ ERROR: CUDArt ]================================

LoadError: could not spawn `nvidia-smi`: no such file or directory (ENOENT)
while loading /Users/solver/.julia/v0.6/CUDArt/deps/build.jl, in expression starting on line 366

================================================================================

================================[ BUILD ERRORS ]================================

WARNING: CUDArt had build errors.

 - packages with build errors remain installed in /Users/solver/.julia/v0.6
 - build the package(s) and all dependencies with `Pkg.build("CUDArt")`
 - build a single package by running its `deps/build.jl` script

================================================================================

GCC Version On CUDA 8.0

Hi!
So the maximum gcc version that this package's build script will allow with CUDA 8.0 is 5.3.1. However, even Xenial has moved gcc-5 on to at least 5.4.0 (https://packages.ubuntu.com/xenial/gcc-5). This makes it impossible to build on anything newer than a slightly out-of-date 16.04 machine without hacking build.jl. I use 17.04, so for me it's just flat out ridiculous to try to get anything older than 5.4.1 (which works fine with CUDA).

I propose bumping the build.jl requirement from 5.3.1 to 5.4.1. Thanks!

device_reset can throw errors and not reset the device

As implemented now, all wrappers of CUDA functions check the return value for cudaSuccess and throw an error otherwise. This means that if, for some reason, a function called in device_reset (https://github.com/JuliaGPU/CUDArt.jl/blob/master/src/device.jl#L10) returns an error (e.g., a previous kernel launch by the user failed, or, as happens to me irreproducibly and so probably not a bug, one of the cuda_ptrs is "an invalid device pointer"), the cleanup code throws an error and never executes cudaDeviceReset().

Also, I believe that cleaning up pointers before cudaDeviceReset() is unnecessary, because that function's documentation says it releases all resources associated with the current device and process. So it isn't really necessary to free the pointers at all; they should be cleaned up by cudaDeviceReset.

This also means that in devices: https://github.com/JuliaGPU/CUDArt.jl/blob/master/src/device.jl#L59 the finally-clause can throw errors, and the first error interrupts the whole finally-clause, preventing the devices from being reset correctly.

The bug is that code in a finally clause should never throw errors in a way that prevents resources/devices from being cleaned up.
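
A minimal sketch of the behavior argued for above; rt.cudaDeviceReset and the module-level cuda_ptrs dict are assumptions about CUDArt's internals, and the point is only that nothing before the reset is allowed to throw:

function reset_device_robustly(dev::Integer)
    device(dev)
    empty!(CUDArt.cuda_ptrs)       # drop tracking; cudaDeviceReset releases the memory itself
    CUDArt.rt.cudaDeviceReset()    # always reached, since nothing above performs a checked CUDA call
end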

Runtime kernel compilation in CUDA 7

Reference: http://www.soa-world.de/echelon/2015/01/cuda-7-runtime-compilation.html

CUDA 7 is in release-candidate state now, and it has a very interesting feature: runtime kernel compilation.
It works just like OpenCL: we can pass kernel source as a string and get a CuFunction object back.

I think this is very good news for the CUDArt package; we could write CUDA kernels much more easily, rather than using the external nvcc compiler to produce PTX files.

Another use is metaprogramming for generating kernels.
For example, @kk49 presented an interesting concept for writing GPU code for arithmetic operations: https://github.com/kk49/julia-delayed-matrix

julia-delayed-matrix generates PTX code directly, but we could do better by generating .cu code as a string.

I'm an NVIDIA registered developer and have the fresh CUDA 7.0 RC. The first task has to be finding out how the runtime kernel compilation API works. We would also need the corresponding gen-7.0 bindings.

Gsoc proposal

Hello,

I intend to apply to Google Summer of Code with the idea Writing high-performance, multithreaded kernels for image processing, and my proposal will use CUDArt.jl to improve parallel image processing performance.

I'm writing here to ask for some feedback: what should I know about CUDArt.jl before starting work on my project, and could you help point out how I can improve the idea?

This is my project proposal; feel free to have a look and comment on the doc or here in the thread.

Thanks.

Naelson Douglas

void * type cuda malloc does not work

Allocating memory on the GPU works super well for most types, e.g.

julia> CUDArt.malloc( Cint, 10 )
CUDArt.CudaPtr{Int32}(Ptr{Int32} @0x00000013047a0000)

However, when trying to allocate memory for a void pointer (as e.g., the CUB library requires for some functions), the returned pointer is always 0:

julia> CUDArt.malloc( Void, 10 )
CUDArt.CudaPtr{Void}(Ptr{Void} @0x0000000000000000)
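
A hedged workaround sketch: the null result is plausibly because sizeof(Void) == 0, so a zero-byte allocation is requested; asking for the scratch space in bytes avoids that, and the raw pointer can be converted to Ptr{Void} at the ccall boundary (the `ptr` field name is an assumption about CudaPtr's layout):

buf = CUDArt.malloc(UInt8, 10)            # 10 bytes, non-null
voidptr = convert(Ptr{Void}, buf.ptr)     # pass this where the C API wants void*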

Cannot find library libwrapcuda

julia> using CUDArt                                                                       
INFO: Precompiling module CUDArt...                                                       
ERROR: LoadError: LoadError: Cannot find libwrapcuda                                      
 in error at ./error.jl:21                                                                
 in include at ./boot.jl:261                                                              
 in include_from_node1 at ./loading.jl:320                                                
 in include at ./boot.jl:261                                                              
 in include_from_node1 at ./loading.jl:320                                                
 [inlined code] from none:2                                                               
 in anonymous at no file:0                                                                
 in process_options at ./client.jl:257                                                    
 in _start at ./client.jl:378                                                             
while loading /home/rluser/.julia/v0.4/CUDArt/src/libcudart-6.5.jl, in expression starting
 on line 53                                                                               
while loading /home/rluser/.julia/v0.4/CUDArt/src/CUDArt.jl, in expression starting on lin
e 27                                                                                      
ERROR: Failed to precompile CUDArt to /home/rluser/.julia/lib/v0.4/CUDArt.ji              
 in error at ./error.jl:21                                                                
 in compilecache at loading.jl:400                                                        
 in require at ./loading.jl:266                                                           

julia> Libdl.find_library(["libwrapcuda"],["/home/rluser/.julia/v0.4/CUDArt/deps/"])      
""     

julia> libwrapcuda = Libdl.find_library(["libwrapcuda"],[joinpath(Pkg.dir(), "CUDArt", "de
ps")])                                                                                    
""    

julia> cd(joinpath(Pkg.dir(), "CUDArt", "deps"))

shell> ls
build.jl  libwrapcuda.so  Makefile  utils.cu  utils.ptx  Windows.mk  wrapcuda.c
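
A hedged diagnostic sketch: find_library returns "" when it cannot dlopen any candidate, which can happen even though the .so exists, e.g. when one of its own dependencies (such as libcudart) is not on the loader path. Calling dlopen with the absolute path surfaces the underlying error:

Libdl.dlopen(joinpath(Pkg.dir("CUDArt"), "deps", "libwrapcuda.so"))
# if this throws, run `ldd libwrapcuda.so` in the deps directory to see which
# dependency is missing, and add its directory to LD_LIBRARY_PATH before starting Julia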

Makefile needs to select correct gcc compiler

Plainly invoking nvcc isn't guaranteed to work:

nvcc -ptx -gencode=arch=compute_20,code=sm_20 utils.cu
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
In file included from /opt/cuda/bin/..//include/cuda_runtime.h:78:0,
                 from <command-line>:0:
/opt/cuda/bin/..//include/host_config.h:119:2: error: #error -- unsupported GNU version! gcc versions later than 5 are not supported!
 #error -- unsupported GNU version! gcc versions later than 5 are not supported!
  ^~~~~
Makefile:15: recipe for target 'utils.ptx' failed
make: *** [utils.ptx] Error 1

Luckily, I've already implemented such a mechanism as part of CUDAdrv. Maybe we should move this functionality, and the entire compilation example, over to CUDArt (since CUDAdrv doesn't require nvcc to be installed except for that example)?
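
In the meantime, a hedged sketch of the usual workaround: point nvcc at a supported host compiler explicitly with its -ccbin flag (gcc-5 here is an assumption about what happens to be installed):

run(`nvcc -ccbin=gcc-5 -ptx -gencode=arch=compute_20,code=sm_20 utils.cu`)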

Compilation error

Hello

I'm trying to get CUDArt going on an HPC platform (meaning some libraries are in non-standard places). Could anyone shed light on the error below?

julia> using CUDArt
INFO: Precompiling module CUDArt...
ERROR: LoadError: LoadError: Cannot find libwrapcuda
in error at ./error.jl:21
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
[inlined code] from none:2
in anonymous at no file:0
in process_options at ./client.jl:252
in _start at ./client.jl:375
while loading /home/mcp50/.julia/v0.5/CUDArt/src/libcudart-6.5.jl, in expression starting on line 53
while loading /home/mcp50/.julia/v0.5/CUDArt/src/CUDArt.jl, in expression starting on line 27
ERROR: Failed to precompile CUDArt to /home/mcp50/.julia/lib/v0.5/CUDArt.ji
in error at ./error.jl:21
in compilecache at loading.jl:383
in require at ./loading.jl:250

broken in Julia v0.3

When I start with a clean environment and try to add and use CUDArt on Julia 0.3, I get the following error:

julia> using CUDArt
ERROR: Ref not defined
 in include at ./boot.jl:245
 in include_from_node1 at ./loading.jl:128
 in include at ./boot.jl:245
 in include_from_node1 at ./loading.jl:128
 in reload_path at loading.jl:152
 in _require at loading.jl:67
 in require at loading.jl:51
while loading /home/nlg-05/dy_052/julia-pkgdir/v0.3/CUDArt/src/stream.jl, in expression starting on line 46
while loading /home/nlg-05/dy_052/julia-pkgdir/v0.3/CUDArt/src/CUDArt.jl, in expression starting on line 53

Can we maybe tell Julia 0.3 to use an older version of the package by default? Or use more Compat? What is the right way to keep a package compatible with multiple Julia versions? (I am asking because I need to keep a couple of packages functional on Julia 0.3 and 0.4 for a while... :)

display CU_PARAM_TR_DEFAULT in checkdrv and what is it?

When I try this function on a server with a Tesla GPU (device initialization and other setup happen outside the function), cuModuleLoad returns -1 in the console. But I didn't find this return value in cuModuleLoad's documentation (all the documented return values are non-negative).

WARNING: /home/quaninfo/rogerluo/.julia/v0.4/Quantize/src/utils/cuda/cuMatrix.ptx
ERROR: LoadError: KeyError: -1 not found
 in checkdrv at /home/quaninfo/rogerluo/.julia/v0.4/CUDArt/src/module.jl:14
 in call at /home/quaninfo/rogerluo/.julia/v0.4/CUDArt/src/module.jl:24
 in diagexp at /home/quaninfo/rogerluo/.julia/v0.4/Quantize/src/utils/cuda/cuMatrix.jl:19
 in diagexp at /home/quaninfo/rogerluo/.julia/v0.4/Quantize/src/utils/cuda/cuMatrix.jl:29
 in realtimeop! at /home/quaninfo/rogerluo/.julia/v0.4/Quantize/src/Adiabatic/timeop.jl:5
 in next_timestep! at /home/quaninfo/rogerluo/.julia/v0.4/Quantize/src/Adiabatic/timeop.jl:20
 [inlined code] from util.jl:155
 in adia at /home/quaninfo/rogerluo/cooling-12.jl:8
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:320
 in process_options at ./client.jl:280
 in _start at ./client.jl:378
while loading /home/quaninfo/rogerluo/cooling-12.jl, in expression starting on line 30
function diagexp(A::CudaArray{Complex64})
    md = CuModule("$dir/src/utils/cuda/cuMatrix.ptx", false)
    diagexp = CuFunction(md, "diagexp_cf")
    nsm = attribute(device(), rt.cudaDevAttrMultiProcessorCount)
    mul = min(32, ceil(Int, length(A)/(256*nsm)))
    expH = CudaArray(Complex64,size(A)...)
    launch(diagexp, mul*nsm, 256, (A,expH,length(A)))
    return expH
end

The CUDA documentation describes this as:

For texture references loaded into the module, use default texunit from texture reference.

But it works fine on my own laptop with a GT730M GPU.

As a newcomer to CUDA, I'm not familiar with this error/warning; does anyone know how to solve it?

Unstable performance

Disclaimer: I used matrix multiplication from CUBLAS.jl as an example operation, since CUDArt.jl doesn't provide anything like that, so the results may be biased because of it. Anyway, I'll be glad to get any pointers.

With random CudaArray and identity matrix like this:

const A = rand(1024, 256)
const Im = eye(256, 256)
const d_A = CudaArray(A)
const d_Im = CudaArray(Im)

I do several performance tests like this:

CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A)

If you are not familiar with BLAS (or just don't like cryptic names), this code multiplies d_A by the identity matrix d_Im and puts the result back into d_A. When I run the same test on the CPU, I always get very similar, consistent results. But on the GPU, the benchmarks give totally different results:

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   0.018428 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   4.238584 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   2.931953 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   3.775450 seconds (25.00 k allocations: 937.500 KB)

# after some time
 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   0.020394 seconds (25.00 k allocations: 937.500 KB)

 julia> @time for i=1:1000 CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A) end
   4.812287 seconds (25.00 k allocations: 937.500 KB)

So the first call is really fast, but all subsequent calls take ~200x longer. After you wait some time (say, 10 seconds), multiplication becomes fast again, but only for a single test, and then it slows down again.

Is this expected behavior? Am I using CudaArrays correctly at all?
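
A hedged guess at the mechanism: cuBLAS launches are asynchronous, so a fast timing mostly measures queueing the launches, while a slow one stalls on work still in flight from an earlier run. Timing with an explicit synchronization (assuming CUDArt exposes device_synchronize, a cudaDeviceSynchronize wrapper) should give stable numbers either way:

@time begin
    for i = 1:1000
        CUBLAS.gemm!('N', 'N', 1., d_A, d_Im, 0., d_A)
    end
    CUDArt.device_synchronize()   # wait for all queued work before stopping the clock
end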

cannot initialize CudaArray with Int32 size

This took me a while to figure out today, while debugging the new CUSPARSE package:

ERROR: LoadError: MethodError: `convert` has no method matching convert(::Type{CUDArt.CudaArray{T,N}}, ::Type{Int32}, ::Tuple{Int32})
This may have arisen from a call to the constructor CUDArt.CudaArray{T,N}(...),
since type constructors fall back to convert methods.
Closest candidates are:
  CUDArt.CudaArray(::Type{T}, !Matched::Integer...)
  CUDArt.CudaArray(::Type{T}, !Matched::Tuple{Vararg{Int64}})
  call{T}(::Type{T}, ::Any)
  ...

The relevant part from arrays.jl:

CudaArray(T::Type, dims::Integer...) = CudaArray(T, dims)

function CudaArray(T::Type, dims::Dims)

i.e. CudaArray(T,Int32) is happy, but CudaArray(T,(Int32,)) is not. Dims is defined strictly as Int64:

Dims => Tuple{Vararg{Int64}}

Should Dims also be Tuple{Vararg{Integer}} ? Maybe this is a Base problem...
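
A hedged user-side workaround until this is settled: widen the dimensions to Int before constructing the array, e.g.

dims32 = (Int32(256), Int32(100))           # dimensions arriving as Cint, e.g. from CUSPARSE
A = CudaArray(Float32, map(Int, dims32))    # map over the tuple yields a Tuple{Vararg{Int}}, i.e. a valid Dims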

device_reset not exported

Was this an intentional omission? You can't follow along with the README unless you explicitly import device_reset.
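
Until it is exported, the workaround is to qualify the call or import the name explicitly:

using CUDArt
CUDArt.device_reset(0)
# or: import CUDArt: device_reset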

CudaArray doesn't have a finalizer, CudaPitchedArray does

I was looking at how long a pointer created by CudaArray is guaranteed to stay alive, and I noticed that CudaArray doesn't call finalizer(..., free) on itself in its constructor, so its pointer stays alive until the array is freed explicitly. CudaPitchedArray behaves very differently: losing a reference to a CudaPitchedArray seems to deallocate the underlying pointer.

Is there a reason for implementing such different behaviours? It doesn't get described in the README, and I think it would be less confusing to have the two array types behave similarly (one way or the other). I don't know what was intended there, but I was expecting CudaArray to be finalized as well.

Another confusing thing is that malloc adds its own finalizer (https://github.com/JuliaGPU/CUDArt.jl/blob/master/src/pointer.jl#L38), but if the reference to a CudaPitchedArray gets lost, doesn't the pointer get freed regardless of whether there are still references to the pointer itself?

I'm not really saying any of this is strictly wrong, just that the rules governing malloc, free, and finalizers should be as clear as possible.
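
For what it's worth, a hedged sketch of giving a CudaArray the same lifetime behavior CudaPitchedArray reportedly has, by registering a free finalizer on it manually (using the finalizer(obj, f) form of Julia 0.4-0.6):

a = rand(5, 3)
g = CUDArt.CudaArray(a)
finalizer(g, CUDArt.free)   # free the device memory once g becomes unreachable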

error could not load library "libnvidia-ml"

julia> versioninfo()
Julia Version 0.6.0-pre.alpha.325
Commit 980119a* (2017-03-30 16:30 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> Pkg.test("CUDArt")
INFO: Testing CUDArt
ERROR: LoadError: LoadError: error compiling #devices#1: error compiling filter_free: could not load library "libnvidia-ml"
The specified module could not be found.

Stacktrace:
 [1] devices(::Function) at C:\Users\Mus\.julia\v0.6\CUDArt\src\device.jl:52
 [2] include_from_node1(::String) at .\loading.jl:539
 [3] include(::String) at .\sysimg.jl:14
 [4] include_from_node1(::String) at .\loading.jl:539
 [5] include(::String) at .\sysimg.jl:14
 [6] process_options(::Base.JLOptions) at .\client.jl:305
 [7] _start() at .\client.jl:371
while loading C:\Users\Mus\.julia\v0.6\CUDArt\test\gc.jl, in expression starting on line 7
while loading C:\Users\Mus\.julia\v0.6\CUDArt\test\runtests.jl, in expression starting on line 1
================================================================================[ ERROR: CUDArt ]=================================================================================


failed process: Process(`'C:\Julia\Julia-0.6-latest\bin\julia' -Cnative '-JC:\Julia\Julia-0.6-latest\lib\julia\sys.dll' --compile=yes --depwarn=yes --check-bounds=yes --code-coverage=none --color=yes --compilecache=yes 'C:\Users\Mus\.julia\v0.6\CUDArt\test\runtests.jl'`, ProcessExited(1)) [1]

==================================================================================================================================================================================

ERROR: CUDArt had test errors

Cannot find CUDA runtime API in Ubuntu 14.04, CUDA 6.5

Before sending a PR, I tested the code on my Linux system.
But even on the master branch, I got an error that the CUDA runtime API cannot be found.

I had to update the location of libcudart as in the commit below: /usr/local/cuda -> /usr/local/cuda/lib64

moon6pence@923e05c

libcuda has no problem loading, but I updated its location to where it really is.

OOB during package build

Building CUDArt fails. After solving the rest of the problems I encountered along the way, I now get this error, which I cannot solve:

LoadError: BoundsError: attempt to access 0-element Array{String,1} at index [1]
...

I'm not an expert on this, so I can give more details if you want, but I preferred to keep the thread as simple as possible.

Conversion error in drawing the Julia set

I'm trying to use the following code to draw the Julia set (using Julia 0.3.5):

julia.jl

using CUDArt
using PyPlot

w = 2048 * 2
h = 2048 * 2
q = [complex64(r, i) for i = 1 : -(2.0 / w) : -1, r = -1.5 : (3.0 / h) : 1.5]

julia(q :: Array{Complex64}, maxiter :: Integer) = begin
    res = CUDArt.devices(dev -> CUDArt.capability(dev)[1] >= 2, nmax = 1) do devlist
        CUDArt.device(devlist[1])

        prg = CUDArt.CuModule("julia.ptx") do md
            juliafn = CUDArt.CuFunction(md, "julia")
            out = CUDArt.CudaArray(Uint16, size(q))
            CUDArt.launch(juliafn, length(q), 1, (q, out, uint16(maxiter)))
            CUDArt.to_host(out)
        end
    end

    res
end

j = julia(q, 200)
imshow(j, extent = [-1.5, 1.5, -1.0, 1.0])

julia.cu (compiled to julia.ptx via nvcc -ptx julia.cu)

extern "C"
{
    __global__ void julia(float2* q, unsigned short* out, const unsigned short maxiter)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        float nreal = 0.0f;
        float real = q[idx].x;
        float imag = q[idx].y;

        out[idx] = 0;

        for (int i = 0; i < maxiter; i++)
        {
            if (real * real + imag * imag > 4.0f)
            {
                out[idx] = i;
            }

            nreal = real * real - imag * imag + (-0.5f);
            imag = 2 * real * imag + 0.75f;
            real = nreal;
        }
    }
}

When I try to run the code, it exits with the error 'rawpointer' has no method matching rawpointer(::Array{Complex{Float32},2}). Is this a missing convert for the type, or is there something wrong with my code? Sorry for not posting this on StackOverflow or /r/julia, but I expect people here are better informed about the peculiarities of CUDArt itself.
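
A hedged guess at the cause and a fix sketch: launch needs device-resident array arguments, and q is still a host Array, which is why rawpointer has no method for it. Copying q to the device first (inside the CuModule do-block) would look something like this:

d_q = CUDArt.CudaArray(q)                   # host -> device copy of the input grid
out = CUDArt.CudaArray(Uint16, size(q))
CUDArt.launch(juliafn, length(q), 1, (d_q, out, uint16(maxiter)))
res = CUDArt.to_host(out)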

triggering gc based on gpu memory

Is there any progress on making gc sensitive to remaining gpu memory? The following example still fails with an out-of-memory error. It works if you uncomment the manual gc() line.

using CUDArt
for i=1:1000
    a=CudaArray(Float64,1000000)
    # i%100==0 && gc()                                                                                                     
end
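
Not a real fix, but a hedged sketch of a user-side shim while the GC remains unaware of device memory: retry a failed allocation once after forcing a collection, so finalizers get a chance to release stale device buffers.

function gc_malloc(T, dims)
    try
        return CudaArray(T, dims)
    catch
        gc()                        # let finalizers run and release device memory
        return CudaArray(T, dims)
    end
end

for i = 1:1000
    a = gc_malloc(Float64, 1000000)
end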

Support for ptx modules with external functions

I'm not certain whether this issue is most appropriately situated in the CUDArt or CUDAdrv repositories or both. I'm posting in both, but will remove it from one or the other if advised so.

I am interested in the ability to compile PTX modules that include external functions and then import those as functions to use/launch from within Julia. The particular example I was recently working with involved CUBLAS functions, but the principle is far wider. I inquired about the issue on Stack Overflow here. I had thought it would be relatively manageable, but from the answer I got, it actually sounds like it is quite complex and involved. On the plus side, it does appear that there are precedents for establishing this kind of capability, e.g. with the JCUDA framework for Java.

I could potentially assist with such an implementation, but I doubt I'd be well positioned to take it on all myself.

Thoughts?

strange splatting bug

I am not sure whether this bug belongs here or in Base, or whether I should get my head checked, but this got me stumped. It does not happen when I replace CudaArrays with Arrays, so I am posting it here:

using CUDArt

function bmultest5(a::CudaArray, d::Dims)
    @show typeof(size(a))
    @show size(a)
    b = reinterpret(eltype(a), a, d)
    @show typeof(size(b))
    @show size(b)
    @show Cint[size(b)...]
end

bmultest5(CudaArray(zeros(2)), (2,1,1,1))

gives

typeof(size(a)) = Tuple{Int64}
size(a) = (2,)
typeof(size(b)) = Tuple{Int64,Int64,Int64,Int64}
size(b) = (2,1,1,1)
Cint[size(b)...] = Int32[2,1]

i.e. if size(b) has more than 2 elements, only the first two show up in the result of Cint[size(b)...]!
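
A hedged workaround until the cause is understood: build the Vector without splatting the tuple at all, which sidesteps the problematic code path.

dims = convert(Vector{Cint}, collect(size(b)))   # collect gives a Vector{Int}; convert produces the Vector{Cint}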

error in running finalizer: ErrorException("auto_unbox: unable to determine argument type")

Continuing from: https://groups.google.com/d/topic/julia-dev/NqPz4f_0VLg/discussion

I get this error intermittently on exit from Julia. I was able to trace it to the finalizer of CudaPtr (pointer.jl:44), in particular to the statement haskey(cuda_ptrs, p). Here is what I know so far:

  • If I comment out the CudaPtr finalizer (pointer.jl:38) the error disappears.
  • If I comment out haskey(cuda_ptrs, p) (pointer.jl:46) the error disappears.
  • The error does not consistently appear: Sometimes if a gc or a user-free occurs before Julia exit and calls the CudaPtr finalizer the error disappears. I am still trying to pinpoint the exact condition.
  • The debugger backtrace looks like this is happening during compilation maybe? However I am having trouble accessing the C variables in gdb so I am not sure: https://gist.github.com/denizyuret/161cf7e8b79266809a27
