OpenCL.jl's Issues

Remaining compatibility issues with OpenCL.jl and Julia v0.4

Hello,

Many thanks for developing this package! I have found it quite useful.

Now that Julia v0.4 has been released, will OpenCL.jl move to the v0.4 environment? I noticed several deprecation warnings when loading OpenCL.jl into Julia v0.4.

OpenCL 2.0 problems

Using OpenCL.Device(Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz on AMD Accelerated Parallel Processing) (AMD's OpenCL implementation reports version 2.0), I get the following:

julia> Pkg.test("OpenCL")
INFO: Computing test dependencies for OpenCL...
INFO: No packages to install, update or remove
INFO: Testing OpenCL
OpenCL.Platform
  > Platform Info
    Failure :: (line:-1) :: Platform Info :: fact was false
      Expression: v.major --> 1
        Expected: 1
        Occurred: 2
  > Platform Equality
Out of 25 total facts:
  Verified: 24
  Failed:   1
OpenCL.Context
  > OpenCL.Context constructor
  > OpenCL.Context platform properties
  > OpenCL.Context create_some_context
  > OpenCL.Context parsing
31 facts verified.
OpenCL.Device
  > Device Type
  > Device Equality
  > Device Info
142 facts verified.
OpenCL.CmdQueue
  > OpenCL.CmdQueue constructor
  > OpenCL.CmdQueue info
43 facts verified.
OpenCL.Macros
  > OpenCL.Macros version platform
  > OpenCL.Macros version device
12 facts verified.
OpenCL.Event
  > OpenCL.Event status
  > OpenCL.Event wait
  > OpenCL.Event callback
Test Callback
Test Callback
32 facts verified.
OpenCL.Program
  > OpenCL.Program source constructor
  > OpenCL.Program info
  > OpenCL.Program build
  > OpenCL.Program source code
  > OpenCL.Program binaries
44 facts verified.
OpenCL.Kernel
  > OpenCL.Kernel constructor
  > OpenCL.Kernel info
  > OpenCL.Kernel mem/workgroup size
  > OpenCL.Kernel set_arg!/set_args!
  > OpenCL.Kernel enqueue_kernel
74 facts verified.
INFO: ======================================================================
                              Running Behavior Tests
      ======================================================================
OpenCL Hello World Test
2 facts verified.
OpenCL Low Level Api Test
2 facts verified.
OpenCL Struct Buffer Test

signal (11): Segmentation fault
unknown function (ip: 0x7f702310d7bf)
unknown function (ip: 0x7f70230f083b)
unknown function (ip: 0x7f70230f09cf)
unknown function (ip: 0x7f7023087c99)
unknown function (ip: 0x7f7022867182)
unknown function (ip: 0x7f702286842a)
unknown function (ip: 0x7f70228685fc)
unknown function (ip: 0x7f702283ce7a)
unknown function (ip: 0x7f702286e8be)
unknown function (ip: 0x7f7022873236)
unknown function (ip: 0x7f702287a537)
unknown function (ip: 0x7f7022873c56)
unknown function (ip: 0x7f702286603e)
unknown function (ip: 0x7f702286842a)
unknown function (ip: 0x7f70228685fc)
jl_trampoline at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
anonymous at /home/nstiurca/.julia/v0.4/OpenCL/test/test_behaviour.jl:267
facts at /home/nstiurca/.julia/v0.4/FactCheck/src/FactCheck.jl:448
anonymous at /home/nstiurca/.julia/v0.4/OpenCL/test/test_behaviour.jl:246
unknown function (ip: 0x7f70228a9c4b)
unknown function (ip: 0x7f70228aa879)
jl_load at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
include at ./boot.jl:261
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:304
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
unknown function (ip: 0x7f70228956b3)
unknown function (ip: 0x7f7022894af1)
unknown function (ip: 0x7f70228a9b88)
unknown function (ip: 0x7f70228aa2f2)
unknown function (ip: 0x7f70228a9f45)
unknown function (ip: 0x7f70228aa879)
jl_load at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
include at ./boot.jl:261
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:304
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
process_options at ./client.jl:284
_start at ./client.jl:378
unknown function (ip: 0x7f701f8d5329)
jl_apply_generic at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)
unknown function (ip: 0x401b09)
unknown function (ip: 0x4016df)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401725)
unknown function (ip: (nil))
==============================================================================================================[ ERROR: OpenCL ]==============================================================================================================

failed process: Process(`/usr/bin/julia --check-bounds=yes --code-coverage=none --color=yes /home/nstiurca/.julia/v0.4/OpenCL/test/runtests.jl`, ProcessSignaled(11)) [0]

=============================================================================================================================================================================================================================================
INFO: No packages to install, update or remove
ERROR: OpenCL had test errors
 in error at ./error.jl:21
 in test at pkg/entry.jl:803
 in anonymous at pkg/dir.jl:31
 in cd at file.jl:22
 in cd at pkg/dir.jl:31
 in test at pkg.jl:71

I'm not sure how to approach debugging/fixing this, so any ideas are welcome.

PS: do we care about OpenCL 2.0 support for now?

Stack overflow on show/print of program

I found this error when running the README's example on Ubuntu 13.10 with an AMD GPU. The only error is in showing the program; the example itself succeeds :)

julia> p = cl.Program(ctx, source=sum_kernel) |> cl.build!
Evaluation succeeded, but an error occurred while showing value of type Program:
ERROR: stack overflow
 in show at /home/dzea/.julia/v0.2/OpenCL/src/program.jl:19

Installation of OpenCL on Travis

I'm trying to fix tests on Travis for CLBLAS.jl, which requires OpenCL to be installed. This package passes tests without any problem, but I can't grasp what in .travis.yml (or elsewhere) ensures OpenCL is in place. Any hints?

Unhandled task failure on v0.5

I have a 5K retina iMac with OpenCL.Device(AMD Radeon R9 M295X Compute Engine on Apple @0x0000000001021c00). I've started getting errors when I build kernels:

ERROR (unhandled task failure): OpenCL Error: OpenCL.Context error: ȧJ?
in raise_context_error(::String, ::String) at /Users/sjbespa/.julia/v0.5/OpenCL/src/context.jl:56
in macro expansion at /Users/sjbespa/.julia/v0.5/OpenCL/src/context.jl:97 [inlined]
in (::OpenCL.##45#48)() at ./task.jl:309

Running Pkg.test("OpenCL") results in about half a dozen of these errors, with Julia updated as of this morning.

Apparent OpenCL.read error

Consider the code:

using OpenCL
device, context, queue = OpenCL.create_compute_context()

a = rand(Float32, 125356789)
abuf = OpenCL.Buffer(Float32, context, (:r, :copy), hostbuf=a)
b = OpenCL.read(queue, abuf)
isapprox(a, b)

You can clearly see that upon reading abuf back into host memory, the last 50 or so coordinates are zeroed out. The vector a is about half a gigabyte, which is well within both my host's and device's memory. What is going on here? Is this caused by the wrapper, or is it a bug in OpenCL itself?
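A diagnostic sketch (not a fix) that may help narrow this down: locate the first diverging index and its byte offset, to see whether the cutoff sits on a suspicious boundary. It assumes a and b from the snippet above.

# Find where the host copy and the read-back first diverge.
first_bad = findfirst(i -> a[i] != b[i], 1:length(a))
if first_bad > 0
    println("first divergence at index $first_bad ",
            "(byte offset $((first_bad - 1) * sizeof(Float32)))")
else
    println("no divergence found")
end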

Memory leak

Hey,

I am encountering a memory leak using OpenCL.jl.

Execute the following in the REPL

import OpenCL; const cl = OpenCL
function leaky()
          device, ctx, queue = cl.create_compute_context()
          # Allocate buffer
          a = cl.Buffer(Float64, ctx, :rw, 1024*1024)
          cl.flush(queue)
          cl.release!(queue)
          cl.release!(ctx)
end

for i in 1:100000
   leaky()
end

gc()

Monitoring my process with htop, I can see that Julia is now using ~400 MB permanently; running the loop a second time results in a further increase of ~300 MB.
This 300 MB step is consistent over consecutive repetitions.

This happens to me with both the Intel (CPU) and the NVIDIA OpenCL implementations (although for the latter I don't have numbers available).

I am running Julia version 0.3.0-prerelease+2000 Commit 48ac426* (1 day old master) and OpenCL.jl current master.

If I try to manually release the Buffer with cl.release! I get an error about double freeing the Buffer.

Discussion on the mailing list: https://groups.google.com/forum/?fromgroups=#!topic/julia-dev/SRz6OkQ5hMw

Best,
Valentin

Compilation Error

Hi,

I'm using Intel OpenCL drivers on win7 machine. When i try to run the example program given, seeing the following error:

"error compiling create_some_context: error compiling platforms: could not load module libOpenCL: The specified module could not be found."

Pls let me know if you need additional information.

Thanks.

Rename inner `api` module to `Api`

By convention, modules in Julia always start with an uppercase letter, so this has always felt out of place. It would be a breaking change, though, so maybe it is not worth it.

Use 0.4 machinery

After dropping support for v0.3 we should take advantage of the 0.4 machinery. Using Refs instead of arrays of size one, and call overloading, are obvious candidates. Anything else?
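For instance, the contrast for an out-parameter in a raw ccall (a sketch against the standard clGetPlatformIDs signature; the library name "libOpenCL" is an assumption here and not the package's actual loading logic):

# 0.3 style: a one-element array as the out-parameter
nplatforms = Array(Cuint, 1)
ccall((:clGetPlatformIDs, "libOpenCL"), Cint,
      (Cuint, Ptr{Void}, Ptr{Cuint}), 0, C_NULL, nplatforms)

# 0.4 style: a Ref conveys the intent directly
nplatforms = Ref{Cuint}(0)
ccall((:clGetPlatformIDs, "libOpenCL"), Cint,
      (Cuint, Ptr{Void}, Ptr{Cuint}), 0, C_NULL, nplatforms)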

Test fail on Intel CPU

Running the tests at 76ae49b results in the following output on the Intel SDK OpenCL implementation.

OpenCL.Event 

Failure :: (line:33) :: OpenCL.Event wait :: got :submitted
:(mkr_evt[:status]) => :(:queued)

Failure :: (line:34) :: OpenCL.Event wait :: got 0x00000002
:(cl.cl_event_status(mkr_evt[:status])) => :(cl.CL_QUEUED)

Test Callback
Out of 16 total facts:
  Verified: 14
  Failed:   2
  Errored:  0

Does this package support Float64?

I'm trying to run the example from the README with float replaced by double in the kernel (and Float32 replaced by Float64 accordingly). On the line:

p = cl.Program(ctx, source=sum_kernel) |> cl.build!

I'm getting an error:

ERROR: LoadError: CLError(code=-11, CL_BUILD_PROGRAM_FAILURE)
  [inlined code] from /home/<username>/.julia/v0.4/OpenCL/src/macros.jl:6
  in build! at /home/<username>/.julia/v0.4/OpenCL/src/program.jl:81
  in |> at operators.jl:198

As far as I understand, support for doubles must be enabled separately in OpenCL, and I can see examples of how to do it in C. But is it possible/needed to do this in OpenCL.jl?

Note: I'm using an NVIDIA GeForce GT 630M video card, which itself does support Float64 (as verified by an analogous CUDA program).
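For reference, enabling fp64 is usually done inside the kernel source itself via a pragma; a minimal sketch, assuming the device advertises the cl_khr_fp64 extension:

const sum_kernel_f64 = "
   #pragma OPENCL EXTENSION cl_khr_fp64 : enable
   __kernel void sum(__global const double *a,
                     __global double *b)
{
      int gid = get_global_id(0);
      b[gid] = a[gid] + b[gid];
}
"
p = cl.Program(ctx, source=sum_kernel_f64) |> cl.build!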

Serialization of OpenCL pointers

I am trying to use OpenCL.jl in a k-fold crossvalidation scheme. In this scheme, the GPU will house some data that are common to all k folds. I have a wrapper function that loads all of the data, allocates the (constant) GPU buffers, and compiles the OpenCL program/kernels. The wrapper then uses @spawn to farm each fold to its own CPU core. Each fold will then use the (constant) GPU data, plus some GPU buffers private to the fold.

When I tried to run this code, I ran into an unusual error:

ERROR: LoadError: cannot serialize a pointer
in serialize at serialize.jl:418
in serialize at serialize.jl:127
in serialize at serialize.jl:300
in serialize at serialize.jl:418
in send_msg_ at multi.jl:222
in remotecall at multi.jl:710
in remotecall at multi.jl:714
...

Something like this has arisen in other parallel computing schemes before; see JuliaLang/julia#5954, JuliaOpt/NLopt.jl#12, JuliaLang/julia#3643.

I think the problem is that the pointers to the GPU buffers cannot be serialized automatically. Is this difficult to do? I would do it myself, but I am clueless about what serializing a pointer actually means. :-(
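A workaround sketch that sidesteps serialization entirely: ship only host data to the workers and let each fold build its own context and buffers (run_fold and its body are illustrative names, not part of the package):

addprocs(4)                       # e.g. one worker per fold
@everywhere import OpenCL
@everywhere function run_fold(fold_y::Vector{Float32})
    device, ctx, queue = OpenCL.create_compute_context()
    y_buff = OpenCL.Buffer(Float32, ctx, (:r, :copy), hostbuf=fold_y)
    # ... build the program/kernels here and run them against y_buff ...
    return OpenCL.read(queue, y_buff)
end
y = rand(Float32, 10_000)
results = [remotecall_fetch(w, run_fold, y) for w in workers()]

The GPU data common to all folds would then be uploaded once per worker rather than shared, trading some memory for not having to serialize device handles.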

Cssize_t and Julia 0.4

Hi,
I just recently noticed that the conversion in types.jl:143 is wrong, as Julia now throws an error if something overflows:
cl_context_properties(x) = convert(CL_context_properties, x)
It should rather be something like this:
cl_context_properties(x::Ptr{Void}) = reinterpret(CL_context_properties, x), or in 0.4:
cl_context_properties(x::Integer) = uint64(x) % CL_context_properties
since intptr_t actually allows overflow, as long as the value can be converted back to the same Ptr{Void}.
I haven't opened a pull request, as I first want to discuss this and search for further similar issues.
Best,
Simon

Documentation

Now that we have a shiny new documentation system, we should put it to use.

Image support

Thanks for the great OpenCL package; it works out of the box for me on Windows 8.1 (using CUDA) and Ubuntu 14.04 (using Intel OpenCL and the AMD APP SDK).

I have used PyOpenCL before for GPU image processing; typical PyOpenCL code to create an RGBA image buffer on the device looks like this:

    iform = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
    im_buffer = im.tostring()
    cl_img = cl.Image(context, cl.mem_flags.READ_ONLY |
        cl.mem_flags.COPY_HOST_PTR, iform, self.im.size, None, im_buffer)

I have not had success trying to load an image onto the device using OpenCL.jl. I can create the image format object using:

using Images
import OpenCL
const cl = OpenCL

function load_image(image_pathname)
    img = imread(image_pathname)
    img = convert(Array, img)
    img
end

img = load_image("test.jpg")

device = first(cl.devices())
iform = cl.CL_image_format(cl.CL_RGBA, cl.CL_UNSIGNED_INT8)

But I cannot find an image class like PyOpenCL's.

If it is already supported, could you add a simple example that loads an image onto the device, please?

CLArray testing problem

Besides the OOM problem on Travis, I am also running into this:

  > OpenCL.CLArray transpose
ERROR: LoadError: LoadError: CLError(code=-30, CL_INVALID_VALUE)
 [inlined code] from /home/wallnuss/.julia/v0.4/OpenCL/src/macros.jl:6
 in enqueue_kernel at /home/wallnuss/.julia/v0.4/OpenCL/src/kernel.jl:5
 in transpose! at /home/wallnuss/.julia/v0.4/OpenCL/src/array.jl:106
 in anonymous at /home/wallnuss/.julia/v0.4/OpenCL/test/test_array.jl:59

on an OpenCL.Device(NVS 4200M on NVIDIA CUDA)

OpenCL 1.2 Functions

Throw an error when trying to access v1.2 functions with platforms that don't support them.
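One possible shape for such a guard, sketched under the assumption that cl.opencl_version(platform) exists (it is referenced elsewhere in this tracker) and that platform info is readable via p[:name]:

# Hypothetical helper; the name is illustrative, not the package's API.
function require_v12(p::cl.Platform)
    if cl.opencl_version(p) < v"1.2"
        error("OpenCL 1.2 required, but platform $(p[:name]) ",
              "only supports $(cl.opencl_version(p))")
    end
end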

Event based OpenCL programming

So I am fairly new to OpenCL, but as I kept reading best practices, it seemed to me that one should definitely use an event-based programming flow to optimize usage of the compute device.

From what I can see, Julia's RemoteRef also encapsulates an event-based programming style (at least that is the impression I got).

Currently, calling a kernel in OpenCL.jl is a blocking operation. I propose that we use RemoteRefs to encapsulate Buffers and hide the complexity of event-based computation.

My dream would be to write something like this.

A = Buffer(Float64, ctx, :rw, 1024)
B = Buffer(Float64, ctx, :rw, 1024)
C = Buffer(Float64, ctx, :rw, 1024)

fill!(out=A, 3.0) # A is ready when this operation is done.

op_a!(out=B, A) # A is an input buffer, so op_a! has to wait on fill!
op_a!(out=C, A) # also depends on A, but since A is only read here, these two are independent of each other

sum!(out=A, B, C) # waits on A, B, and C: since A is an output, all previous reads must finish before A can be written again, and B and C must be fully written before being used as inputs.

This depends on having semantics for issuing kernel calls, but OpenCL already provides this: __global const A is an input buffer and __global B is an output buffer.

@jakebolewski what do you think?

Program build_log is broken

Lines 122 and 129 in program.jl are broken. The calls to clGetProgramBuildInfo should be qualified with api instead of cl.
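A sketch of the described fix (the .id handle fields and the surrounding arguments are assumptions about the wrapper's internals, shown only to illustrate the qualification change):

# was: cl.clGetProgramBuildInfo(...)  -- undefined in the cl namespace
log_size = Ref{Csize_t}(0)
api.clGetProgramBuildInfo(p.id, dev.id, cl.CL_PROGRAM_BUILD_LOG,
                          0, C_NULL, log_size)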

Program and Kernel are leaking memory

This issue is slightly different from #21, and to minimize confusion I decided to open a second issue.

The following code leaks memory.
Note: the Buffer creation is only there to make the leak abundantly clear in the memory output.

import OpenCL; const cl = OpenCL
size = 1024
function leaky()
          device, ctx, queue = cl.create_compute_context()
          # Allocate buffer
          a = cl.Buffer(Float64, ctx, :rw, size*size)
          b = cl.Buffer(Float64, ctx, :rw, size*size)
          out = cl.Buffer(Float64, ctx, :rw, size*size)
          p = cl.Program(ctx, source=simpleKernel) |> cl.build!
          return nothing
end

function leak()
    for i in 1:1000
       leaky()
    end
end


simpleKernel =  "
        #define number float

        __kernel void add(
                      __global const number *a,
                      __global const number *b,
                      __global number *out) {

        int i = get_global_id(0);

        out[i] = a[i] + b[i];
    }
"

Substituting leaky with one of the following fixes the memory leak:

function leaky()
          ....
          p = cl.Program(ctx, source=simpleKernel)
          return nothing
end

or

function leaky()
          ...
          p = cl.Program(ctx, source=simpleKernel) |> cl.build!
          cl.release!(p)
          return nothing
end

Adding a Kernel into the mix creates the leak again:

function leaky()
          ...
          p = cl.Program(ctx, source=simpleKernel) |> cl.build!
          k = cl.Kernel(p, "add")
          cl.release!(p)
          return nothing
end

and manually releasing the Kernel fixes the memory leak:

function leaky()
          ...
          p = cl.Program(ctx, source=simpleKernel) |> cl.build!
          k = cl.Kernel(p, "add")
          cl.release!(k)
          cl.release!(p)
          return nothing
end

My intuition is that in this case we don't have the problem of #21, where we retain too often, but something different going on.

Please note that the last example will leak memory without the changes in 721c962.

Since the memory leak only appears when the program is built, I thought that maybe clBuildProgram increases the ref count by one; but issuing a second clReleaseProgram after building resulted in a segmentation fault, and since the manual release works, it may have something to do with the order in which the program and kernel are released.

OpenCL.wait and type stability

Is it possible to enforce type stability when wrapping kernel calls with OpenCL.wait? The inferred return type of OpenCL.wait is Union{Array{OpenCL.CLEvent,1},OpenCL.CLEvent}.

I can try to illustrate using part of my code. The PLINK.PlinkGPUVariables type is a container object with the GPU buffer y_buff as one field. copy_y! simply copies a CPU SharedVector y to y_buff. Float is a typealias for Union{Float32, Float64}.

Function

function copy_y!{T <: Float}(v::PlinkGPUVariables{T}, y::SharedVector{T})
    cl.wait(cl.copy!(v.queue, v.y_buff, sdata(y)))
end

Check type stability

@time 1+1 # compile @time
@code_warntype copy_y!(v, y)
copy_y!(v, y) # compile copy_y!
@time copy_y!(v, y)

Output

Variables:
  v::PLINK.PlinkGPUVariables{Float64,OpenCL.Buffer{Float64}}
  y::SharedArray{Float64,1}

Body:
  begin  # /Users/kkeys/.julia/v0.4/PLINK/src/gpu.jl, line 231:
      return (OpenCL.wait)((OpenCL.copy!)((top(getfield))(v::PLINK.PlinkGPUVariables{Float64,OpenCL.Buffer{Float64}},:queue)::OpenCL.CmdQueue,(top(getfield))(v::PLINK.PlinkGPUVariables{Float64,OpenCL.Buffer{Float64}},:y_buff)::OpenCL.Buffer{Float64},(top(getfield))(y::SharedArray{Float64,1},:s)::Array{Float64,1})::ANY)::UNION{ARRAY{OPENCL.CLEVENT,1},OPENCL.CLEVENT}
end::UNION{ARRAY{OPENCL.CLEVENT,1},OPENCL.CLEVENT}
  0.000478 seconds (9 allocations: 288 bytes)
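One low-tech workaround, assuming callers don't need the returned event: pin the return value so the Union stays inside copy_y! instead of propagating to callers (a sketch, not necessarily the best fix):

function copy_y!{T <: Float}(v::PlinkGPUVariables{T}, y::SharedVector{T})
    cl.wait(cl.copy!(v.queue, v.y_buff, sdata(y)))
    return nothing  # callers now infer Void instead of the event Union
end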

`payload` not defined in `ctx_notify_err()`

Probably just a typo, but still better to track it somewhere until somebody finds time to fix it.

From build on Travis (Julia 0.4 on Mac):

Error :: (line:505) :: OpenCL.CLArray core functions
      Expression: cl.to_host(A') --> cl.to_host(B)
      UndefVarError: payload not defined
       in ctx_notify_err at /Users/travis/.julia/v0.4/OpenCL/src/context.jl:41
       in clEnqueueNDRangeKernel at /Users/travis/.julia/v0.4/OpenCL/src/api.jl:22
       [inlined code] from /Users/travis/.julia/v0.4/OpenCL/src/macros.jl:4
       in enqueue_kernel at /Users/travis/.julia/v0.4/OpenCL/src/kernel.jl:191
       in transpose! at /Users/travis/.julia/v0.4/OpenCL/src/array.jl:106
       in ctranspose at operators.jl:159
       in anonymous at /Users/travis/.julia/v0.4/FactCheck/src/FactCheck.jl:271
       in do_fact at /Users/travis/.julia/v0.4/FactCheck/src/FactCheck.jl:333
       [inlined code] from /Users/travis/.julia/v0.4/FactCheck/src/FactCheck.jl:271
       in anonymous at /Users/travis/.julia/v0.4/OpenCL/test/test_array.jl:41
       in context at /Users/travis/.julia/v0.4/FactCheck/src/FactCheck.jl:474
       in anonymous at /Users/travis/.julia/v0.4/OpenCL/test/test_array.jl:40
       in facts at /Users/travis/.julia/v0.4/FactCheck/src/FactCheck.jl:448
       in include at /Users/travis/julia/lib/julia/sys.dylib
       in include_from_node1 at /Users/travis/julia/lib/julia/sys.dylib
       in include at /Users/travis/julia/lib/julia/sys.dylib
       in include_from_node1 at /Users/travis/julia/lib/julia/sys.dylib
       in process_options at /Users/travis/julia/lib/julia/sys.dylib
       in _start at /Users/travis/julia/lib/julia/sys.dylib

Which leads us to this line (context.jl:41).

Support of Julia v0.2

Since Travis is now only testing v0.3 and v0.4-dev, we should probably branch off a stable release for v0.2 and make v0.3 a requirement. I would do this before we introduce any new changes.

@jakebolewski Any thoughts?

`performance.jl` on [Intel(R) Core(TM)2 Duo CPU L9600 @ 2.13GHz] [Mac OS X 10.7.5]

The performance.jl example produces an invalid work group size error on an Intel(R) Core(TM)2 Duo CPU L9600 @ 2.13GHz under Mac OS X 10.7.5:

> julia performance.jl
Size of test data: 128 MB
Julia Execution time: 0.2926 seconds
====================================================
Platform name:    Apple
Platform profile: FULL_PROFILE
Platform vendor:  Apple
Platform version: OpenCL 1.1 (Aug 10 2012 19:59:48)
----------------------------------------------------
Device name: Intel(R) Core(TM)2 Duo CPU L9600 @ 2.13GHz
Device type: cpu
Device mem: 4096 MB
Device max mem alloc: 1024 MB
Device max clock freq: 2130 MHZ
Device max compute units: 2
Device max work group size: 1024
Device max work item size: (1024,1,1)
ERROR: OpenCL.Context error: [CL_INVALID_WORK_GROUP_SIZE] : OpenCL Error : clEnqueueNDRangeKernel failed: total work group size (256) is greater than the device can support (128)
 in raise_context_error at /Users/lucas/.julia/OpenCL/src/context.jl:46
 in ctx_notify_err at /Users/lucas/.julia/OpenCL/src/context.jl:38
 in clEnqueueNDRangeKernel at /Users/lucas/.julia/OpenCL/src/api.jl:18
 in enqueue_kernel at /Users/lucas/.julia/OpenCL/src/kernel.jl:26
 in call at /Users/lucas/.julia/OpenCL/src/kernel.jl:137
 in cl_performance at /Users/lucas/.julia/OpenCL/examples/performance.jl:86
 in include at boot.jl:238
 in include_from_node1 at loading.jl:114
 in process_options at client.jl:303
 in _start at client.jl:389
at /Users/lucas/.julia/OpenCL/examples/performance.jl:98

Error in Buffer for Julia Set example

I get this ERROR with an AMD Radeon™ HD 7870 [ http://www.amd.com/us/products/desktop/graphics/7000/7870/Pages/radeon-7870.aspx#3 ] on Ubuntu 13.10:

julia> function julia_opencl(q::Array{Complex64}, maxiter::Int64)
           ctx   = cl.Context(cl.devices()[1])
           queue = cl.CmdQueue(ctx)

           out = Array(Uint16, size(q))

           q_buff = cl.Buffer(Complex64, ctx, (:r, :copy), hostbuf=q)
           o_buff = cl.Buffer(Uint16, ctx, :w, sizeof(out))

           prg = cl.Program(ctx, source=julia_source) |> cl.build!
           k = cl.Kernel(prg, "julia")

           cl.call(queue, k, length(q), nothing, q_buff, o_buff, uint16(maxiter))
           cl.copy!(queue, out, o_buff)

           return out
       end
julia_opencl (generic function with 1 method)

julia> @time m = julia_opencl(q, 200);
ERROR: CLError(code=-61, CL_INVALID_BUFFER_SIZE)
 in Buffer at /home/dzea/.julia/v0.2/OpenCL/src/buffer.jl:121
 in Buffer at /home/dzea/.julia/v0.2/OpenCL/src/buffer.jl:78
 in Buffer at /home/dzea/.julia/v0.2/OpenCL/src/buffer.jl:44
 in julia_opencl at none:7

CL_PLATFORM_NOT_FOUND_KHR

I am on Antergos Linux:

using OpenCL
OpenCL.create_compute_context()

ERROR: CLError(code=-1001, CL_PLATFORM_NOT_FOUND_KHR)
[inlined code] from /home/juliohm/.julia/v0.4/OpenCL/src/macros.jl:6
in platforms at /home/juliohm/.julia/v0.4/OpenCL/src/platform.jl:21
in create_some_context at /home/juliohm/.julia/v0.4/OpenCL/src/context.jl:207
in create_compute_context at /home/juliohm/.julia/v0.4/OpenCL/src/util.jl:2

The hardware I have:

lspci

00:00.0 Host bridge: Intel Corporation Skylake Host Bridge/DRAM Registers (rev 08)
00:02.0 VGA compatible controller: Intel Corporation Skylake Integrated Graphics (rev 07)
00:04.0 Signal processing controller: Intel Corporation Skylake Processor Thermal Subsystem (rev 08)
00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1 (rev 21)
00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Device 9d10 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #6 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Device 9d18 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-LP LPC Controller (rev 21)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
3a:00.0 Network controller: Broadcom Corporation BCM4350 802.11ac Wireless Network Adapter (rev 08)
3b:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
3c:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller (rev 01)

Data transfer to and from OpenCL

As far as I know, Julia uses the Fortran convention of storing arrays in column-major format, whereas OpenCL comes from a C tradition with row-major format (although OpenCL itself does not support multidimensional arrays).

I normally have OpenCL code like this:

#define A(x,y) a[x*D1 + y]
#define B(x,y) b[x*D1 + y]
#define C(x,y) c[x*D1 + y]

__kernel void add(
                  __global const float *a,
                  __global const float *b,
                  __global float *c) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    C(i, j) = A(i, j) + B(i, j);
}

My question/issue is the following:

When I copy an array from Julia to a Buffer, the buffer itself is still in column-major format, right?

In the code above it doesn't matter much, but if you do more interesting things, like looking at the Moore neighborhood, and port your code straight from Julia to OpenCL, you run into issues.

As an example, with

#define A(x,y) a[x*D1 + y]
M_pot1 - M_pot1CL sumabs = 1398.337481118991
W_pot - W_potCL sumabs = 20405.633784834685
WARNING: M_pot and M_potCL diverge on 1600 points
WARNING: W_pot and W_potCL diverge on 1600 points

and when I change the access pattern to column-major

#define A(x,y) a[y*D2 + x]
M_pot1 - M_pot1CL sumabs = 298.57719254903964
W_pot - W_potCL sumabs = 21340.42537595309
WARNING: M_pot and M_potCL diverge on 373 points
WARNING: W_pot and W_potCL diverge on 373 points

Comment: the rest of the divergence comes either from FP math or from an error on my side.

How do you deal with this issue when you generate code?
When I create a buffer on the GPU, fill it there, and copy it back into Julia, in what order do the elements arrive?

It would be nice if you could provide an extended explanation or documentation of this issue.

Otherwise, I just love OpenCL.jl, and I look forward to the moment when we can generate OpenCL programs from Julia code.
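For what it's worth, my understanding (a sketch worth verifying on your setup): copy! moves raw bytes, so the device buffer holds exactly Julia's column-major linear order on upload and on read-back, and the kernel macro should therefore use the number of rows as the leading stride:

# A[i, j] (1-based) lives at linear index (i-1) + (j-1)*size(A, 1); the
# buffer sees that same byte order.
A = reshape(collect(Float32, 1:12), 3, 4)   # 3 rows, 4 columns
a_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
# Matching 0-based kernel access, with D1 = number of rows:
#   #define A(x, y) a[(x) + (y)*D1]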

Current Event callback uses undefined behaviour

I learned today [1] that the solution I implemented for user callbacks in [2] relies on undefined/illegal behaviour.
I am quite amazed that it worked as long as it did...

code_llvm(OpenCL.event_notify, (OpenCL.CL_event, OpenCL.CL_int, Ptr{Void}))

%7 = load %jl_value_t*** @jl_pgcstack, align 8

I am not quite sure how to solve this problem without reimplementing Base.SingleAsyncWork, since we can't pass along evt_id::CL_event, status::CL_int.

@jakebolewski Any ideas?

[1] https://groups.google.com/forum/#!topic/julia-dev/9J1GYfCyVpE
[2]

OpenCL.jl/src/event.jl

Lines 97 to 106 in f529546

function event_notify(evt_id::CL_event, status::CL_int, julia_func::Ptr{Void})
# Obtain the Function object from the opaque pointer
callback = unsafe_pointer_to_objref(julia_func)::Function
# In order to callback into the Julia thread create an AsyncWork package.
cb_packaged = Base.SingleAsyncWork(data -> callback(evt_id, status))
# Use uv_async_send to notify the main thread
ccall(:uv_async_send, Void, (Ptr{Void},), cb_packaged.handle)
end

Build on OS X 10.7.5

Hi,

Thanks for adding the changes to support OpenCL 1.1; they have improved the number of passing tests.

Please find below the output of ./run_tests.sh on OS X 10.7.5 (OpenCL 1.1) at version 809d438. I can look at the issues more as I get time.

> ./run_tests.sh

OpenCL.Platform

13 facts verified.


OpenCL.Context

19 facts verified.


OpenCL.Device

141 facts verified.


OpenCL.CmdQueue

Failure :: (line:25) :: OpenCL.CmdQueue constructor :: got (true,"error")
:(@throws_pred cl.CmdQueue(ctx,flag)) => :(false,"no error")

Failure :: (line:26) :: OpenCL.CmdQueue constructor :: got (true,"error")
:(@throws_pred cl.CmdQueue(ctx,device,flag)) => :(false,"no error")

Failure :: (line:25) :: OpenCL.CmdQueue constructor :: got (true,"error")
:(@throws_pred cl.CmdQueue(ctx,flag)) => :(false,"no error")

Failure :: (line:26) :: OpenCL.CmdQueue constructor :: got (true,"error")
:(@throws_pred cl.CmdQueue(ctx,device,flag)) => :(false,"no error")

Failure :: (line:25) :: OpenCL.CmdQueue constructor :: got (true,"error")
:(@throws_pred cl.CmdQueue(ctx,flag)) => :(false,"no error")

Failure :: (line:26) :: OpenCL.CmdQueue constructor :: got (true,"error")
:(@throws_pred cl.CmdQueue(ctx,device,flag)) => :(false,"no error")

Failure :: (line:25) :: OpenCL.CmdQueue constructor :: got (true,"error")
:(@throws_pred cl.CmdQueue(ctx,flag)) => :(false,"no error")

Failure :: (line:26) :: OpenCL.CmdQueue constructor :: got (true,"error")
:(@throws_pred cl.CmdQueue(ctx,device,flag)) => :(false,"no error")

Out of 53 total facts:
  Verified: 45
  Failed:   8
  Errored:  0


OpenCL.Event

INFO: skipping user event callback for Apple version < 1.2
10 facts verified.


OpenCL.Buffer

Failure :: (line:69) :: OpenCL.Buffer constructors :: got (false,"no error")
:(@throws_pred cl.Buffer(Float32,ctx,|(cl.CL_MEM_USE_HOST_PTR,cl.CL_MEM_ALLOC_HOST_PTR),hostbuf=testarray)) => :(true,"error")

Failure :: (line:69) :: OpenCL.Buffer constructors :: got (false,"no error")
:(@throws_pred cl.Buffer(Float32,ctx,|(cl.CL_MEM_USE_HOST_PTR,cl.CL_MEM_ALLOC_HOST_PTR),hostbuf=testarray)) => :(true,"error")

Out of 444 total facts:
  Verified: 442
  Failed:   2
  Errored:  0


OpenCL.Program

34 facts verified.


OpenCL.Kernel

ERROR: OpenCL.Context error: OpenCL Build Warning : Compiler build log:
<program source>:8:15: warning: comparison of integers of different signs: 'int' and 'const unsigned int'
      if (gid < count) {
          ~~~ ^ ~~~~~


 in raise_context_error at /Users/lucas/.julia/OpenCL/src/context.jl:46
 in ctx_notify_err at /Users/lucas/.julia/OpenCL/src/context.jl:38
 in build! at /Users/lucas/.julia/OpenCL/src/program.jl:26
 in anonymous at no file:36
 in context at /Users/lucas/.julia/FactCheck/src/FactCheck.jl:300
 in anonymous at no file:26
 in facts at /Users/lucas/.julia/FactCheck/src/FactCheck.jl:320
 in include at boot.jl:238
 in include_from_node1 at loading.jl:114
 in process_options at client.jl:303
 in _start at client.jl:389
at /Users/lucas/.julia/OpenCL/test/test_kernel.jl:163

ERROR: failed process: Process(`julia /Users/lucas/.julia/OpenCL/test/test_kernel.jl`, ProcessExited(1)) [1]
 in pipeline_error at process.jl:476
 in run at process.jl:453
 in anonymous at no file:5
 in include at boot.jl:238
 in include_from_node1 at loading.jl:114
 in process_options at client.jl:303
 in _start at client.jl:389
at /Users/lucas/.julia/OpenCL/test/runtests.jl:9

./run_tests.sh  73.30s user 1.96s system 99% cpu 1:15.54 total
exit 1

Performance compared to ordinary Julia arrays

Let's take a slightly modified example from the README:

const sum_kernel = "
   __kernel void sum(__global const float *a,
                     __global float *b)
{
      int gid = get_global_id(0);
      b[gid] = a[gid] + b[gid];
}
"
a = rand(Float32, 50_000)
b = rand(Float32, 50_000)

device, ctx, queue = cl.create_compute_context()

a_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=a)
b_buff = cl.Buffer(Float32, ctx, (:rw, :copy), hostbuf=b)    

p = cl.Program(ctx, source=sum_kernel) |> cl.build!
k = cl.Kernel(p, "sum")
cl.set_args!(k, a_buff, b_buff)

Running this example on my machine like this:

ev = cl.enqueue_kernel(queue, k, size(a))  # to create ev in global scope
@time begin
    for i=1:10_000
        ev = cl.enqueue_kernel(queue, k, size(a))
    end
    cl.wait(ev)
end

takes 0.328125 seconds (and 5 MB). Doing the same thing with native Julia arrays:

@time for i=1:10_000
     sum!(a, b)
end

it takes 0.251480 seconds (and 780 KB).

Summing arrays of numbers looks like a perfect use case for GPU computing, yet here it is worse than ordinary addition performed on the CPU.

Are these results specific to my setup, or is this expected?


Device: GeForce GT 630M on NVIDIA CUDA
Driver: nvidia-352.63
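A rough sanity check on the figures quoted above: 0.328 s / 10,000 launches is about 33 µs per enqueue, versus 0.251 s / 10,000, or about 25 µs, per sum! call on the CPU. For only 50,000 Float32 elements, a fixed per-launch cost of that size can easily dominate the kernel's actual compute time, so the comparison may be measuring launch overhead rather than arithmetic throughput.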

Compiling Julia to OpenCL SPIR instead of OpenCL C

Instead of targeting OpenCL C as an intermediate language, I would be interested in trying to target SPIR directly. SPIR is proposed as a common IR for OpenCL and is currently supported by AMD and Intel. NVIDIA support is still outstanding because it needs at least OpenCL 1.2, but it seems there is movement on that front (CUDA 6 has stubs for OCL12 functions).

An example of what SPIR looks like can be found here:
http://streamcomputing.eu/blog/2013-12-27/opencl-spir-by-example/
and the official webpage is here: https://www.khronos.org/spir

Since Julia compiles to LLVM IR and SPIR is an extension of that, it might be feasible to extend the Julia compilation process instead of creating a Julia-to-OpenCL-C compiler.

@jakebolewski Do you think that might be a sensible project?

Some more information can be found here: http://www.slideshare.net/DevCentralAMD/pl-4051-yaxunliu, especially the mapping of OpenCL C types to LLVM types.

So it seems to me that the necessary steps would be:

  • Supply Julia idioms for OpenCL functions like get_global_id
  • Translate Julia to SPIR (LLVM IR with certain restrictions)
  • Use LLVM's VerifySPIR stage to ensure validity
  • Obtain a binary either by using LLVM directly or via clBuildProgram("-x spir -spir-std=1.2", ....)
  • Then one should be able to use the program as currently in OpenCL.jl

Notes:

  • Use clang -S -x cl -fno-builtin -target spir -emit-llvm -c $filename to obtain SPIR for an OpenCL source file.

`enqueue_write_buffer` and `enqueue_read_buffer` arguments

I was wondering about the reason for enqueue_write_buffer and enqueue_read_buffer taking different arguments, i.e., why does enqueue_write_buffer take nbytes while enqueue_read_buffer does not? It seems these should be the same.

Also, nbytes is currently not used by enqueue_write_buffer, so the user does not have access to the low-level interface (instead, the number of bytes is determined from the size of the host buffer).

test OpenCL.kernel out of host memory

I have Linux Mint 17 with an NVIDIA GeForce 8600M GT. Under Windows, GPU Caps shows that an OpenCL 1.1 interface is available, although 1.0 is reported as the actual HW capability. On Linux I get similar reports from a SourceForge version of clinfo, even though a packaged version just barfs. I've been having trouble finding consistently working OpenCL demos/tests on either platform, so I'm not 100% sure of my driver installation. Your code seems to do a number of things just fine, though. At least, ./run_tests.sh at first works fine and reports lots of "facts verified". However, when it gets to OpenCL.Kernel it runs out of memory:

OpenCL.Kernel

ERROR: CLError(code=-6, CL_OUT_OF_HOST_MEMORY)
in include at ./boot.jl:245
in include_from_node1 at loading.jl:128
in process_options at ./client.jl:285
in _start at ./client.jl:354
in _start_3B_1701 at /usr/bin/../lib/x86_64-linux-gnu/julia/sys.so
while loading /home/bvanevery/.julia/v0.3/OpenCL/test/test_kernel.jl, in expression starting on line 9
while loading /home/bvanevery/.julia/v0.3/OpenCL/test/runtests.jl, in expression starting on line 4

Support for OpenCL floatN and doubleN types

It would be nice if OpenCL.jl supported the halfN, floatN, and doubleN types that OpenCL provides on the Julia side.

Currently it is possible to initialize a double4 buffer with

    test_buff = cl.Buffer(Float64, ctx, :rw, Ndim * Mdim * 4)

But working with that data on the Julia side is a hassle.
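A minimal workaround sketch until there is first-class support: declare an isbits wrapper with the same 32-byte layout as double4 and allocate the buffer with it (the type name is illustrative, and whether cl.Buffer accepts arbitrary isbits element types may depend on the version):

immutable Double4
    x::Float64
    y::Float64
    z::Float64
    w::Float64
end

test_buff = cl.Buffer(Double4, ctx, :rw, Ndim * Mdim)
host = Array(Double4, Ndim * Mdim)  # 0.4-era allocation syntax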

Incorrect result for simple multiplication kernel

Could you please explain why the assertion fails?

using OpenCL
const cl = OpenCL

dev, ctx, queue = cl.create_compute_context()

const mult_kernel = "
  __kernel void mult(__global const float2 *a,
                     __global const float2 *b,
                     __global float2 *c)
  {
    int gid = get_global_id(0);
    c[gid].x = a[gid].x*b[gid].x - a[gid].y*b[gid].y;
    c[gid].y = a[gid].x*b[gid].y + a[gid].y*b[gid].x;
  }
"

prog = cl.Program(ctx, source=mult_kernel) |> cl.build!
k_mult = cl.Kernel(prog, "mult")

A = fill(Complex64(1+im), 100, 100)
bufA = cl.Buffer(Complex64, ctx, :copy, hostbuf=A)
bufRES = cl.Buffer(Complex64, ctx, :copy, hostbuf=A)

cl.call(queue, k_mult, size(A), nothing, bufA, bufA, bufRES)
result = reshape(cl.read(queue, bufRES), size(A))

# result should be equal to A.*A, in this case all entries equal to 2im
@assert all(result .== Complex64(2im))

Bug in examples\hands_on_opencl\ex08\matmul.jl

Hi, thanks for the package!

I think that the file examples\hands_on_opencl\ex08\matmul.jl has the following bug:

It says:

#--------------------------------------------------------------------------------
#OpenCL matrix multiplication ... C row per work item, A row pivate, B col local 
#--------------------------------------------------------------------------------
kernel_source = open(readall, joinpath(src_dir, "C_block_form.cl"))
...
    evt = cl.call(queue, mmul, (Ndim,), (int(ORDER/16),),
                  int32(Mdim), int32(Ndim), int32(Pdim),
                  d_a, d_b, d_c, localmem1, localmem2)

it should be:

#--------------------------------------------------------------------------------
#OpenCL matrix multiplication ... blocked
#--------------------------------------------------------------------------------
kernel_source = open(readall, joinpath(src_dir, "C_block_form.cl"))
...
    evt = cl.call(queue, mmul, (Ndim,Mdim), (blocksize,blocksize),                        
                  int32(Mdim), int32(Ndim), int32(Pdim),
                  d_a, d_b, d_c, localmem1, localmem2)

Also, a suggestion: in the examples, the matrices are stored in row-major order, as in Python/NumPy, so that A[i,k] = h_A[(i-1)*Ndim + k] (if the first index is 1).

But in Julia, arrays are stored in column-major order, so A[i,k] = h_A[i + (k-1)*Ndim].

In the examples there is no problem, because the matrices are NxN and store the same values (AVAL, BVAL). But with different dimensions and/or stored values, it could be a problem.

To work nicely with Julia's reshape function back and forth, the arrays in the OpenCL kernels should be read as h_A[i + k*Ndim] (first index 0) instead of h_A[i*Ndim + k].

Does this make sense?

Note: the GitHub formatting is eating a couple of [ ] brackets.

~Edit(@vchuravy) added github formatting

Splitting up API

To address one of my concerns in #39, I propose to load only the OpenCL API that corresponds to the maximum version available on the machine.

So I would change the API loading in the following way:

  • bootstrap.jl -> a minimal basis for finding all platforms and checking each platform's supported version.
  • api/opencl10.jl contains all non-deprecated API calls from version 1.0
  • api/opencl10-deprecated11.jl contains all API calls from version 1.0 that were deprecated in version 1.1
  • api/opencl10-deprecated12.jl contains all API calls from version 1.0 that were deprecated in version 1.2
  • api/opencl11.jl includes opencl10.jl
  • etc. pp.
module OpenCL
    include("bootstrap.jl")

    global OPENCL_VERSION :: VersionNumber

    function __init__()
        global OPENCL_VERSION = maximum(map(cl.opencl_version, cl.platforms()))

        if OPENCL_VERSION == v"1.2"
            include("api/opencl12.jl")
        elseif OPENCL_VERSION == v"1.1"
            include("api/opencl11.jl")
            include("api/opencl11-deprecated12.jl")
        else
            include("api/opencl10.jl")
            include("api/opencl10-deprecated11.jl")
            include("api/opencl10-deprecated12.jl")
        end
    end

    include("...everything else")
end

Since we use macros to generate the API calls, an include inside a function actually works and adds the generated functions in the correct scope.

This proposal would also make it easier to add OpenCL 2.0.
