Comments (16)

GoogleCodeExporter commented on July 18, 2024
Can I just confirm that your request was for Kernel.execute(16) so the 
globalSize we pass through JNI does match your request?

It looks like the calculation we do for localSize is failing when the 
requested size is small (possibly less than 64).

As a workaround, specify 64 as your globalSize (kernel.execute(64)) and guard 
your kernel using:

new Kernel(){
   public void run(){
      // Guard so that only the first 16 work items execute the body.
      if (getGlobalId() < 16){
          // your code here
      }
   }
};
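
Putting it together, a minimal sketch (the class name and the guarded body are 
placeholders, not from your report):

import com.amd.aparapi.Kernel;

public class GuardedKernelExample {
   public static void main(String[] args) {
      Kernel kernel = new Kernel(){
         @Override public void run(){
            // Only the first 16 work items do real work.
            if (getGlobalId() < 16){
               // your code here
            }
         }
      };
      kernel.execute(64); // padded dispatch size; the guard masks off ids 16..63
      kernel.dispose();
   }
}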

Apologies for this. Clearly we need some test cases for low range values. 
Note that unless your kernel is doing a lot of work (computation + loops), it 
is unlikely that a kernel with such a small 'range' will be very performant.

Original comment by [email protected] on 15 Nov 2011 at 4:20

  • Changed state: Accepted

GoogleCodeExporter commented on July 18, 2024
Nope, my call was Kernel.execute(32, 1).

It looks like the value passed through JNI is correct, but it is changed at 
line 1073.

I have tried sizes 32, 64, 128, 256, and 512, and all show the same problem: 
the error message reports "global=x/2 local=x" (e.g. global=256 local=512 for 
size 512).

Original comment by [email protected] on 15 Nov 2011 at 4:23

GoogleCodeExporter commented on July 18, 2024
Oh bugger.  How did that ever work ;) Let me take a closer look. 

Original comment by [email protected] on 15 Nov 2011 at 4:29

GoogleCodeExporter commented on July 18, 2024
So this is a remnant of my attempt to push compute across multiple devices. I 
thought I had backed this code out before open sourcing.

My intent was to dispatch half the compute to one device and half to another 
(your 6990 is seen as two separate GPU devices - you probably knew that 
already), but this required the kernel to be written very carefully so that 
the buffers could be divided equally.

I can fix this (i.e. make it work), but I suspect that you will be 
disappointed, because the fix will mean that only one half of your GPU will be 
used (and so will any other dual device - I have a 5990 which I can test with 
here, and which will exhibit the same error).

Clearly I have not tested enough with this card.  

Original comment by [email protected] on 15 Nov 2011 at 4:36

GoogleCodeExporter commented on July 18, 2024
Here is a suggested hack to get you up and running.

Around line #446 in aparapi.cpp:

// Get the # of devices
status = clGetDeviceIDs(platforms[i], deviceType, 0, NULL, &deviceIdc);
// now check if this platform supports the requested device type (GPU or CPU)
if (status == CL_SUCCESS && deviceIdc >0 ){
   platform = platforms[i];
   ...

Add

   deviceIdc=1;

as the first statement in the conditional, giving us:

// Get the # of devices
status = clGetDeviceIDs(platforms[i], deviceType, 0, NULL, &deviceIdc);
// now check if this platform supports the requested device type (GPU or CPU)
if (status == CL_SUCCESS && deviceIdc >0 ){
   deviceIdc=1;  // Hack here for issue #18
   platform = platforms[i];
   ...

Hopefully this will get you back up and running (forcing deviceIdc to 1 means 
the global size is no longer split across devices). I need to decide whether 
to re-enable (and fix) multiple device support or whether to remove it. This 
will need some more work.

Again apologies for this, and also apologies that you are discovering all these 
bugs.   I do appreciate your help uncovering these.

Gary

Original comment by [email protected] on 15 Nov 2011 at 4:48

GoogleCodeExporter commented on July 18, 2024
It is a new project; I do not expect it to be free of bugs.

It is also a strange field, working with high-level languages and low-level 
execution, so it will take some time for a project like this to mature and 
attract users.

Anyway, I am a PhD student, so I actually get paid for trying stuff like this 
and finding/fixing errors :)

If you want to fix it, there could be an issue with uneven workloads, say 4 
devices and global/local = 5; perhaps just revert to a single device in that 
case, as in the sketch below. It is also problematic that the data needs to be 
copied multiple times, and merging the results back could be a real problem.
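
Something along these lines might be enough as a first cut (just a sketch; 
devicesToUse is an illustrative name, not anything that exists in Aparapi):

// Split the work across devices only when it divides evenly;
// otherwise revert to a single device.
static int devicesToUse(int globalSize, int deviceCount) {
   return (globalSize % deviceCount == 0) ? deviceCount : 1;
}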

I will apply the deviceIdc = 1 fix, re-compile the library, and test tomorrow.

Thanks for making the project open-source and actually responding to these 
reports :)

Original comment by [email protected] on 15 Nov 2011 at 6:23

GoogleCodeExporter commented on July 18, 2024
Just to add some confusion: I tested with my 5970 (I mistyped earlier when I 
referenced a 5990); it gets detected as two devices, and sharing execution 
across them worked. Applying the suggested hack above made things much slower: 
Mandel, for example, was 20fps instead of 55fps, and NBody also slowed 
considerably.

This needs a lot of thought; I agree that unbalanced workloads will be even 
scarier.

Maybe we need to expose the devices, so the user can request multiple devices 
if they feel it will be of benefit. I really wanted to avoid this.

I note that JOCL has a method which discovers the device with max flops. 
Another idea might be to run in both modes initially (assuming I/we fix the 
bug ;)) and then 'learn' which is most performant - see the sketch below. 
Hmmm.
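
Roughly the shape I have in mind (pure sketch; there is no single- vs 
multi-device switch in Aparapi today, so in practice you would time one pass 
per build, e.g. with and without the deviceIdc=1 hack):

import com.amd.aparapi.Kernel;

public class ModeTimer {
   // Time one full pass of a kernel; run once per dispatch mode and
   // remember which mode was faster for subsequent executions.
   public static long timePass(Kernel kernel, int globalSize) {
      long start = System.nanoTime();
      kernel.execute(globalSize);
      return System.nanoTime() - start;
   }
}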

Let me know if the hack above at least works for you.

Gary 

Original comment by [email protected] on 15 Nov 2011 at 6:35

GoogleCodeExporter commented on July 18, 2024
Revision #110 contains the above hack if you want to try it out. 

I guarded the warning behind the -Dcom.amd.aparapi.enableVerboseJNI=true flag.

Will keep this open, because (as indicated above) this is not a fix, just a 
workaround.
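
For example, to see the warning again, launch with something like (classpath 
and main class here are placeholders):

java -Dcom.amd.aparapi.enableVerboseJNI=true -cp aparapi.jar:myapp.jar MyMain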

Original comment by [email protected] on 15 Nov 2011 at 6:46

GoogleCodeExporter commented on July 18, 2024
The workaround enables Aparapi to run the sample applications, and it is 
pretty fast on the AMD-based machine, but the NVidia machine is now running 
slower than the Java version. The strange thing is that the JOCL version runs 
fast on both machines.

Original comment by [email protected] on 16 Nov 2011 at 2:04

GoogleCodeExporter commented on July 18, 2024
Does the NVidia machine report its card as multiple devices? Is that why it is 
being negatively impacted by this workaround?

If so, I guess we could make this 'hack' conditional, i.e. apply it only for 
AMD devices, if that helps.

Can we also confirm that the NVidia driver is OpenCL 1.1?


Original comment by [email protected] on 16 Nov 2011 at 5:37

GoogleCodeExporter commented on July 18, 2024
Yep, the NVidia machine reports the same "two devices". It did not work before 
the workaround; it gave the exact same error as the AMD machine.

Making the hack optional does not solve the issue, because then we go back to 
the original problem.

Yes, it reports OpenCL 1.1.

I will have a go tomorrow at figuring out why this happens. I can compare what 
JOCL does to what Aparapi does and hopefully spot where it goes wrong.

Original comment by [email protected] on 16 Nov 2011 at 6:40

GoogleCodeExporter commented on July 18, 2024
After running some more tests, I can see that the NVidia machine does in fact 
offer a speedup.

On the AMD machine, the speedup obtained through Aparapi and JOCL is pretty 
much the same, with JOCL only being slightly faster (~2%).

On the NVidia machine the difference is much larger (~40%). After scaling the 
problem to a suitable size, there is a clear performance gain using either 
method though. So the hack does work correctly on the NVidia machine as well.

Looking at the generated OpenCL code, there is really no difference from the 
hand-written OpenCL, except that the Aparapi version uses a few local 
variables. This is not really related to the original issue, though, and is 
likely just some special case where the NVidia kernel is slower.

Original comment by [email protected] on 25 Nov 2011 at 12:26

GoogleCodeExporter commented on July 18, 2024
Kenneth, I think the recent Range-related changes should have fixed this.

Can you confirm for me?

Gary 

Original comment by [email protected] on 23 Feb 2012 at 8:13

GoogleCodeExporter commented on July 18, 2024
Based on the final comment, and the fact that the last activity was over a 
year ago, this issue can likely be closed.

Original comment by [email protected] on 29 Mar 2013 at 11:35

GoogleCodeExporter commented on July 18, 2024
Yes, I think you can close it. I no longer have access to the machines that 
exhibited the problem, so I cannot verify.

Original comment by [email protected] on 1 Apr 2013 at 11:19

GoogleCodeExporter commented on July 18, 2024

Original comment by [email protected] on 20 Apr 2013 at 12:31

  • Changed state: WontFix
