
Original comment by frost.g...@gmail.com on 14 Feb 20

CPU still beats GPU by 4x in bitonic sort. What could be the reason? (huixiangufl/aparapi)

Comments (8)

GoogleCodeExporter commented on July 18, 2024

Vivek

Thanks for the code. I think it will spawn a couple more bug reports! ;)

First, your code exposed a bytecode to OpenCL conversion error for me (how did 
you run it?). That was sad, but I will add the pattern to the JUnit test suite 
and see if I can work out what is going on. javac (Oracle) optimizes back 
branches if nested conditionals do not contain elses. 

So a pattern like this triggers the error:
if (cond1){
   if (cond2){
     ...
   }
}else{
   if (cond3){
     ...
   }
}

I had seen this previously and *thought* I had fixed it, clearly not. 

To work around this I just added dummy else branches to your run() method. 
Something like:

if (cond1){
   if (cond2){
      ...
   }else{
      temp=temp;
   }
}else{
   if (cond3){
      ...
   }else{
      temp=temp;
   }
}

Oh and I had to initialize temp to 0;

Now the code will run using OpenCL on CPU and GPU ;)
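
A minimal sketch of what the workaround might look like on a concrete compare-and-swap step. The attached run() method is not shown in this thread, so the class name, method name, and conditions below are hypothetical; only the dummy `temp=temp;` else branches and the initialization of temp to 0 come from the comment above.

```java
// Hypothetical compare-exchange step; names and conditions are illustrative,
// not taken from the attached BitonicSort.java.
final class DummyElseSketch {

    /** Puts data[i] and data[j] in ascending or descending order, swapping if needed. */
    static void compareExchange(int[] data, int i, int j, boolean ascending) {
        int temp = 0; // initialized up front, as the comment above says was needed
        if (ascending) {
            if (data[i] > data[j]) {
                temp = data[i];
                data[i] = data[j];
                data[j] = temp;
            } else {
                temp = temp; // dummy else: no effect on the result
            }
        } else {
            if (data[i] < data[j]) {
                temp = data[i];
                data[i] = data[j];
                data[j] = temp;
            } else {
                temp = temp; // dummy else: no effect on the result
            }
        }
    }
}
```

The dummy branches do not change what the method computes; they are only there so that each nested conditional has an explicit else.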

Sadly when array length was > 2^20 (1048576 ints) the OpenCL version of the 
code was 'failing' your sanity test. Not sure why this would be.  I need to dig 
into this.  Anyway for the time being I set array length to 2^20. 

These are the #'s I got after making these changes:
Array length = 1048576
SEQ:  5952 ms   // Aparapi emulating sequential code
JTP:  1869 ms   // Aparapi thread pool (I have 6 cores but Aparapi only uses 4, a power of 2)
CPU:  1571 ms   // Aparapi -> OpenCL using CPU mode of AMD driver (OpenCL CPU using all 6 cores)
GPU:  1234 ms   // Aparapi -> OpenCL using GPU (5770)

So GPU won for me. My GPU is a 5770.

The bitonic sort algorithm was actually the test-case that persuaded me to add 
explicit buffer management.  The nature of the bitonic sort algorithm basically 
ends up with a tight loop executing a kernel. 

for (....){
    kernel.execute(n);
}

If you look at the aparapi patterns wiki page you will see that this is the 
pattern that suggests the use of explicit buffer management. 
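
To make the shape of that loop concrete, here is a plain-Java sketch of a textbook bitonic sorting network. It is not the attached code: the class and method names are hypothetical, there is no Aparapi involved, and pass() merely stands in for one kernel.execute(n) call. It assumes the array length is a power of two.

```java
import java.util.Arrays;

// Plain-Java illustration only: pass() stands in for one kernel.execute(n) call.
// Class and method names are hypothetical, not from the attached BitonicSort.java.
final class BitonicLoopSketch {

    /** One pass over all indices; each pair (i, i ^ j) is handled independently. */
    static void pass(int[] a, int k, int j) {
        for (int i = 0; i < a.length; i++) {
            int partner = i ^ j;
            if (partner > i) {
                boolean ascending = (i & k) == 0;
                boolean outOfOrder = ascending ? a[i] > a[partner] : a[i] < a[partner];
                if (outOfOrder) {
                    int temp = a[i];
                    a[i] = a[partner];
                    a[partner] = temp;
                }
            }
        }
    }

    /** Sorts a power-of-two-length array ascending; returns the number of passes. */
    static int sort(int[] a) {
        int n = a.length;
        if (n == 0 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        int passes = 0;
        for (int k = 2; k > 0 && k <= n; k <<= 1) {   // k > 0 guards against int overflow
            for (int j = k >> 1; j > 0; j >>= 1) {
                pass(a, k, j); // with Aparapi this would be a kernel launch
                passes++;
            }
        }
        return passes;
    }

    public static void main(String[] args) {
        int[] data = {7, 3, 6, 0, 5, 1, 4, 2};
        int passes = sort(data);
        System.out.println(Arrays.toString(data) + " in " + passes + " passes");
    }
}
```

For an array of 2^m elements this network needs m(m+1)/2 passes, so 2^20 ints means 210 launches in a row over the same array, which is why per-launch costs matter so much here.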

So the changes that I made to your code (other than a work around for the 
bytecode->opencl bug!) were 
1) sort.setExplicit(true)
2) sort.put(theArray) before entering the loop
3) sort.get(theArray) after exiting the loop
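
A rough way to see why these three changes help is to count buffer copies. The sketch below is a toy model, not Aparapi: the class is hypothetical, its methods only borrow the names setExplicit/put/get/execute from the list above, and it assumes that without explicit mode the array is copied to the device before, and back after, every execute call.

```java
// Toy transfer counter; a hypothetical stand-in, NOT the Aparapi Kernel class.
final class TransferCountSketch {
    private boolean explicit = false;
    int hostToDevice = 0;
    int deviceToHost = 0;

    void setExplicit(boolean flag) { explicit = flag; }
    void put() { hostToDevice++; }   // explicit copy to the device
    void get() { deviceToHost++; }   // explicit copy back to the host

    /** Models one kernel launch; assumed to copy both ways unless explicit. */
    void execute() {
        if (!explicit) {
            hostToDevice++;
            deviceToHost++;
        }
    }

    /** Number of launches a bitonic network needs for n = 2^m elements. */
    static int passes(int n) {
        int count = 0;
        for (long k = 2; k <= n; k <<= 1) {
            for (long j = k >> 1; j > 0; j >>= 1) {
                count++;
            }
        }
        return count;
    }

    /** Total copies in either direction for one full sort of n elements. */
    static int totalTransfers(int n, boolean explicit) {
        TransferCountSketch kernel = new TransferCountSketch();
        kernel.setExplicit(explicit);
        if (explicit) kernel.put();          // step 2: before entering the loop
        int launches = passes(n);
        for (int i = 0; i < launches; i++) {
            kernel.execute();
        }
        if (explicit) kernel.get();          // step 3: after exiting the loop
        return kernel.hostToDevice + kernel.deviceToHost;
    }

    public static void main(String[] args) {
        int n = 1 << 20;
        System.out.println("implicit: " + totalTransfers(n, false));
        System.out.println("explicit: " + totalTransfers(n, true));
    }
}
```

Under that assumption, sorting 2^20 ints takes 210 launches, so the model counts 420 copies in implicit mode against 2 in explicit mode.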



The #'s for me now are.

SEQ:  5929 ms   // Aparapi emulating sequential code
JTP:  1855 ms   // Aparapi thread pool (I have 6 cores but Aparapi only uses 4, a power of 2)
CPU:  1327 ms   // Aparapi -> OpenCL using CPU mode of AMD driver (OpenCL CPU using all 6 cores)
GPU:   610 ms   // Aparapi -> OpenCL using GPU (5770)

So SEQ and JTP did not change (makes sense, no OpenCL involved).

CPU went down a little (buffer transfer costs are no-ops when using OpenCL CPU), 
so I would not expect explicit buffers to have had a big advantage here.

GPU was much better, about 2x faster than the OpenCL CPU run for me. 

Would you retest using the attached code (with my changes)?

Original comment by [email protected] on 17 Dec 2011 at 5:03

Attachments:

BitonicSort.java

from aparapi.

GoogleCodeExporter commented on July 18, 2024

Mr. Gary

Actual output of the initial program was this:
+------------------------------------
Initializing data...
Execution mode=GPU
retargetting 56 -> 154 to 95
retargetted 56 -> 95

 Time taken by kernel :11640 milliseconds
TEST PASSED

-----------------------------------+
As you can see, there are two extra lines about retargeting, which hint at the 
bug. I was focused mainly on getting better results, so I ignored them. All of 
this ran without any compilation error.

Secondly, I am still getting better results on the CPU than on the GPU, though 
they are almost equal. One reason, I think, is that my CPU has a better 
performance specification than my GPU (5470). So never mind; a GPU with a 
better specification, like yours, will eventually win.

Explicit buffer management really improves things a lot. And this is the first 
time I have seen dummy else branches. I will keep them in mind from now on. 
Anyway, the results for me now are:

Array_size: 256
SEQ:     3 ms   // Aparapi emulating sequential code
JTP:    41 ms   // Aparapi thread pool (all 4 cores)
CPU:   875 ms   // Aparapi -> OpenCL using CPU (all 4 cores)
GPU:   916 ms   // Aparapi -> OpenCL using GPU (5470)

Array_size: 524288
SEQ:   985 ms   // Aparapi emulating sequential code
JTP:   585 ms   // Aparapi thread pool (all 4 cores)
CPU:  1277 ms   // Aparapi -> OpenCL using CPU (all 4 cores)
GPU:  1388 ms   // Aparapi -> OpenCL using GPU (5470)

Array_size: 1048576
SEQ:  2184 ms   // Aparapi emulating sequential code
JTP:  1091 ms   // Aparapi thread pool (all 4 cores)
CPU:  1724 ms   // Aparapi -> OpenCL using CPU (all 4 cores)
GPU:  1956 ms   // Aparapi -> OpenCL using GPU (5470)

Array_size: 33554432
JTP: 44123 ms   // Aparapi thread pool (all 4 cores)
CPU: 38419 ms   // Aparapi -> OpenCL using CPU (all 4 cores)
GPU: 49322 ms   // Aparapi -> OpenCL using GPU (5470)

So finally I am getting almost equivalent results on GPU and CPU for arrays 
with length > 256. 


Thank you
Vivek

Original comment by [email protected] on 17 Dec 2011 at 8:08


GoogleCodeExporter commented on July 18, 2024

Vivek

So it looks like your CPU has two cores.  I think you mentioned that this is an 
Intel CPU?

I am still surprised that the GPU does not do better here, although I have no 
experience with the 5470. It might not be as performant as I expect.

For the smaller array sizes the cost of the bytecode -> OpenCL conversion is 
skewing the data. My guess is that this conversion takes ~200ms. This is 
another example where OpenCL is not efficient for small data/compute tests.

Even for me the JTP mode beats the GPU until we get to 2^16 integers.

This example code has turned out to be a good test workload. I am seeing 
occasional failures on the GPU (where the assertion that array[i-1]<=array[i] 
fails). It is weird, so I am now converting the example to pure OpenCL to see 
whether this is a Java/OpenCL artifact.
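
For reference, a sanity check of this kind can be as small as the sketch below. The actual test in the attachment is not shown in this thread, so the class and method names are hypothetical; only the condition array[i-1]<=array[i] comes from the comment above. Returning the first failing index, rather than a boolean, is a choice made here to make intermittent failures easier to locate.

```java
// Hypothetical sanity check; only the array[i-1] <= array[i] condition is from the thread.
final class SortedCheckSketch {

    /** Returns the first index i where array[i-1] > array[i], or -1 if the array is sorted. */
    static int firstViolation(int[] array) {
        for (int i = 1; i < array.length; i++) {
            if (array[i - 1] > array[i]) {
                return i;
            }
        }
        return -1;
    }
}
```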

Can you try without the dummy else clauses to see whether the bytecode to 
OpenCL conversion is indeed OK? The two lines of debugging output that you see 
(I need to get rid of those) actually come from the bug fix I added to address 
this nested conditional bug. So maybe it is working for you.

BTW what version of APP_SDK are you using?

Original comment by [email protected] on 17 Dec 2011 at 10:10


GoogleCodeExporter commented on July 18, 2024

Yes, there are 2 cores but 4 threads (according to CPU-Z; snapshot attached). 
I also see 4 separate CPU usage graphs in Windows Task Manager.

I also tried running bitonic sort with the Java binding JOCL (JogAmp's), using 
the source code provided in their sample set. It took 14 seconds for array 
size 2<<19, while Aparapi on the GPU takes 2 seconds with explicit buffers.

The JOCL output is as follows:
+-------------------------------------------------------------------------------
------------------------------------------------
Initializing OpenCL...
Initializing OpenCL bitonic sorter...
    creating bitonic sort program
    checking minimum supported workgroup size
Creating OpenCL memory objects...
4.194304
Initializing data...

Test array length 1048576 (1 arrays in the batch)...
14619ms
1, 2, 3, 4, 8, 10, 11, 11, 14, 14, 15, 15, 15, 16, 16, 16, 17, 19, 20, 22,
...; 1048556 more

TEST PASSED
--------------------------------------------------------------------------------
----------------------------------------------+


No, the first dummy else clause is required; otherwise those two lines are
printed. The two lines depend only on the first dummy else clause, and they
appear only in CPU and GPU modes. I deleted only the second dummy else clause
and got no bug lines in any of the 4 execution modes. In short, the first
dummy else clause is the key.


OpenCL 1.1 AMD-APP-SDK-v2.5 (732.1)
I am attaching the output of clinfo.

Original comment by [email protected] on 18 Dec 2011 at 5:28


GoogleCodeExporter commented on July 18, 2024

Email attachments are skipped here, so I am re-uploading them.

Original comment by [email protected] on 18 Dec 2011 at 5:31

Attachments:


GoogleCodeExporter commented on July 18, 2024

Original comment by [email protected] on 14 Feb 2012 at 5:31

Changed state: Accepted


GoogleCodeExporter commented on July 18, 2024

I think we can close this. I will re-open if anyone screams.

Original comment by [email protected] on 21 Feb 2012 at 3:29

Changed state: WontFix


GoogleCodeExporter commented on July 18, 2024

Hi Vivek, I tested your implementation on an i7 2600 and a GTX 480. With an 
array of 2^27 this GPU is 17x faster than the CPU. With an i7 3610 and an ATI 
Radeon 7670M, on the other hand, the CPU is 2x faster than the GPU. It's a 
hardware problem :) PS: interesting implementation.

Original comment by [email protected] on 17 Jan 2014 at 9:22
