
opencl-101's Introduction

OpenCL-101

Learn OpenCL step by step, as outlined below.

  1. OpenCL Examples
  2. Installation Guide of OpenCL

1. OpenCL Examples

  1. vec-add-standard: A standard OpenCL example that performs addition of two int-type arrays (vectors).
  2. vec-add-simple: A simple addition of two vectors (NOT recommended).
  3. bandwidth: Measures bandwidth for various variable types.
  4. mat-transpose: A demo of matrix transpose with two OpenCL kernels.

2. Installation Guide of OpenCL

You can choose one of the following ways to use OpenCL:

  1. Install OpenCL on Ubuntu 16.04 64-bit
  2. Using OpenCL by Docker
  3. Using OpenCL by WSL2

2.1 Install OpenCL on Ubuntu 16.04 64-bit

# The instructions below follow "Installing OpenCL on Linux: Ubuntu 14.04 + OpenCL 1.1" (CSDN blog by qccz123456)
# http://blog.csdn.net/qccz123456/article/details/52606788

$ sudo apt-get update
$ sudo apt-get install build-essential g++ cmake
$ sudo apt-get install clang libclang-3.4-dev libclang-dev libclang1
$ sudo apt-get install ocl-icd-opencl-dev ocl-icd-libopencl1
$ sudo apt-get install opencl-headers ocl-icd-dev ocl-icd-libopencl1

# The instructions below follow "Enabling OpenCL for Intel graphics on Ubuntu 16.04.2" (linuxidc.com)
# http://www.linuxidc.com/Linux/2017-03/141455.htm

$ sudo apt install ocl-icd-libopencl1
$ sudo apt install opencl-headers
$ sudo apt install clinfo
$ sudo apt install ocl-icd-opencl-dev
$ sudo apt install beignet

2.2 Using OpenCL by Docker

Using Docker is convenient: you don't need to configure and install the whole OpenCL environment yourself. Of course, install Docker Community Edition first, then search for relevant images on DockerHub.

After finishing the Docker installation, please follow the instructions from chihchun/opencl-intel. If everything goes well, run the command below:

$ docker run -t -i --device /dev/dri:/dev/dri \
chihchun/hashcat-beignet hashcat -b

It should print messages similar to those shown in section 2.3 Verify Installation.

2.3 Verify Installation

Run the command below; a successful installation will print messages like the following:

$ clinfo

# prints the messages below

Number of platforms                               1
  Platform Name                                   Intel Gen OCL Driver
  Platform Vendor                                 Intel
  Platform Version                                OpenCL 1.2 beignet 1.1.1
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_spir cl_khr_icd
  Platform Extensions function suffix             Intel

  Platform Name                                   Intel Gen OCL Driver
Number of devices                                 1
  Device Name                                     Intel(R) HD Graphics IvyBridge M GT2
  Device Vendor                                   Intel
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 1.2 beignet 1.1.1
  Driver Version                                  1.1.1
  Device OpenCL C Version                         OpenCL C 1.2 beignet 1.1.1
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Max compute units                               16
  Max clock frequency                             1000MHz
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None, None, None
  Max work item dimensions                        3
  Max work item sizes                             512x512x512
  Max work group size                             512
  Preferred work group size multiple              16
  Preferred / native vector sizes                 
    char                                                16 / 8       
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 0 / 8        (n/a)
    float                                                4 / 4       
    double                                               0 / 2        (n/a)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    32, Little-Endian
  Global memory size                              2147483648 (2GiB)
  Error Correction support                        No
  Max memory allocation                           1073741824 (1024MiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        8192
  Global Memory cache line                        64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             8192x8192 pixels
    Max 3D image size                             8192x8192x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Global
  Local memory size                               65536 (64KiB)
  Max constant buffer size                        134217728 (128MiB)
  Max number of constant args                     8
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      80ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
    SPIR versions                                 <printDeviceInfo:138: get   SPIR versions size : error -30>
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                __cl_copy_region_align4;__cl_copy_region_align16;__cl_cpy_region_unalign_same_offset;__cl_copy_region_unalign_dst_offset;__cl_copy_region_unalign_src_offset;__cl_copy_buffer_rect;__cl_copy_image_1d_to_1d;__cl_copy_image_2d_to_2d;__cl_copy_image_3d_to_2d;__cl_copy_image
_2d_to_3d;__cl_copy_image_3d_to_3d;__cl_copy_image_2d_to_buffer;__cl_copy_image_3d_to_buffer;__cl_copy_buffer_to_image_2d;__cl_copy_buffer_to_image_3d;__cl_fill_region_unalign;__cl_fill_region_align2;__cl_fill_region_align4;__cl_fill_region_align8_2;__cl_fill_region_align8_4;__cl_fill_region_align8_8;__cl_fill_region_
align8_16;__cl_fill_region_align128;__cl_fill_image_1d;__cl_fill_image_1d_array;__cl_fill_image_2d;__cl_fill_image_2d_array;__cl_fill_image_3d;
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_spir cl_khr_icd

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Intel Gen OCL Driver
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [Intel]
  clCreateContext(NULL, ...) [default]            Success [Intel]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Intel Gen OCL Driver
    Device Name                                   Intel(R) HD Graphics IvyBridge M GT2
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Intel Gen OCL Driver
    Device Name                                   Intel(R) HD Graphics IvyBridge M GT2

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.8
  ICD loader Profile                              OpenCL 1.2
        NOTE:   your OpenCL library declares to support OpenCL 1.2,
                but it seems to support up to OpenCL 2.1 too.  

How to compile OpenCL example in GCC?

To be precise, kernel compilation in OpenCL happens at run time (via a library call).

With GCC, you only need the headers to compile (available on the Khronos site). For linking, you have to install an OpenCL-compatible driver.

In the Makefile (see the compile sketch after the list):

  • for macOS: -framework OpenCL
  • for Linux: -lOpenCL
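
For example, a minimal compile command might look like this (a sketch: main.c and the vec_add output name are placeholders):

# Linux: link against the OpenCL ICD loader
$ gcc main.c -o vec_add -lOpenCL

# macOS: link against the OpenCL framework instead
$ gcc main.c -o vec_add -framework OpenCL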

ref: How to compile OpenCL example in GCC?
https://forums.khronos.org/showthread.php/5728-How-to-compile-OpenCL-example-in-GCC

2.4 Using OpenCL by WSL2

Other problems

git error: unable to auto-detect email address

yuanshuai@firefly:~/code/OpenCL-101$ git commit -m "update README.md"

*** Please tell me who you are.

Run

  git config --global user.email "[email protected]"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: unable to auto-detect email address (got 'yuanshuai@firefly.(none)')

After following the instructions above, the same error still occurred. I reset user.email and user.name using git config --local user.email "[email protected]" and git config --local user.name "Your name", and then it worked.

ref: "Solution to the git error 'unable to auto-detect email address'" (CSDN blog by liufangbaishi2014) http://blog.csdn.net/liufangbaishi2014/article/details/50037507

opencl-101's People

Contributors

marvinlmw, ysh329


opencl-101's Issues

Performance difference between two write methods in kernel

There are two ways to write the results in the kernel function; is there any performance difference between them?

First

*(__global CL_INPUT_TYPE *)(c + row * N + col) = cc1.s0;         *(__global CL_INPUT_TYPE *)(c + row * N + (col+1)) = cc1.s1;
*(__global CL_INPUT_TYPE *)(c + (row+1) * N + col) = cc2.s0;     *(__global CL_INPUT_TYPE *)(c + (row+1) * N + (col+1)) = cc2.s1;

Second

c[row * N + col] = cc1.s0;        c[row * N + (col+1)] = cc1.s1;
c[(row+1) * N + col] = cc2.s0;    c[(row+1) * N + (col+1)] = cc2.s1;

Complete Code

This is from a matrix multiplication implementation; its complete kernel code is below:

__kernel void mat_mult_vec2x2_continue(const int M, const int N, const int K, __global const CL_INPUT_TYPE *a, __global const CL_INPUT_TYPE *b, __global CL_INPUT_TYPE *c) {
    const int col = get_global_id(0) << 1;
    const int row = get_global_id(1) << 1;

    CL_ELEM_TYPE aa1, aa2,
                 bb1, bb2,
                 cc1 = 0,
                 cc2 = 0;

    for (int p = 0; p < K; p+=2) {
        aa1 = *(
                   (__global CL_ELEM_TYPE *)(a + row * K + p)
               );
        aa2 = *(
                   (__global CL_ELEM_TYPE *)(a + (row+1) * K + p)
               );

        bb1 = *(
                   (__global CL_ELEM_TYPE *)(b + p * N + col)
               );  
        bb2 = *(
                   (__global CL_ELEM_TYPE *)(b + (p+1) * N + col)
               );  
    
        //cc1 = (CL_ELEM_TYPE)
        //          (aa1.s0*bb1.s0 + aa1.s1*bb2.s0,    aa1.s0*bb1.s1 + aa1.s1*bb2.s1);
        //cc2 = (CL_ELEM_TYPE)
        //          (aa2.s0*bb1.s0 + aa2.s1*bb2.s0,    aa2.s0*bb1.s1 + aa2.s1*bb2.s1);
        cc1.s0 += aa1.s0 * bb1.s0 + aa1.s1 * bb2.s0;
        cc1.s1 += aa1.s0 * bb1.s1 + aa1.s1 * bb2.s1;
        cc2.s0 += aa2.s0 * bb1.s0 + aa2.s1 * bb2.s0;
        cc2.s1 += aa2.s0 * bb1.s1 + aa2.s1 * bb2.s1;
    }

    //*(__global CL_INPUT_TYPE *)(c + row * N + col) = cc1.s0;         *(__global CL_INPUT_TYPE *)(c + row * N + (col+1)) = cc1.s1;
    //*(__global CL_INPUT_TYPE *)(c + (row+1) * N + col) = cc2.s0;     *(__global CL_INPUT_TYPE *)(c + (row+1) * N + (col+1)) = cc2.s1;

    c[row * N + col] = cc1.s0;        c[row * N + (col+1)] = cc1.s1;
    c[(row+1) * N + col] = cc2.s0;    c[(row+1) * N + (col+1)] = cc2.s1;

}

How to define the size of block?

Below is a piece of CPU GEMM(4x4_11) code.

/* Create macros so that the matrices are stored in column-major order */

#define A(i,j) a[ (j)*lda + (i) ]
#define B(i,j) b[ (j)*ldb + (i) ]
#define C(i,j) c[ (j)*ldc + (i) ]

/* Block sizes */
#define mc 256
#define kc 128

#define min( i, j ) ( (i)<(j) ? (i): (j) )

/* Routine for computing C = A * B + C */

void AddDot4x4( int, double *, int, double *, int, double *, int );

void MY_MMult( int m, int n, int k, double *a, int lda, 
                                    double *b, int ldb,
                                    double *c, int ldc )
{
  int i, j, p, pb, ib;

  /* This time, we compute a mc x n block of C by a call to the InnerKernel */

  for ( p=0; p<k; p+=kc ){
    pb = min( k-p, kc );
    for ( i=0; i<m; i+=mc ){
      ib = min( m-i, mc );
      InnerKernel( ib, n, pb, &A( i,p ), lda, &B(p, 0 ), ldb, &C( i,0 ), ldc );
    }
  }
}

void InnerKernel( int m, int n, int k, double *a, int lda, 
                                       double *b, int ldb,
                                       double *c, int ldc )
{
  int i, j;

  for ( j=0; j<n; j+=4 ){        /* Loop over the columns of C, unrolled by 4 */
    for ( i=0; i<m; i+=4 ){        /* Loop over the rows of C */
      /* Update C( i,j ), C( i,j+1 ), C( i,j+2 ), and C( i,j+3 ) in
	 one routine (four inner products) */

      AddDot4x4( k, &A( i,0 ), lda, &B( 0,j ), ldb, &C( i,j ), ldc );
    }
  }
}

#include <mmintrin.h>   // MMX
#include <xmmintrin.h>  // SSE
#include <emmintrin.h>  // SSE2
#include <pmmintrin.h>  // SSE3 (_mm_loaddup_pd)

typedef union
{
  __m128d v;
  double d[2];
} v2df_t;

void AddDot4x4( int k, double *a, int lda,  double *b, int ldb, double *c, int ldc )
{
  /* So, this routine computes a 4x4 block of matrix C

           C( 0, 0 ), C( 0, 1 ), C( 0, 2 ), C( 0, 3 ).  
           C( 1, 0 ), C( 1, 1 ), C( 1, 2 ), C( 1, 3 ).  
           C( 2, 0 ), C( 2, 1 ), C( 2, 2 ), C( 2, 3 ).  
           C( 3, 0 ), C( 3, 1 ), C( 3, 2 ), C( 3, 3 ).  

     Notice that this routine is called with c = C( i, j ) in the
     previous routine, so these are actually the elements 

           C( i  , j ), C( i  , j+1 ), C( i  , j+2 ), C( i  , j+3 ) 
           C( i+1, j ), C( i+1, j+1 ), C( i+1, j+2 ), C( i+1, j+3 ) 
           C( i+2, j ), C( i+2, j+1 ), C( i+2, j+2 ), C( i+2, j+3 ) 
           C( i+3, j ), C( i+3, j+1 ), C( i+3, j+2 ), C( i+3, j+3 ) 
	  
     in the original matrix C 

     And now we use vector registers and instructions */

  int p;
  v2df_t
    c_00_c_10_vreg,    c_01_c_11_vreg,    c_02_c_12_vreg,    c_03_c_13_vreg,
    c_20_c_30_vreg,    c_21_c_31_vreg,    c_22_c_32_vreg,    c_23_c_33_vreg,
    a_0p_a_1p_vreg,
    a_2p_a_3p_vreg,
    b_p0_vreg, b_p1_vreg, b_p2_vreg, b_p3_vreg; 

  double 
    /* Point to the current elements in the four columns of B */
    *b_p0_pntr, *b_p1_pntr, *b_p2_pntr, *b_p3_pntr; 
    
  b_p0_pntr = &B( 0, 0 );
  b_p1_pntr = &B( 0, 1 );
  b_p2_pntr = &B( 0, 2 );
  b_p3_pntr = &B( 0, 3 );

  c_00_c_10_vreg.v = _mm_setzero_pd();   
  c_01_c_11_vreg.v = _mm_setzero_pd();
  c_02_c_12_vreg.v = _mm_setzero_pd(); 
  c_03_c_13_vreg.v = _mm_setzero_pd(); 
  c_20_c_30_vreg.v = _mm_setzero_pd();   
  c_21_c_31_vreg.v = _mm_setzero_pd();  
  c_22_c_32_vreg.v = _mm_setzero_pd();   
  c_23_c_33_vreg.v = _mm_setzero_pd(); 

  for ( p=0; p<k; p++ ){
    a_0p_a_1p_vreg.v = _mm_load_pd( (double *) &A( 0, p ) );
    a_2p_a_3p_vreg.v = _mm_load_pd( (double *) &A( 2, p ) );

    b_p0_vreg.v = _mm_loaddup_pd( (double *) b_p0_pntr++ );   /* load and duplicate */
    b_p1_vreg.v = _mm_loaddup_pd( (double *) b_p1_pntr++ );   /* load and duplicate */
    b_p2_vreg.v = _mm_loaddup_pd( (double *) b_p2_pntr++ );   /* load and duplicate */
    b_p3_vreg.v = _mm_loaddup_pd( (double *) b_p3_pntr++ );   /* load and duplicate */

    /* First row and second rows */
    c_00_c_10_vreg.v += a_0p_a_1p_vreg.v * b_p0_vreg.v;
    c_01_c_11_vreg.v += a_0p_a_1p_vreg.v * b_p1_vreg.v;
    c_02_c_12_vreg.v += a_0p_a_1p_vreg.v * b_p2_vreg.v;
    c_03_c_13_vreg.v += a_0p_a_1p_vreg.v * b_p3_vreg.v;

    /* Third and fourth rows */
    c_20_c_30_vreg.v += a_2p_a_3p_vreg.v * b_p0_vreg.v;
    c_21_c_31_vreg.v += a_2p_a_3p_vreg.v * b_p1_vreg.v;
    c_22_c_32_vreg.v += a_2p_a_3p_vreg.v * b_p2_vreg.v;
    c_23_c_33_vreg.v += a_2p_a_3p_vreg.v * b_p3_vreg.v;
  }

  C( 0, 0 ) += c_00_c_10_vreg.d[0];  C( 0, 1 ) += c_01_c_11_vreg.d[0];  
  C( 0, 2 ) += c_02_c_12_vreg.d[0];  C( 0, 3 ) += c_03_c_13_vreg.d[0]; 

  C( 1, 0 ) += c_00_c_10_vreg.d[1];  C( 1, 1 ) += c_01_c_11_vreg.d[1];  
  C( 1, 2 ) += c_02_c_12_vreg.d[1];  C( 1, 3 ) += c_03_c_13_vreg.d[1]; 

  C( 2, 0 ) += c_20_c_30_vreg.d[0];  C( 2, 1 ) += c_21_c_31_vreg.d[0];  
  C( 2, 2 ) += c_22_c_32_vreg.d[0];  C( 2, 3 ) += c_23_c_33_vreg.d[0]; 

  C( 3, 0 ) += c_20_c_30_vreg.d[1];  C( 3, 1 ) += c_21_c_31_vreg.d[1];  
  C( 3, 2 ) += c_22_c_32_vreg.d[1];  C( 3, 3 ) += c_23_c_33_vreg.d[1]; 
}

make warning: ‘clCreateCommandQueue’ is deprecated [-Wdeprecated-declarations]

bandwidth.c: In function ‘main’:
bandwidth.c:262:5: warning: ‘clCreateCommandQueue’ is deprecated [-Wdeprecated-declarations]
     command_queue = clCreateCommandQueue(context, device_id, 0, &ret);
     ^
In file included from bandwidth.c:37:0:
/usr/include/CL/cl.h:1359:1: note: declared here
 clCreateCommandQueue(cl_context                     /* context */,
 ^
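
A common way to deal with this warning (a sketch, not from this repo): either keep the OpenCL 1.2 call and acknowledge the deprecated API before including cl.h, or, when targeting OpenCL >= 2.0, switch to the replacement function:

/* Option 1: keep clCreateCommandQueue and silence the warning by
   declaring the use of deprecated 1.2 APIs before including cl.h. */
#define CL_USE_DEPRECATED_OPENCL_1_2_APIS
#include <CL/cl.h>

/* Option 2: use the OpenCL 2.0 replacement API. */
command_queue = clCreateCommandQueueWithProperties(context, device_id, NULL, &ret);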

[Competitive Research] Making the most of Arm NN for GPU inference: OpenCL Tuner

Making the most of Arm NN for GPU inference: OpenCL Tuner
https://community.arm.com/developer/ip-products/processors/b/ml-ip-blog/posts/arm-nn-gpu-inference-with-opencl-tuner

OpenCL tuner

ACL implements the so-called Local Work-group Size (LWS) tuner. The idea is to improve the cache utilization at L1 and L2 levels and reduce accessing global memory as much as possible.

Figure 2 shows a basic representation of OpenCL architecture. The compute device can be a GPU, a CPU, or an accelerator. Inside the compute device we have several compute units (GPU core, CPU core, and so on). Each of them has its own L1 memory cache and can execute N threads in parallel, known as work-items. Each thread executes the same piece of code corresponding to an OpenCL kernel, where the thread Id is used to access different memory locations.

Figure 2: OpenCL architecture and memory caches. (image omitted)

  • To improve L1 memory cache utilization we want the threads of the same work-group to access consecutive memory addresses (memory coalescing).

  • To optimize L2 cache utilization, we want the compute units to reuse the same memory block.

To achieve these optimizations for the L1 and L2 memory caches, ACL implements a Local Work-group Size (LWS) tuner to find the optimal configuration for each OpenCL kernel type. For a more detailed explanation, you can read this blog and watch this presentation. The impact of the LWS tuner on inference performance can be huge: between 1.12x and 1.8x for different networks, across the three different CL Tuner modes (screenshot omitted).


The screenshots (omitted here) showed Streamline captures before (top) and after (bottom) enabling the OpenCL Tuner. Focusing on the non-fragment queue activity (the orange curve) in the GPU usage section, the highlighted intervals mark the start and end of the ML inference on the GPU. Note that with the tuner enabled, the inference interval is shorter (18 ms) than before (24 ms), i.e. a 25% improvement in inference performance. The improvement varies with the hardware and the network type. The screenshots correspond to a segmentation network running on a Mali-G72 MP12 GPU in a Unity application on a smartphone video stream.

Strange problem about index of vector variable

The correctness rate is strange: it is always 0.0000 😢

mat_mult_vec1x2_continue

// TODO: Strange, I think it's right!
__kernel void mat_mult_vec1x2_continue(const int M, const int N, const int K, __global const CL_INPUT_TYPE *a, __global const CL_INPUT_TYPE *b, __global CL_INPUT_TYPE *c) {
    const int col = get_global_id(0);
    const int row = get_global_id(0);
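    // NOTE: compared with the working vec2x2 kernel above, three things look
    // wrong here: (1) row should come from get_global_id(1), not
    // get_global_id(0); (2) col should be scaled (get_global_id(0) << 1),
    // since each work-item writes two columns; (3) the row stride of a
    // below should be K, not M (a is an MxK matrix).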

    CL_ELEM_TYPE aa, bb1, bb2, cc = 0;

    for (int p = 0; p < K; p += 2) {
        aa = *(
                  (__global CL_ELEM_TYPE *)(a + row * M + p) 
              );

        bb1 = *( 
                  (__global CL_ELEM_TYPE *)(b + p * N + col) 
              );
        bb2 = *( 
                  (__global CL_ELEM_TYPE *)(b + (p+1) * N + col) 
              );
        cc.s0 += aa.s0 * bb1.s0 + aa.s1 * bb2.s0;
        cc.s1 += aa.s0 + bb1.s1 + aa.s1 * bb2.s1;
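        // NOTE: "aa.s0 + bb1.s1" above looks like a typo for "aa.s0 * bb1.s1".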
    }
    c[row * N + col] = cc.s0;
    c[row * N + (col+1)] = cc.s1;
}

mat_mult_vec2x2_continue

// TODO: Strange! I think it's right!
__kernel void mat_mult_vec2x2_continue(const int M, const int N, const int K, __global const CL_INPUT_TYPE *a, __global const CL_INPUT_TYPE *b, __global CL_INPUT_TYPE *c) {
    const int col = get_global_id(0);
    const int row = get_global_id(1);

    CL_ELEM_TYPE aa1, aa2,
                 bb1, bb2,
                 cc1 = 0, 
                 cc2 = 0;

    for (int p = 0; p < K; p+=2) {
        aa1 = *(
                   (__global CL_ELEM_TYPE *)(a + row * M + p)
               );
        aa2 = *(
                   (__global CL_ELEM_TYPE *)(a + (row+1) * M + p)
               );

        bb1 = *(
                   (__global CL_ELEM_TYPE *)(b + p * N + col)
               );
        bb2 = *(
                   (__global CL_ELEM_TYPE *)(b + (p+1) * N + col)
               );
        cc1 = (CL_ELEM_TYPE)
                  (aa1.s0*bb1.s0 + aa1.s1*bb2.s0,    aa1.s0*bb1.s1 + aa1.s1*bb2.s1);
        cc2 = (CL_ELEM_TYPE)
                  (aa2.s0*bb1.s0 + aa2.s1*bb2.s0,    aa2.s0*bb1.s1 + aa2.s1*bb2.s1);
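        // NOTE: cc1/cc2 are overwritten on every iteration here instead of
        // accumulated (the working kernel uses "+="), so only the last p
        // contributes; col/row are also not scaled by 2, and the stride of a
        // should be K, not M.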
    }
    //*(__global CL_ELEM_TYPE *)(c + row * N + col) = cc1.s0;         *(__global CL_ELEM_TYPE *)(c + row * N + (col+1)) = cc1.s1;
    //*(__global CL_ELEM_TYPE *)(c + (row+1) * N + col) = cc2.s0;     *(__global CL_ELEM_TYPE *)(c + (row+1) * N + (col+1)) = cc2.s0;

    //*(__global CL_ELEM_TYPE *)(c + (row+1) * N + col) = cc2;

    c[row * N + col] = cc1.s0;      c[row * N + (col+1)] = cc1.s1;
    c[(row+1)*N + col] = cc2.s0;    c[(row+1)*N + (col+1)] = cc2.s1;

}

[Competitive Research] MACE and MNN OpenCL AutoTune strategies

A survey of mobile OpenCL AutoTune strategies: MACE and MNN

OpenCL Tuner

Tuners have been around for a long time: AMD's clBLAS, for example, and CLTune, presented at GTC in 2016, which has a companion BLAS library, CLBlast, whose performance is considerably better than clBLAS's. CLTune provides a comprehensive, general tuning framework; it also supports CUDA and is not tied to any single kind of kernel, and beyond global and random search it offers heuristic search strategies such as simulated annealing and particle swarm optimization. Users can integrate it into their own projects and run it either offline or online.

Model     | SoC             | Architecture | Process    | Core count | GFLOPS      | Bandwidth (GB/s) | Clock frequency | L2 cache size | Release
Mali-450  | Rockchip RK3328 | Utgard       | 40/28nm    | 1~8 cores  | 14.6 GFLOPS | ?                | 650 MHz         | 512KB         | 2012
Mali-T860 | Helio P10       | Midgard Gen4 | 28nm       | 1~16 cores | 23.8 GFLOPS | ?                | 700 MHz         | 256~2048KB    | 2015Q4
Mali-G72  | Kirin970        | Bifrost Gen2 | 16/12/10nm | 1~32 cores | 30.5 GFLOPS | ?                | 572~800 MHz     | 128~2048KB    | 2017Q2
Mali-G77  | Kirin980        | Valhall Gen1 | 7nm        | 7~16 cores | ? GFLOPS    | ?                | 850 MHz         | 512~4096KB    | 2019Q2
Mali-G78  | Kirin9000       | Valhall Gen2 | 5nm        | 7~24 cores | ? GFLOPS    | ?                | 850 MHz         | 512~2048KB    | 2020Q2

Table: Arm Mali GPU data, from the Wikipedia article Mali (GPU)

But mobile GPUs have grown much more capable in recent years. The table above gives a rough picture of mobile GPUs (data from the Wikipedia articles Mali (GPU) and Adreno (GPU)). The maximum core count supported by Arm Mali GPUs keeps rising, and FP32 multiply-accumulate throughput in particular: the Mali-450 of 2012 managed 14.6 GFLOPS, while by 2017 the Mali-G72 in the Kirin970 had reached 30.5 GFLOPS, with FP16 theoretically around 60 GFLOPS. An earlier test on the Mali-T860 ran MobileNetV1 at nearly 10 FPS.

Model      | SoC                 | Architecture                          | Process | ALUs | FP32 (GFLOPS) | Bandwidth (GB/s) | Clock frequency | On-chip memory | Release
Adreno 430 | Snapdragon 810      | Unified shader model                  | 20nm    | 256  | 256/307/332   | 25.6             | 500/600/650 MHz | 1536KB         | 2015
Adreno 506 | Snapdragon 625      | Unified shader model + unified memory | 14nm    | 96   | 115/124       | 7.4              | 600/650 MHz     | 128+8KB        | 2016
Adreno 540 | Snapdragon 835      | Unified shader model + unified memory | 10nm    | 384  | 545/567       | 29.8             | 710/739 MHz     | 1024KB         | 2016
Adreno 640 | Snapdragon 855/855+ | Unified shader model + unified memory | 7nm     | 784  | 898/1036      | 34               | 585/675 MHz     | 1024KB         | 2018

Table: Snapdragon Adreno GPU data, from the Wikipedia article Adreno (GPU)

Combining the Arm Mali data with the Adreno data, we can see that the bounds on on-chip/L2 cache sizes and the ALU counts have been increasing year by year, while clock frequency has mostly stayed around 600~800 MHz: power consumption is proportional to frequency, and clock frequency is not directly tied to compute capability (execution speed also depends on pipeline characteristics such as caches, instruction width, and so on).

Some of the architecture terminology in the Adreno GPU table above deserves explanation:

  • Unified shader model: in 3D computer graphics, the unified shader model (called "Shader Model 4.0" in Direct3D 10) refers to a form of shader hardware in a GPU in which all shader stages of the rendering pipeline (geometry, vertex, pixel, and so on) have the same capabilities: they can all read textures and buffers, and they use nearly identical instruction sets. On Adreno GPUs, OpenCL's Image2D memory path can use the L1 cache, something Arm Mali does not have;
  • Unified memory: this wiki entry points to shared memory, memory that can be accessed by multiple programs simultaneously, so as to provide communication between them and avoid copies. Arm Mali also has unified memory, but it uses cached global memory in place of local (shared) or private memory.

Also, comparing the Adreno 430 (Snapdragon 810) with the Adreno 506 (Snapdragon 625): although the former is a 4xx-series part on a 20nm process and the latter is 14nm, the Snapdragon 810 is a high-end 8-series SoC, and its GPU has 256 ALUs, more than twice the 96 of the Snapdragon 625's GPU, which also shows in the operations-per-second and bandwidth figures.

Moreover, for deep learning on mobile, the performance demands of convolution and matrix multiplication in particular are increasingly evident; even on mobile GPUs well suited to parallel computing, the appetite for performance is endless. To squeeze out the last drop of performance, everyone turns to kernel tuning, generally brute-force search combined with human experience to shrink the search space.

As mentioned above, the earliest work in this area includes clBLAS and CLBlast; among on-device AI inference frameworks, earlier examples include the ARM Compute Library and MACE. Below, the tuning strategies and thinking of mobile AI inference frameworks are introduced.


Performance loss caused by input type of kernel function

I found that there seems to be a performance loss when using different types as the input type of a kernel function. Define two kernel functions that accomplish the same thing:

function1

__kernel void global_bandwidth_vec2(const int heightA, const int widthA, __global const float *a, __global float *b) {
    const int idx = get_global_id(0);
    const int step = idx << 1;

    CL_ELEM_TYPE value = *((__global CL_ELEM_TYPE *)(a + step));
    *((__global CL_ELEM_TYPE *)(b + step)) = value;
}

function2

__kernel void global_bandwidth_vec2(const int heightA, const int widthA, __global const CL_ELEM_TYPE *a, __global CL_ELEM_TYPE *b) {
    const int idx = get_global_id(0);
    const int step = idx << 1;

    CL_ELEM_TYPE value = *((__global CL_ELEM_TYPE *)(a + step));
    *((__global CL_ELEM_TYPE *)(b + step)) = value;
}

These two functions are the same except for the input-variable type: function1 uses float, while function2 uses a macro (defined at clBuildProgram time) instead.
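
For reference, here is roughly how such a macro is passed at kernel build time (a sketch; program and device_id are placeholder names):

/* Define CL_ELEM_TYPE through the build options, like a -D compiler flag. */
const char *build_options = "-DCL_ELEM_TYPE=float2";
cl_int ret = clBuildProgram(program, 1, &device_id, build_options, NULL, NULL);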

Preferred / native vector sizes?

A good vector size is important.

$ clinfo | grep /                                                                         
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 8 / 8        (cl_khr_fp16)
    float                                                4 / 4       
    double                                               2 / 2        (cl_khr_fp64)

Support halfN precision for GPU and CPU

  1. GPU cl_khr_fp16: the correctness rate is wrong for halfN when N is bigger than 1;
  2. CPU fp16: segmentation fault when the size is bigger than 128*128, such as 256*256.

Besides, regarding the data_size variable: should I define separate data_size variables for CPU and GPU? If the same data_size variable is used for different CPU or GPU types (such as float on CPU, half on GPU), does that cause an error? (See the sketch below.)
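
A sketch of one way to handle this, assuming data_size means the buffer size in bytes: share the element count, but compute byte sizes per element type, otherwise a float-sized buffer would be bound to a half kernel.

size_t num_elems = M * N;                            /* shared element count */
size_t cpu_data_size = num_elems * sizeof(cl_float); /* fp32 buffer on CPU   */
size_t gpu_data_size = num_elems * sizeof(cl_half);  /* fp16 buffer on GPU   */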

gemm optimization for FP32

Current Performance

Without tuning the local work size, GEMM performance is as follows:

float4

  • [256,256,1] [1,1,1] p+=8 100times 1024x1024x1024 0.220724 s 9.729284 GFLOPS
  • [256,256,1] [4,4,1] p+=8 100times 1024x1024x1024 0.081261 s 26.427127 GFLOPS
  • [256,256,1] [4,4,1] p+=12 100times 1024x1024x1020 0.075207 s 28.442905 GFLOPS

half4

  • [256,256,1] [1,1,1] p+=8 100times 1024x1024x1024 0.103472 s 20.754221 GFLOPS
  • [256,256,1] [4,4,1] p+=8 100times 1024x1024x1024 0.061210 s 35.084041 GFLOPS
  • [256,256,1] [4,4,1] p+=12 100times 1024x1024x1024 0.058183 s 36.765208 GFLOPS

Since we are concentrating on FP32, half-type performance is not a concern for now (fp16 optimization will start after fp32 is done).

Search Local Work Size

Here, I search for the optimal local work size(s) for FP32-float4 using ./mat-mult/hyper-opt/. The local work size settings with the best performance (above 28 GFLOPS) for 1024x1024x1020 and 2048x2048x2040 are listed below (each result for a given local work size is the average of 100 executions):

# For 1024x1024x1020
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 32, 1} 0.074372 s 28.762086 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 32, 1} 0.074420 s 28.743674 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 32, 1} 0.073848 s 28.966047 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 16, 1} 0.073795 s 28.986889 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 16, 1} 0.074167 s 28.841656 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 16, 1} 0.074154 s 28.846475 GFLOPS

# For 2048x2048x2040
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {8, 32, 1} 0.590064 s 29.001546 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 8, 1} 0.591137 s 28.948886 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 8, 1} 0.595140 s 28.754175 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 16, 1} 0.594070 s 28.805964 GFLOPS

We can observe a pattern: generally speaking, to obtain a fast local work size, the first dimension should be a multiple of 4 and the second dimension a multiple of 16. This does not always hold, but we can be sure the local work size settings above are good ones (a launch sketch follows).
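
A minimal launch sketch (assuming the float4 kernel, so the global size is {M/4, N/4} = {256, 256}; variable names are placeholders):

/* e.g. the {8, 16, 1} local work size from the logs above */
size_t global_work_size[2] = {256, 256};
size_t local_work_size[2]  = {8, 16};
cl_int ret = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL,
                                    global_work_size, local_work_size,
                                    0, NULL, NULL);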

The performance results above are a summary. More detailed performance results are below:

1024x1024x1020

lws_calc_fp32_float4_max_256_gls_1024x1024.log

$ cat lws_calc_fp32_float4_max_256_gls_1024x1024.log | grep "1.00 CL_GPU"
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 64, 1} 0.152138 s 14.060269 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 64, 1} 0.075522 s 28.323962 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {64, 1, 1} 0.216090 s 9.899101 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 1, 1} 0.325693 s 6.567833 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 1, 1} 0.216341 s 9.887623 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {32, 1, 1} 0.216321 s 9.888505 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 1, 1} 0.216060 s 9.900455 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {128, 1, 1} 0.212686 s 10.057528 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {16, 1, 1} 0.215956 s 9.905246 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {64, 4, 1} 0.190076 s 11.253866 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 4, 1} 0.077135 s 27.731752 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 4, 1} 0.076620 s 27.918146 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {32, 4, 1} 0.076803 s 27.851552 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 4, 1} 0.076623 s 27.917206 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {16, 4, 1} 0.076525 s 27.952771 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 32, 1} 0.074372 s 28.762086 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 32, 1} 0.074420 s 28.743674 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 32, 1} 0.073848 s 28.966047 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 8, 1} 0.074966 s 28.534328 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 8, 1} 0.075136 s 28.469790 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {32, 8, 1} 0.078450 s 27.266819 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 8, 1} 0.075476 s 28.341420 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {16, 8, 1} 0.075259 s 28.423166 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 128, 1} 0.203697 s 10.501347 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {1, 16, 1} 0.073795 s 28.986889 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {4, 16, 1} 0.074167 s 28.841656 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {8, 16, 1} 0.074154 s 28.846475 GFLOPS
>>> [INFO] 1.00 CL_GPU 1024x1024x1020 {256, 256, 1} {16, 16, 1} 0.076796 s 27.854243 GFLOPS

2048x2048x2040

lws_calc_fp32_float4_max_256_gls_2048_2048.log

$ cat lws_calc_fp32_float4_max_256_gls_2048_2048.log | grep "1.00 CL_GPU"
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 64, 1} 1.618282 s 10.574650 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 64, 1} 0.606946 s 28.194880 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {64, 1, 1} 5.155221 s 3.319501 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 1, 1} 2.594596 s 6.595539 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 1, 1} 5.162974 s 3.314516 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {32, 1, 1} 5.175718 s 3.306355 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {8, 1, 1} 5.152315 s 3.321373 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {128, 1, 1} 5.209523 s 3.284900 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {16, 1, 1} 5.151586 s 3.321843 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {64, 4, 1} 1.828126 s 9.360821 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 4, 1} 0.637480 s 26.844399 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 4, 1} 0.623718 s 27.436699 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {32, 4, 1} 0.616397 s 27.762548 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {8, 4, 1} 0.611600 s 27.980329 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {16, 4, 1} 0.612205 s 27.952679 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 32, 1} 0.599180 s 28.560282 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 32, 1} 0.606689 s 28.206826 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {8, 32, 1} 0.590064 s 29.001546 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 8, 1} 0.591137 s 28.948886 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 8, 1} 0.595140 s 28.754175 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {32, 8, 1} 0.619850 s 27.607898 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {8, 8, 1} 0.597600 s 28.635798 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {16, 8, 1} 0.598308 s 28.601906 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 128, 1} 1.676186 s 10.209347 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {1, 16, 1} 0.600695 s 28.488251 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {4, 16, 1} 0.594070 s 28.805964 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {8, 16, 1} 0.598946 s 28.571441 GFLOPS
>>> [INFO] 1.00 CL_GPU 2048x2048x2040 {512, 512, 1} {16, 16, 1} 0.631024 s 27.119014 GFLOPS

Optimization for the float type. The next step is to try changing these points in gemm_interleave_trans.c:

  1. global work size
  2. the shape or load format of aI and bT
  3. the shape or store format of c
  4. try a mix of loadN / storeN / floatN
  5. refer to more kernel implementations: Add other OCL mini-projects/demos from github · Issue #1 and other GEMM implementations in ACL.

Performance between clBLAS and clBlast on AMD embedded GPU

clBLAS

The Linux build guide for clBLAS is not available in its official repo, but it can be found in the opencl-caffe repo, here; the detailed commands are below (the install-prefix flag is reconstructed from a garbled source):

cd src && mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/opt/clBLAS2.7 -DOCL_VERSION_STRING=2.0 -DCMAKE_BUILD_TYPE=Release \
      -DPRECOMPILE_GEMM_PRECISION_SGEMM:BOOL=ON -DPRECOMPILE_GEMM_TRANS_NN:BOOL=ON \
      -DPRECOMPILE_GEMM_TRANS_NT:BOOL=ON -DPRECOMPILE_GEMM_TRANS_TN:BOOL=ON \
      -DPRECOMPILE_GEMM_TRANS_TT:BOOL=ON ..

But there seem to be some problems when executing it. TODO: fix.

clBlast

[Competitive Research] TensorFlow Lite GPU OpenCL convolution selection strategy

// tensorflow/tensorflow/lite/delegates/gpu/cl/selectors/convolution_selector.cc
// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/delegates/gpu/cl/selectors/convolution_selector.cc#L40
std::unique_ptr<GPUOperation> SelectConvolution(
    const Convolution2DAttributes& attr, const BHWC& dst_shape,
    const DeviceInfo& device_info, const OperationDef& op_def,
    ModelHints hints) {
  if (device_info.IsAdreno()) {
    return SelectConvolutionAdreno(attr, dst_shape, device_info, op_def, hints);
  } else if (device_info.IsPowerVR() || device_info.IsAMD() ||
             device_info.IsIntel()) {
    return SelectConvolutionPowerVR(attr, device_info, op_def);
  } else if (device_info.IsNvidia()) {
    return SelectConvolutionNVidia(attr, dst_shape, device_info, op_def);
  } else if (device_info.IsMali()) {
    return SelectConvolutionMali(attr, dst_shape, device_info, op_def);
  } else {
    return SelectConvolutionAdreno(attr, dst_shape, device_info, op_def, hints);
  }
}

std::unique_ptr<GPUOperation> SelectConvolutionForWinograd(
    const Convolution2DAttributes& attr, const BHWC& dst_shape,
    const DeviceInfo& device_info, const OperationDef& op_def,
    ModelHints hints) {
  if (device_info.IsAdreno()) {
    return SelectConvolutionWinogradAdreno(attr, dst_shape, device_info, op_def,
                                           hints);
  } else if (device_info.IsPowerVR() || device_info.IsAMD() ||
             device_info.IsNvidia() || device_info.IsIntel()) {
    ConvPowerVR conv =
        CreateConvPowerVRWino4x4To6x6(device_info, op_def, attr, &dst_shape);
    return absl::make_unique<ConvPowerVR>(std::move(conv));
  } else if (device_info.IsMali()) {
    return SelectConvolutionWinogradMali(attr, dst_shape, device_info, op_def);
  } else {
    return SelectConvolutionWinogradAdreno(attr, dst_shape, device_info, op_def,
                                           hints);
  }
}

std::unique_ptr<GPUOperation> SelectConvolutionWithDynamicWeights(
    const Convolution2DAttributes& attr, const BHWC& weights_shape,
    const BHWC& dst_shape, const DeviceInfo& device_info,
    const OperationDef& op_def, ModelHints hints,
    ConvWeightsDescription* weights_desc) {
  if (device_info.IsAdreno()) {
    return SelectConvolutionDynamicWeightsAdreno(attr, weights_shape, dst_shape,
                                                 device_info, op_def, hints,
                                                 weights_desc);
  } else if (device_info.IsMali()) {
    return SelectConvolutionDynamicWeightsMali(attr, weights_shape, dst_shape,
                                               device_info, op_def, hints,
                                               weights_desc);
  } else {
    ConvPowerVR conv = CreateConvPowerVRDynamicWeights(
        device_info, op_def, attr, weights_shape, &dst_shape);
    *weights_desc = conv.GetConvWeightsDescription();
    return absl::make_unique<ConvPowerVR>(std::move(conv));
  }
}

[Troubleshooting] Diagnosing and fixing a GPU memory leak

Background: on an Honor 30 Pro phone (Mali-G76), the top command in adb shell showed clear memory growth when the same model was run repeatedly. The investigation went as follows:

  1. User API level: model loading, input setting, output fetching, and predictor execution were each looped 1,000,000 times; the growth was localized to predictor Run, which involves the framework internals;
  2. Ruling out special ops: both non-production and production models showed memory growth, so it is unrelated to the production model's special ops;
  3. Splitting the original model into small models: a one-layer model with only conv and relu, generated from the model-building script and converted to a deployment model, also showed memory growth; possibly related to io_copy or layout conversion;
  4. Unit tests: the conv/act unit tests, which involve no io_copy or layout conversion ops, also showed memory growth;
  5. Inside the conv implementation: PrepareForRun, Run, and ReInitWhenNeeded were each put in 1,000,000-iteration loops to narrow down the leak; it was localized to Run. Then, analyzing Run line by line and looping each line, the leak was pinned to the command queue's enqueueNDRangeKernel, which clearly leaks memory over many iterations;
  6. Checking whether the framework itself introduced the problem: command_queue is an object, not a pointer; TNN calls the same C++ OpenCL API and also leaks (after many runs), and further checks showed MNN/TFLite/TNN all leak as well;
  7. Conclusion: this is a GPU driver bug, i.e., this OpenCL driver's enqueueNDRangeKernel leaks memory.

Solutions:

  1. Upgrading the phone's system fixed it: before the upgrade the OpenCL driver version was OpenCL 2.0 v1.r18p0-01rel0.1e0ebad9d712581fe1bdcd515a8fdf2a; after the upgrade it was OpenCL 2.0 v1.r18p0-01rel0.8e7096aeb046d5aec7d5b43472e6bf72;
  2. If the system cannot be upgraded, extend the user-side OpenCL pre-check API is_opencl_valid() with a blacklist: in addition to the existing checks (shared-library location, symbol completeness, fp16 support), also check the OpenCL driver version;
  3. Impact: solution 2 works at the framework level; the problem can also be pre-checked by business-layer logic and combined with the existing business-layer blacklist. Since this is a memory leak, it only manifests when models run continuously, eventually exhausting memory and rebooting the phone. Short-running, non-long-duration applications are unaffected and can ignore it; long-running real-time workloads such as video segmentation are seriously affected.

[Annual Summary] 2020 OpenCL work summary

OpenCL business support

LiteKit mobile OpenCL and CPU models (open-sourced)

Two GPU models (gesture detection and super-resolution) and one CPU model (portrait segmentation):

  1. Added OpenCL pre/post-processing and uint8 support to the video super-resolution model; PR details: PaddlePaddle/Paddle-Lite#3049
  2. Compared the super-resolution model's memory footprint in detail between Lite and Mobile: with memory reuse enabled, the larger the input, the more physical memory is saved, roughly 50%+; see "Super-resolution models and inter-layer memory reuse" http://agroup.baidu.com/paddle-infer/view/office/2539063
  3. Added OpenCL multi-layer memory reuse for the super-resolution model: PaddlePaddle/Paddle-Lite#3077
  4. Super-resolution GPU model crashed at execution time: fixed by switching to a model version matching the library; the library and model versions were mismatched, the model being too old.

Baidu App Lens mobile OpenCL and CPU business (launched)

  1. Fixed a crash in lens_mnasnet caused by kernel_func_name_ not being found, PaddlePaddle/Paddle-Lite#3085
  2. Fixed a segfault on Qualcomm phones when loading the lens mnasnet model. Cause: the ops baked into the model were incompatible with the library implementation
  3. Found that model loading crashed due to mismatched header files; cause: iOS and Android header files cannot be shared
  4. Old int16 models crashed at load time in the new Lite when dequantizing to fp32; fixed by re-saving and re-converting the model with Paddle 1.8
  5. The conv-conv pass crashed when converting the lens_saoma_yolonano CPU model, because the fusion condition was not checked early enough; handed to the owner to fix.

Gender-swap GAN mobile OpenCL model (not launched; algorithm not ready)

  1. Fixed the layout pass failing to find a matching layout kernel during model conversion;
  2. Fixed/added the missing OpenCL kernels for tanh and exp;
  3. Fixed an io_copy error;

Anti-porn/anti-violence mobile OpenCL model (not launched; algorithm not ready)

  1. Fixed an error loading fc weights in int16 OpenCL models
  2. Fixed a CL_INVALID_BUFFER_SIZE error during execution

Image inpainting mobile OpenCL model

  1. Added support for conv3x3 with groups, which was previously unsupported;
  2. Fixed wrong concat results and submitted a PR; the concat unit test produced wrong results on particular devices;
  3. On a particular device (Snapdragon 625), the network computed wrong results: part of the grouped conv3x3 results failed to be written to memory, because old OpenCL driver versions do not support arrays of vectors;

Infoflow portrait segmentation mobile OpenCL model (preparing to launch)

  1. Added a hard_sigmoid operator for portrait segmentation, improving performance from 37 ms to 26 ms;
  2. Fixed a bug where the dropout registration was missing Mask;
  3. Localized a memory leak on a specific Honor model (Honor V30) to a driver memory-leak bug;

Infoflow portrait segmentation PC OpenCL model (preparing to launch)

  1. Added Mac support for the FP32 portrait segmentation model: PaddlePaddle/Paddle-Lite#4827, PaddlePaddle/Paddle-Lite#4757

[Competitive Research] A brief look at TensorFlow Lite GPU OpenCL WorkGroup TuningType strategies

TensorFlow Lite's GPU performance has improved noticeably of late. It initially supported compute via OpenGL, presumably for GL's broader compatibility (different GPU generations, old and new library versions); the later OpenCL backend was presumably driven mainly by compute performance, and indeed the OpenCL support and performance of TFLite's competitors, such as MACE / Paddle-Mobile / MNN / TNN, cannot be ignored.

Any deep dive into OpenCL eventually reaches GPU kernel tuning techniques and strategies. Reading TensorFlow Lite's GPU code, the GPU/CL part contains the file tuning_parameters.h:

// tensorflow/tensorflow/lite/delegates/gpu/cl/kernels/tuning_parameters.h
// https://github.com/tensorflow/tensorflow/blob/465aeca04268f6e19d5f845610cc7ccaf03f5b8d/tensorflow/lite/delegates/gpu/cl/kernels/tuning_parameters.h
enum class TuningType { EXHAUSTIVE, FAST };

struct TuningParameters {
  ProfilingCommandQueue* queue;
  const DeviceInfo* info;
  TuningType tuning_type = TuningType::EXHAUSTIVE;
};

ProfilingCommandQueue is declared as class ProfilingCommandQueue : public CLCommandQueue; on top of the parent class CLCommandQueue it adds OpenCL kernel timing and best-work-group selection (GetBestWorkGroupIndex), among other methods.

Beyond that, several other directories and files look related to performance tuning:

  1. device_info.cc: defines information about different GPU models, with fine-grained distinctions per hardware to guide later tuning:
    1. Qualcomm Adreno: generations are distinguished in a 4xx/6xx style, and MaxWaveCounts, the compute unit's RegisterMemSize, and WaveSize are differentiated. MaxWaveCounts is fixed at 30 for the Adreno 640 and 16 for other 6xx parts; RegisterMemSize is 128*144*16 for the Adreno 640 and 128*96*16 for other 6xx parts; WaveSize is unsupported below the 400 series, is 64 or 32 below the 600 series depending on full_wave, and is likewise 128 or 64 for other models depending on full_wave;
    2. Arm Mali: distinguished by model (T6xx, T7xx, T8xx series) and by architecture: Midgard (T6xx, T7xx, T8xx), Valhall (G57, G77), Bifrost Gen1 (G31, G51, G71), Bifrost Gen2 (G52, G72), and Bifrost Gen3 (G76);
    3. Others: PowerVR, NVIDIA, and other GPUs are covered too, but without such fine-grained distinctions.
  2. work_group_picking.cc / work_group_picking.h: detailed below; mainly implements the two tuning_type strategies above, plus related helper functions;
  3. tuning_parameters.h / tuning_parameters.cc;
  4. inference_context.cc: InitFromGraphWithTransforms defines the flow that converts a model from the original graph into a graph suitable for GPU execution, in roughly three steps:
    1. Device-related: obtain the context/device/queue/program cache, detect the GPU type (Mali/PowerVR), and decide whether flushing is needed and at what interval (Mali requires a manually set flush interval, PowerVR does not; note that clFlush dispatches all enqueued commands to the device without guaranteeing their completion);
    2. Model-related: convert the GPU graph, allocate GPU resources (e.g., upload the model weights to the GPU), and release the CPU model representation node by node;
    3. Tuning-related: initialize TuningParameters (profiling queue, device info, TuningType), then run GPUOperation::Tune node by node to set each work group.
  5. gpu_operation.h / gpu_operation.cc: implements GPUOperation::Tune, which contains the whole tuning flow. At present only one work group is chosen per call rather than selecting among many; that is, large-scale, batch work-group tuning for a given operation is not supported yet.

Let's look at the Tune method first. On entry it fetches the kernel work groups, note, into a candidate list std::vector<int3> possible_work_groups, and then, in the else case of the final if-else branch, picks best_work_group_index among them as the operation's final work group.

// delegates/gpu/cl/kernels/gpu_operation.cc
// https://github.com/tensorflow/tensorflow/blob/b14150088dac1924cf0482f6e456332b3e6211ff/tensorflow/lite/delegates/gpu/cl/kernels/gpu_operation.cc

absl::Status GPUOperation::Tune(const TuningParameters& params) {
  std::vector<int3> possible_work_groups;
  GetPossibleKernelWorkGroups(params.tuning_type, *params.info, kernel_.info_,
                              &possible_work_groups);
  if (possible_work_groups.empty()) {
    return absl::NotFoundError(
        "Can not found work_group size to launch kernel");
  }
  if (possible_work_groups.size() == 1) {
    work_group_size_ = possible_work_groups[0];
    return absl::OkStatus();
  } else {
    RETURN_IF_ERROR(args_.Bind(kernel_.kernel()));
    int best_work_group_index;
    RETURN_IF_ERROR(params.queue->GetBestWorkGroupIndex(
        kernel_, *params.info, grid_size_, possible_work_groups,
        &best_work_group_index));
    work_group_size_ = possible_work_groups[best_work_group_index];
    return absl::OkStatus();
  }
}

void GPUOperation::GetPossibleKernelWorkGroups(
    TuningType tuning_type, const DeviceInfo& device_info,
    const KernelInfo& kernel_info, std::vector<int3>* work_groups) const {
  GetPossibleWorkGroups(tuning_type, device_info, kernel_info, grid_size_,
                        work_groups);
}

The Tune flow is unavoidable: every GPU operation node runs Tune, it is not optional. That is, GetPossibleKernelWorkGroups is always entered and yields one or more candidate work groups, i.e. std::vector<int3> possible_work_groups. With multiple work groups, the best one is chosen via GetBestWorkGroupIndex in the else branch of the subsequent if-else; with exactly one, it returns directly; with none, it fails with absl::NotFoundError("Can not found work_group size to launch kernel");

1. Finding candidate work groups

We know that when an OpenCL kernel is fixed and cannot be changed, performance depends heavily on the work-group setting strategy. GPUOperation::GetPossibleKernelWorkGroups implements a carefully designed strategy: it calls GetPossibleWorkGroups, which we may call the generic strategy, serving as the default inherited by every GPU operation. Besides inheriting it, GPU operation subclasses may also implement custom strategies; currently there is a method named GetPossibleWorkGroupsConv, and this Conv-suffixed strategy is called by the following conv implementations:

  1. conv_buffer_1x1.cc
  2. convolution_transposed_3x3.cc
  3. convolution_transposed.cc
  4. conv_powervr.cc

So the callers of the Conv-suffixed method are mainly conv_buffer_1x1 and convolution_transposed, plus conv_powervr, which was written for PowerVR (though conv_general might be a more fitting name, since GPUs of other architectures, AMD/Intel and Adreno among them, also use it in some cases). Also, among the files in this directory containing the Conv keyword, several conv variants do not call the Conv-suffixed work-group strategy, including depthwise_conv, conv_texture, conv_constant, and conv_3d. In other words, the special GetPossibleWorkGroupsConv was added in a targeted way.

Next, let's dive into GetPossibleWorkGroupsConv and GetPossibleWorkGroups to see their strategies and how they can be categorized.

// tensorflow/tensorflow/lite/delegates/gpu/cl/kernels/work_group_picking.cc
// https://github.com/tensorflow/tensorflow/blob/ee2c2d17814c015477041dcafed0c9c7f1f00162/tensorflow/lite/delegates/gpu/cl/kernels/work_group_picking.cc#L272

void GetPossibleWorkGroups(TuningType tuning_type,
                           const DeviceInfo& device_info,
                           const KernelInfo& kernel_info, const int3& grid,
                           std::vector<int3>* work_groups) {
  switch (tuning_type) {
    case TuningType::FAST:
      work_groups->push_back(
          GetWorkGroup(grid, kernel_info.max_work_group_size));
      return;
    case TuningType::EXHAUSTIVE: {
      GetWorkGroupsAlignedToGrid(device_info, kernel_info, grid, work_groups);
      return;
    }
    default:
      work_groups->push_back({8, 4, 1});
      return;
  }
}

void GetPossibleWorkGroupsConv(TuningType tuning_type,
                               const DeviceInfo& device_info,
                               const KernelInfo& kernel_info, const int3& grid,
                               std::vector<int3>* work_groups) {
  switch (tuning_type) {
    case TuningType::FAST: {
      int max_z_size = 16;
      if (device_info.IsAdreno()) {
        max_z_size = device_info.IsAdreno3xx() ? 16 : 64;
      }
      max_z_size = std::min(max_z_size, device_info.max_work_group_size_z);
      work_groups->push_back(
          GetWorkGroupConv(grid, kernel_info.max_work_group_size, max_z_size));
      return;
    }
    case TuningType::EXHAUSTIVE: {
      GetWorkGroupsAlignedToGrid(device_info, kernel_info, grid, work_groups);
      return;
    }
    default:
      work_groups->push_back({8, 4, 1});
      return;
  }
}

Having looked into GetPossibleWorkGroupsConv and GetPossibleWorkGroups (see the code above), the candidate work-group cases and strategies divide as follows:

  1. Case 1: generic work-group strategy (GetPossibleWorkGroups)
    1. TuningType::FAST -> GetWorkGroup
    2. TuningType::EXHAUSTIVE -> GetWorkGroupsAlignedToGrid
    3. default: <8,4,1>
  2. Case 2: work-group strategy for some conv ops (GetPossibleWorkGroupsConv)
    1. TuningType::FAST -> GetWorkGroupConv (the only point where case 1 and case 2 differ)
    2. TuningType::EXHAUSTIVE -> GetWorkGroupsAlignedToGrid
    3. default: <8,4,1>

In summary, setting aside the default case, there are three tuning strategies: generic FAST, generic EXHAUSTIVE, and non-generic FAST; each implementation is analyzed below.

elementwise_mul

#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void elementwise_mul(__global image2d_t input, __global image2d_t bias,
                              __write_only image2d_t outputImage) {
  int x = get_global_id(0);
  int y = get_global_id(1);
  const sampler_t sampler =
      CLK_NORMALIZED_COORDS_TRUE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
  int2 coords;
  coords.x = x;
  coords.y = y;
  half4 in = read_imageh(input, sampler, coords);
  half4 biase = read_imageh(bias, sampler, coords);
  half4 output = in * biase;
  write_imageh(outputImage, coords, output);
}

__kernel void channel_mul(__global image2d_t input, __global image2d_t bias,
                          __write_only image2d_t outputImage, int w) {
  int x = get_global_id(0);
  int y = get_global_id(1);
  const sampler_t sampler =
      CLK_NORMALIZED_COORDS_TRUE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
  int2 coords;
  coords.x = x;
  coords.y = y;
  int2 coords_bias;
  coords_bias.x = x / w;
  coords_bias.y = 0;
  half4 in = read_imageh(input, sampler, coords);
  half4 biase = read_imageh(bias, sampler, coords_bias);
  half4 output = in * biase;
  write_imageh(outputImage, coords, output);
}

// etc : 1 1 1 72
// run time Y  [value,0,0,0] * 72
__kernel void channel_mul_d2(__global image2d_t input, __global image2d_t bias,
                             __write_only image2d_t outputImage, int w) {
  int x = get_global_id(0);
  int y = get_global_id(1);
  const sampler_t sampler =
      CLK_NORMALIZED_COORDS_TRUE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
  int2 coords;
  coords.x = x;
  coords.y = y;

  int2 coords_bias0;
  int2 coords_bias1;
  int2 coords_bias2;
  int2 coords_bias3;

  coords_bias0.x = x / w * 4;
  coords_bias0.y = 0;

  coords_bias1.x = x / w * 4 + 1;
  coords_bias1.y = 0;

  coords_bias2.x = x / w * 4 + 2;
  coords_bias2.y = 0;

  coords_bias3.x = x / w * 4 + 3;
  coords_bias3.y = 0;

  half4 biase0 = read_imageh(bias, sampler, coords_bias0);
  half4 biase1 = read_imageh(bias, sampler, coords_bias1);
  half4 biase2 = read_imageh(bias, sampler, coords_bias2);
  half4 biase3 = read_imageh(bias, sampler, coords_bias3);

  half4 biase = {biase0.x, biase1.x, biase2.x, biase3.x};
  half4 in = read_imageh(input, sampler, coords);
  half4 output = mad(in, biase, 0);
  write_imageh(outputImage, coords, output);
}

__kernel void channel_mul_d4(__global image2d_t input, __global image2d_t bias,
                          __write_only image2d_t outputImage, int w) {
  int x = get_global_id(0);
  int y = get_global_id(1);
  const sampler_t sampler =
      CLK_NORMALIZED_COORDS_TRUE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
  int2 coords;
  coords.x = x;
  coords.y = y;
  int2 coords_bias;
  coords_bias.x = x / w;
  coords_bias.y = 0;
  half4 in = read_imageh(input, sampler, coords);
  half4 biase = read_imageh(bias, sampler, coords_bias);
  half4 output = in * biase;
  write_imageh(outputImage, coords, output);
}

Define half16 as input-variable-type for kernel function

This code produces error code (-14) when executed; ref: #6

__kernel void global_bandwidth_vec16_v2(const int heightA, const int widthA, __global half16 *a, __global half16 *b) {
    const int idx = get_global_id(0);
    const int step = idx << 4;
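    // NOTE: a is already a half16 pointer, so "a + step" below advances
    // step half16 elements; with step = idx << 4 this over-indexes by 16x
    // and reads out of bounds, which may explain the error. "a + idx" is
    // probably what was intended.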

    half16 value = *(a + step);
    *(b + step) = value;
}

[Tool Survey] Arm Streamline Performance Analyzer & Arm Mali GPU datasheet

You can also apply for a trial (Try For Free). This survey focuses on collecting OpenCL performance data on Mali GPUs.

Introduction to Streamline

Streamline helps you optimize software for devices that use Arm processors.

Evaluate where the software in your system spends most of its time by capturing a performance profile of your application running on a target device. Quickly determine whether your performance bottleneck relates to the CPU processing or GPU rendering using interactive charts and comprehensive data visualizations.

For CPU bottlenecks, use the native profiling functionality to locate specific problem areas in your application code. Investigate how processes, threads, and functions behave, from high-level views, right down to line-by-line source code analysis. The basic profile is based on regular sampling of the PC (Program Counter) of the running threads, allowing identification of the hotspots in the running application. Hardware performance counters that are provided by the target processors can supplement this analysis. These counters enable hotspot analysis to include knowledge of hardware events such as cache misses and branch mispredictions.

For GPU bottlenecks, use performance data from the Arm Mali GPU driver and hardware performance counters to explore the rendering workload efficiency. Visualize the workload breakdown, pipeline loading, and execution characteristics to quickly identify where to apply rendering optimizations.

With Streamline, you can:

  • Find hot spots in your code to be targeted for software optimization.
  • Identify the processor that is the major bottleneck in the performance of your application.
  • Use CPU performance counters to provide insights into L1 and L2 cache efficiency, enabling cache-aware profiling.
  • Identify the cause of heavy rendering loads which cause poor GPU performance, and use GPU performance counters to identify workload inefficiencies.
  • Reduce device power consumption and improve energy efficiency by optimizing workloads using performance counters from the CPU, GPU, and memory system.

Benchmark for various types (floatN, intN, halfN, doubleN, shortN) using a naive implementation

NV Card

A preliminary benchmark on an NVIDIA card:

  • Machine: linux-W580-G20, x86_64-linux-gnu
  • Device: TITAN X (Pascal), 12189MiB
  • M: 1024; N: 1024; K: 1024

Naive Kernel

  • float; 1024 1024 1024; 10; 353.967207 GFLOPS
  • int; 1024 1024 1024; 10; 356.647841 GFLOPS

__kernel void mat_mult_naive(const int M, const int N, const int K, __global const CL_INPUT_TYPE *a, __global const CL_INPUT_TYPE *b, __global CL_INPUT_TYPE *c) {
    const int col = get_global_id(0);
    const int row = get_global_id(1);
    CL_ELEM_TYPE res = 0;

    for (int p = 0; p < K; p++) {
        res += a[row * K + p] * b[p * N + col];  // a is M x K row-major, so the row stride is K
    }
    c[row * N + col] = res;
}
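
CL_INPUT_TYPE and CL_ELEM_TYPE are not OpenCL built-ins; presumably they are macros injected through the program build options so that one kernel source serves every benchmarked type. A minimal host-side sketch (the exact option string is an assumption):

/* Rebuild the same source once per benchmarked type. */
const char *options = "-DCL_INPUT_TYPE=float -DCL_ELEM_TYPE=float";
cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);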

[bandwidth] Bandwidth for typeN, compared with the clpeak results

Beforehand, set the maximum frequency for the GPU and CPU using the scripts under tools in this repo.

  1. Calculate the bandwidth for typeN: intN, floatN, halfN (see the sketch after this list);
  2. Compare with the clpeak results.
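
For step 1, the usual arithmetic turns the copy kernel's run time into effective bandwidth; a minimal helper, assuming each element is read once and written once (names are mine):

/* N elements, each read once and written once. */
double bandwidth_gbps(size_t elems, size_t elem_bytes, double seconds) {
    double bytes_moved = 2.0 * (double)elems * (double)elem_bytes;
    return bytes_moved / seconds / 1.0e9;  /* GBPS, matching clpeak's unit */
}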

clpeak:

Platform: ARM Platform
  Device: Mali-T860
    Driver version  : 1.2 (Linux ARM64)
    Compute units   : 4
    Clock frequency : 800 MHz

    Global memory bandwidth (GBPS)
      float   : 3.84
      float2  : 6.00
      float4  : 7.33
      float8  : 6.01
      float16 : 5.78

    Single-precision compute (GFLOPS)
      float   : 22.86
      float2  : 44.68
      float4  : 44.51
      float8  : 41.46
      float16 : 46.16

    half-precision compute (GFLOPS)
      half   : 22.83
      half2  : 46.46
      half4  : 93.96
      half8  : 92.44
      half16 : 69.40

    Double-precision compute (GFLOPS)
      double   : 3.60
      double2  : 3.54
      double4  : 20.92
      double8  : 20.60
      double16 : 20.35

    Integer compute (GIOPS)
      int   : 20.26
      int2  : 49.72
      int4  : 47.51
      int8  : 48.96
      int16 : 41.47

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 4.06
      enqueueReadBuffer          : 2.17
      enqueueMapBuffer(for read) : 2015.28
        memcpy from mapped ptr   : 2.18
      enqueueUnmap(after write)  : 5406.56
        memcpy to mapped ptr     : 2.23

    Kernel launch latency : 78.36 us

Common Error Q&A

GPU advantages

At the CPU's highest frame rate, the GPU consumes only half the power of the CPU. This comes from An Independent Evaluation of Implementing Computer Vision Functions with OpenCL on the Qualcomm Adreno 420 | Berkeley Design Technology, Inc., July 2015. The original text:

Qualcomm has reported that the GPU mode of the demo consumes half as much power as the CPU mode when throttling the frame rate of the GPU mode to match the highest frame rate achieved in the CPU mode.

In fact the key points of this Adreno 430-based article are that the algorithm implementation must maximize parallelism and fit the GPU's memory system and core architecture. It discusses the following:

  1. Minimize memory copies between the GPU and CPU. The Snapdragon 855 uses the Adreno 640 GPU, for which OpenCL-Z reports:
    • Host to Device: 10.51 GByte/s
    • Device to Host: 4.54 GByte/s
    • Device to Device: 23.12 GByte/s
      In other words, avoid bouncing between the CPU and GPU when a model's layers run in sequence; in particular, when the next layer's feature map is large, still handing the computation to the GPU may not pay off, because downloading the data is slow (see the zero-copy sketch after this list);
  2. Carefully manage the limited fast local memory (Local Memory).
  3. Even in a high-level language such as OpenCL, you must grasp the GPU's core architecture and shape the code to match it: for example, minimize branching and pick the most suitable SIMD data types.
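
For point 1, on shared-memory mobile SoCs the copies can often be avoided altogether by mapping a host-allocated buffer instead of enqueueing explicit reads and writes; a minimal sketch, assuming ctx, queue, and nbytes already exist:

/* Let the driver place the buffer in host-visible memory (zero-copy style). */
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                            nbytes, NULL, &err);
/* Map for CPU writing; on shared-memory devices no copy takes place. */
void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE, 0, nbytes,
                               0, NULL, NULL, &err);
/* ... fill ptr with the next layer's input on the CPU ... */
clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);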

CL_INVALID_KERNEL_ARGS

Per the clEnqueueNDRangeKernel spec: CL_INVALID_KERNEL_ARGS if the kernel argument values have not been specified.

clEnqueueNDRangeKernel
https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html
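
In practice this error means some kernel argument was never set: every argument index from 0 to n-1 must get a clSetKernelArg call before the kernel is enqueued. A minimal host-side sketch (kernel, queue, and the cl_mem objects are assumed to exist):

/* Every argument index must be set before enqueueing the kernel. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &b);  /* skipping an index yields CL_INVALID_KERNEL_ARGS */
size_t gws[2] = {1024, 1024};
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, gws, NULL, 0, NULL, NULL);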

printf_buffer_metadata corrupt!

printf("===== a:%d\n", a8x4[0].s0);

The variable a8x4 is of a float vector type, but the %d format specifier was used; it should be corrected as below:

printf("===== a:%f\n", a8x4[0].s0);

Debug

There is a printf function available, which is very convenient, and it can print vectors too; the khronos.org documentation for this printf function is identical in OpenCL 1.2 and 2.0. To print a vector: printf("f4 = %2.2v4hlf\n", f);, where f is of type float4.

So far the GPUs that can actually print are mainly Qualcomm Snapdragon SoCs, though with exceptions even there: on what appeared to be a Snapdragon 410 GPU, adding printf made the run hang in an ADB shell, and commenting the printf out avoided it. Perhaps the 410 does not support printf? Unconfirmed. Mali, however, cannot print at all.
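
A minimal sketch of vector printing, guarded to a single work-item so the output stays readable (the kernel name and guard are mine):

__kernel void debug_print(__global const float4 *data) {
  if (get_global_id(0) == 0) {       // print from one work-item only
    float4 f = data[0];
    printf("f4 = %2.2v4hlf\n", f);   // %v4hlf formats a 4-component float vector
  }
}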


  • When printing from a kernel, wrap the printing code in a macro such as PRINT_KERNEL. This gives two easily switched modes: benchmark and debug;
  • When results come out wrong, since the kernel is multithreaded and the real size is large, loop each size from 1 upward to find the smallest failing case to investigate. This matters most for kernels implemented with Image2D: a small failing case is easier to handle, and it helps to write CPU code that simulates Image2D so you can locate, at that small failing size, the index mapping between the Image2D and Buffer forms. For example, if the first failing thread's global_work_size id is (0,0,1), use the CPU simulation to find the corresponding Buffer-form index;
  • Set the unit test's loop bounds from the kernel's tile size. For example, if each thread of a GEMM kernel computes a 4x8 block of the C matrix, run the tests up to (4 + 3) x (8 + 7), just past one tile, as the upper bound (covering both below- and beyond-boundary cases), incrementing by 1 each time. Sizes this small are also easy to reproduce and hand-check when hunting bugs later;
  • In unit tests, initialize the random values as integers; they are easy to verify by hand;
  • In unit tests, to rule out small-size errors, print every thread from the kernel, i.e. the global_id and the parameter value each component stands for, such as row, col, and batch_idx;
  • In unit tests, add a LOOP_TEST macro to sweep all sizes from smallest to largest, and in the #else branch set up a check for a single failing bad case;
  • While writing a kernel, keep the most naive version around for self-checking. When hunting a bug, you can also splice the known-good code into the broken kernel, guarded by #if 0 and #else, to determine which part is wrong; the same method works for debugging a CPU kernel;
  • Incorrect loads caused by boundaries do no harm as long as boundary writes are handled properly when storing. For example, in the matrix multiplication c = a * b with column-major storage, if the column count of a or b is less than 4 while loads use vload4 (4 elements at a time), then in float4 b4x4[4] the components b4x4[0].s1, b4x4[0].s2, and b4x4[0].s3 of the first element b4x4[0] load garbage; matrices b and c have the same issue. In practice it does not matter, because the store path handles the boundary and never writes those values into the result;
  • When an error shows up, unconditional printing, especially at real workload sizes, floods the output from every thread and is hard to sift through. Add an if-else so that only one chosen thread prints, which narrows the problem down (and makes hand-checking easier).
    • Scenario: results are correct with the PRINT_KERNEL macro enabled but partly wrong with it disabled. The guess was a memory overwrite; printing addresses confirmed that two different threads wrote to the same address (two distinct threads printed the same address; the cause was an insufficient upper-bound condition in the for loop that walks the boundary while storing the result matrix);
        #ifdef PRINT_KERNEL
        if (row == 0 && col == 0 && bidx == 0) {
            for (int i = 0; i < 8; ++i) {
                printf("row = col = bidx = 0 initialize c8x4[%d] = %2v4hlf\n", i, c8x4[i]);
            }
        }
        #endif

More techniques are described in How to debug — MACE documentation.

Performance

  • Local work size on Adreno: debugging showed essentially no performance difference on MobileNetV1 between the default NullRange and my {16, 16} setting; worth investigating further (see the sketch after this list);
  • Try permuting the order of the three global work size values and watch how performance changes; worth searching around.
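
For the first point, the comparison boils down to passing NULL (what the C++ API calls cl::NullRange) versus an explicit array as local_work_size; a minimal host-side sketch:

size_t gws[2] = {1024, 1024};
size_t lws[2] = {16, 16};
/* Driver-chosen work-group size (the NullRange default): */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, gws, NULL, 0, NULL, NULL);
/* Explicit 16x16 work-groups: */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, gws, lws, 0, NULL, NULL);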

Buffer Vs. Image


OCL error: implicit declarations are not allowed

>>> [ERROR] kernel build log: <source>:24:5: error: implicit declarations are not allowed
    AddDot(K, &a[0 * M + row], M, &b[(col+0) * N + 0], &c[(col+0)*N + row]);
    ^

<source>:25:5: error: implicit declarations are not allowed
    AddDot(K, &a[0 * M + row], M, &b[(col+1) * N + 0], &c[(col+1)*N + row]);
    ^

<source>:26:5: error: implicit declarations are not allowed
    AddDot(K, &a[0 * M + row], M, &b[(col+2) * N + 0], &c[(col+2)*N + row]);
    ^

<source>:27:5: error: implicit declarations are not allowed
    AddDot(K, &a[0 * M + row], M, &b[(col+3) * N + 0], &c[(col+3)*N + row]);
    ^

error: Compiler frontend failed (error code 59)
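
OpenCL C follows C99: a function must be declared before it is used, so the fix is to define or forward-declare AddDot above the kernel that calls it. A sketch, with the prototype inferred from the call sites in the log (argument names and the body are assumptions):

// Declare (or define) the helper before its first use:
void AddDot(const int K, __global const float *x, const int incx,
            __global const float *y, __global float *gamma);

void AddDot(const int K, __global const float *x, const int incx,
            __global const float *y, __global float *gamma) {
  // Strided dot product accumulated into *gamma.
  for (int p = 0; p < K; p++) {
    *gamma += x[p * incx] * y[p];
  }
}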

Optimize matrix transpose

This is a key step before matrix multiplication.

Resume the previous matrix transpose program first, then follow the guide below.

The matrix transpose program has been resumed.
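
For reference, a naive transpose kernel in the spirit of the repo's mat-transpose demo might look like this (names and the row-major layout are assumptions; one element per work-item):

__kernel void mat_transpose_naive(const int rows, const int cols,
                                  __global const float *src,
                                  __global float *dst) {
  const int col = get_global_id(0);
  const int row = get_global_id(1);
  if (row < rows && col < cols) {
    dst[col * rows + row] = src[row * cols + col];
  }
}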

Benchmark ACL and analyse its optimization strategy

Benchmark

// Our GEMM, without the ACL strategies
float4: 1024x1024x1024 0.259872 s  8.263633 GFLOPS
half4:  1024x1024x1024 0.145462 s 14.763193 GFLOPS
// ACL:
FP32:   1024x1024x1024 0.084823 s 25.3419 GFLOPS
FP16:   1024x1024x1024 0.039247 s 54.7705 GFLOPS
// OpenBLAS (1 OMP thread) on a Cortex-A72
FP32:   1024x1024x1024 0.193891 s 11.075726 GFLOPS
FP16:   not supported

Strategies that ACL uses and we don't:

  1. Transposing matrix B and interleaving matrix A (according to the comments in the code, this reshapes A and provides blocking);
  2. Using native built-in functions: vload, vstore, fma, etc.

ACL GEMM

#if defined(COLS_B) && defined(ALPHA)
/** This OpenCL kernel is optimised for Midgard. It computes the matrix multiplication between matrix A (src0) and matrix B (src1)
 *  Matrix A and matrix B must be reshaped respectively with @ref gemm_interleave4x4_32bit and @ref gemm_transpose1x4 before running the matrix multiplication
 *
 * @attention The width of matrix B and the alpha's value need to be passed at compile time using -DCOLS_B and -DALPHA
 *
 * @param[in]  src0_ptr                           Pointer to the source matrix. Supported data types: F32
 * @param[in]  src0_stride_x                      Stride of the source matrix in X dimension (in bytes)
 * @param[in]  src0_step_x                        src_stride_x * number of elements along X processed per workitem(in bytes)
 * @param[in]  src0_stride_y                      Stride of the source matrix in Y dimension (in bytes)
 * @param[in]  src0_step_y                        src_stride_y * number of elements along Y processed per workitem(in bytes)
 * @param[in]  src0_offset_first_element_in_bytes The offset of the first element in the source matrix
 * @param[in]  src1_ptr                           Pointer to the source matrix. Supported data types: same as @p src0_ptr
 * @param[in]  src1_stride_x                      Stride of the source matrix in X dimension (in bytes)
 * @param[in]  src1_step_x                        src_stride_x * number of elements along X processed per workitem(in bytes)
 * @param[in]  src1_stride_y                      Stride of the source matrix in Y dimension (in bytes)
 * @param[in]  src1_step_y                        src_stride_y * number of elements along Y processed per workitem(in bytes)
 * @param[in]  src1_offset_first_element_in_bytes The offset of the first element in the source matrix
 * @param[out] dst_ptr                            Pointer to the destination matrix Supported data types: same as @p src0_ptr
 * @param[in]  dst_stride_x                       Stride of the destination matrix in X dimension (in bytes)
 * @param[in]  dst_step_x                         dst_gx_stride_x * number of elements along X processed per workitem(in bytes)
 * @param[in]  dst_stride_y                       Stride of the destination matrix in Y dimension (in bytes)
 * @param[in]  dst_step_y                         dst_gx_stride_y * number of elements along Y processed per workitem(in bytes)
 * @param[in]  dst_offset_first_element_in_bytes  The offset of the first element in the destination matrix
 */
__kernel void gemm_mm_interleaved_transposed_f32_midgard(IMAGE_DECLARATION(src0),
                                                         IMAGE_DECLARATION(src1),
                                                         IMAGE_DECLARATION(dst))
{
    /* src_addr.s0 = address of matrix A */
    /* src_addr.s1 = address of matrix B */

    /* Compute address for matrix A and B */
    int2 src_addr = (int2)(get_global_id(1), get_global_id(0)) * (int2)((src0_stride_y),
                                                                        (src1_stride_y));

    /* Add offset_first_element_in_bytes */
    src_addr = src_addr + ((int2)(src0_offset_first_element_in_bytes, src1_offset_first_element_in_bytes));

    /* Divide by 4 in order to get the src_addr in unit of float */
    src_addr = src_addr >> 2;

    /* Compute end row address for matrix B */
    int end_row_mtx_b = src_addr.s1 + COLS_B;

    /* Reset accumulators */
    float4 c00 = 0.0f;
    float4 c10 = 0.0f;
    float4 c20 = 0.0f;
    float4 c30 = 0.0f;

    for(; src_addr.s1 <= (end_row_mtx_b - 8); src_addr += (int2)(8, 8))
    {
        /* Load values from matrix A (interleaved) and matrix B (transposed) */
        float4 a0 = vload4(0, ((__global float *)src0_ptr) + src_addr.s0);
        float4 b0 = vload4(0, ((__global float *)src1_ptr) + src_addr.s1);

        c00 += (float4)a0.s0 * b0;
        c10 += (float4)a0.s1 * b0;
        c20 += (float4)a0.s2 * b0;
        c30 += (float4)a0.s3 * b0;

        /* Load values from matrix A (interleaved) and matrix B (transposed) */
        a0 = vload4(0, ((__global float *)src0_ptr) + src_addr.s0 + 4);
        b0 = vload4(0, ((__global float *)src1_ptr) + src_addr.s1 + 4);

        c00 += (float4)a0.s0 * b0;
        c10 += (float4)a0.s1 * b0;
        c20 += (float4)a0.s2 * b0;
        c30 += (float4)a0.s3 * b0;
    }

    for(; src_addr.s1 < end_row_mtx_b; src_addr += (int2)(4, 4))
    {
        /* Load values from matrix A (interleaved) and matrix B (transposed) */
        float4 a0 = vload4(0, ((__global float *)src0_ptr) + src_addr.s0);
        float4 b0 = vload4(0, ((__global float *)src1_ptr) + src_addr.s1);

        c00 += (float4)a0.s0 * b0;
        c10 += (float4)a0.s1 * b0;
        c20 += (float4)a0.s2 * b0;
        c30 += (float4)a0.s3 * b0;
    }

    /* Compute destination address */
    Image dst = CONVERT_TO_IMAGE_STRUCT(dst);

    /* Multiply by the weight of matrix product */
    c00 = c00 * (float4)ALPHA;
    c10 = c10 * (float4)ALPHA;
    c20 = c20 * (float4)ALPHA;
    c30 = c30 * (float4)ALPHA;

    /* Store 4x4 block */
    vstore4(c00, 0, (__global float *)(offset(&dst, 0, 0)));
    vstore4(c10, 0, (__global float *)(offset(&dst, 0, 1)));
    vstore4(c20, 0, (__global float *)(offset(&dst, 0, 2)));
    vstore4(c30, 0, (__global float *)(offset(&dst, 0, 3)));
}

How to adjust and query AMD GPU clock frequency?

I found the command sudo aticonfig (from this link: [ubuntu] How to adjust GPU and graphic memory frequency?); below is the frequency-related part of its output:

AMD Overdrive (TM) options

AMD Overdrive (TM) options:
  The following options are used to get and set current and peak, core
  and memory clock information as well as read the current temperature of
  adapters.  By using the "--adapter=" argument the AMD Overdrive (TM)
  options can be targeted to a particular adapter in a multi-adapter scenario.
  If no adapter is explicitly targeted the commands will be run on the Default
  adapter as indicated by the "--list-adapters" command
  --od-enable
        Unlocks the ability to change core or memory clock values by
        acknowledging that you have read and understood the AMD Overdrive (TM)
        disclaimer and accept responsibility for and recognize the potential
        dangers posed to your hardware by changing the default core or memory
        clocks
  --od-disable
        Disables AMD Overdrive(TM) set related aticonfig options.  Previously
        commited core and memory clock values will remain, but will not be set
        on X Server restart.
  --odgc, --od-getclocks
        Lists various information regarding current core and memory clock
        settings.
        Including: current and peak clocks
                   the theoretical range clocks can be set to
                   the current load on the GPU
  --odsc, --od-setclocks={NewCoreClock|0,NewMemoryClock|0}
        Sets the core and memory clock to the values specified in MHz
        The new clock values must be within the theoretical ranges provided
        by --od-getclocks.  If a 0 is passed as either the NewCoreClock or
        NewMemoryClock it will retain the previous value and not be changed.
        There is no guarantee that the attempted clock values will succeed
        even if they lay inside the theoretical range.  These newly set
        clock values will revert to the default values if they are not
        committed using the "--od-commitclocks" command before X is
        restarted
  --odrd, --od-restoredefaultclocks
        Sets the core and memory clock to the default values.
        Warning X needs to be restarted before these clock changes will take
        effect
  --odcc, --od-commitclocks
        Once the stability of a new set of custom clocks has been proven this
        command will ensure that the Adapter will attempt to run at these new
        values whenever X is restarted
  --odgt, --od-gettemperature
        Returns the temperature reported by any thermal sensors available on
        the adapter.
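
A typical session with the options above might look like this (the clock values are placeholders; check the valid range with --od-getclocks first):

$ sudo aticonfig --od-enable
$ sudo aticonfig --od-getclocks
$ sudo aticonfig --od-setclocks=900,1200
$ sudo aticonfig --od-commitclocks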

PowerXpress options

PowerXpress options:
  Warning: Configure PowerXpress in console mode is not always be guaranteed to have effect.
  --px-list-active-gpu
  --pxl
       List current activated GPU
  --px-dgpu
       Activate discrete GPU (High-Performance mode), must re-start X to take effect
  --px-igpu
       Activate integrated GPU (Power-Saving mode), must re-start X to take effect

Related issues:

  • How to set Performance mode for AMD GPU · Issue #327 · clMathLibraries/clBLAS
    clMathLibraries/clBLAS#327
  • Support AMD Embedded R-Series RX-416GD Radeon R6? · Issue #27 · ROCmSoftwarePlatform/MIOpen
    ROCm/MIOpen#27

GPU utilization

  1. Qualcomm platform
    GPU utilization:
adb shell "cat /sys/class/kgsl/kgsl-3d0/gpubusy"

(first value / second value) * 100% = utilization

  2. MTK platform
    GPU utilization:
adb shell "cat /d/ged/hal/gpu_utilization"

————————————————
Copyright notice: the material above is from an original article by the CSDN blogger 「法迪」, under the CC 4.0 BY-SA license; reposts must include the original source link and this notice.
Original link: https://blog.csdn.net/su749520/article/details/88829221
