cuda_gemm's Introduction

introduction

A simple, high-performance CUDA implementation of GEMM, block-sparse GEMM, and non-uniform quantized GEMM.

C = alpha * A * B + beta * C
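As a point of comparison for the kernels, the formula above can be written as a plain host-side reference. This is only a sketch: the row-major layout, the shapes (A is M x K, B is K x N, C is M x N), and the name `gemm_reference` are assumptions for illustration, not taken from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference: C = alpha * A * B + beta * C.
// Assumption: row-major storage, A is M x K, B is K x N, C is M x N.
void gemm_reference(int M, int N, int K, float alpha,
                    const std::vector<float>& A,
                    const std::vector<float>& B,
                    float beta, std::vector<float>& C) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;  // dot product of row i of A and column j of B
            for (int k = 0; k < K; ++k) {
                acc += A[static_cast<std::size_t>(i) * K + k] *
                       B[static_cast<std::size_t>(k) * N + j];
            }
            const std::size_t idx = static_cast<std::size_t>(i) * N + j;
            C[idx] = alpha * acc + beta * C[idx];
        }
    }
}
```

A reference like this is slow, but it is handy for checking a GPU result on small sizes.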

algorithm

located in src/cuda/

  • MatrixMulCUDA
    • each thread computes one element of C
    • coalesced global memory loads of B
  • MatrixMulCUDA1
    • texture loads
  • MatrixMulCUDA2
    • each thread computes one 4 * 4 tile of C
  • MatrixMulCUDA3
    • vectorized loads of A and B
  • MatrixMulCUDA4
    • vectorized stores to C
  • MatrixMulCUDA5
    • block sparse version
  • MatrixMulCUDA6
    • coalesced vectorized loads of A and B
  • MatrixMulCUDA7
    • warp shuffle so that stores to C are coalesced
  • MatrixMulCUDAQuantize8bit
    • 8-bit non-uniform quantized matmul
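The step from one element of C per thread to one 4 * 4 tile of C per thread can be illustrated on the CPU. The sketch below is not the repository's kernel: `compute_tile_4x4` plays the role of a single hypothetical thread, storage is assumed row-major, M and N are assumed to be multiples of 4, and only C = A * B is computed (alpha = 1, beta = 0).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TILE = 4;  // each emulated thread owns a TILE x TILE patch of C

// Work of one hypothetical thread whose tile starts at (row0, col0).
// Assumption: row-major storage; row0 + TILE <= M and col0 + TILE <= N.
void compute_tile_4x4(int N, int K,
                      const std::vector<float>& A,
                      const std::vector<float>& B,
                      std::vector<float>& C,
                      int row0, int col0) {
    float acc[TILE][TILE] = {};  // accumulators (registers on a GPU)
    for (int k = 0; k < K; ++k) {
        float a[TILE];
        float b[TILE];
        for (int i = 0; i < TILE; ++i) a[i] = A[(row0 + i) * K + k];
        for (int j = 0; j < TILE; ++j) b[j] = B[k * N + (col0 + j)];
        // 8 loaded values feed 16 multiply-adds per k step.
        for (int i = 0; i < TILE; ++i)
            for (int j = 0; j < TILE; ++j)
                acc[i][j] += a[i] * b[j];
    }
    for (int i = 0; i < TILE; ++i)
        for (int j = 0; j < TILE; ++j)
            C[(row0 + i) * N + (col0 + j)] = acc[i][j];
}

// Emulates the whole grid by visiting every tile in turn.
void gemm_tiled_4x4(int M, int N, int K,
                    const std::vector<float>& A,
                    const std::vector<float>& B,
                    std::vector<float>& C) {
    for (int row0 = 0; row0 < M; row0 += TILE)
        for (int col0 = 0; col0 < N; col0 += TILE)
            compute_tile_4x4(N, K, A, B, C, row0, col0);
}
```

The point of the tile is the ratio of arithmetic to loads: with one element per thread, each multiply-add needs two loaded values, while the 4 * 4 tile reuses each loaded value four times.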

experiments

located in benchmark/

  • benchmark_dense
    • compares this GEMM with cuBLAS
  • benchmark_sparse
    • compares this block sparse GEMM with cuSPARSE
  • benchmark_quantization_8bit
    • compares this GEMM with cuBLAS
  • benchmark_quantization
    • compares this GEMM with this non-uniform 8-bit quantized GEMM

TODO

  • (MatrixMulCUDA7) when writing back to the C matrix, use warp shuffle so that global memory stores are coalesced
  • (MatrixMulCUDA8) double buffering

run

mkdir builds
make benchmark_[experiment name]
bash scripts/benchmark_[experiment name].sh

Note

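  • When sparsity is around 1%, cuSPARSE can outperform cuBLAS.
  • Allocate registers sensibly: wherever possible, make parameters known at compile time, which saves compute resources and registers.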

cuda_gemm's People

Contributors

cjkkkk, linbinskn, xuehui1991, zhacmsft

cuda_gemm's Issues

Wrong results when M/N/K are not divisible by BM/BN/BK

Hi, I found that if I try $M = N = K = 100$, the result is wrong:

$ ./builds/benchmark_dense 100 100 100
Grid Dim: (2 2) Block Dim: (16 16)
My gemm Performance= 38.52 GFlop/s, Time= 0.052 msec, Size= 2000000 Ops,
CuBlas Performance= 179.52 GFlop/s, Time= 0.011 msec, Size= 2000000 Ops,
Error! Matrix[00000]=10154469497706623404430887145349054464.00000000, ref=1464674637695684277803204596465664.00000000 error term is > 1.000000E-06
Result= FAIL
ratio= 0.214553

But for the larger size $M = N = K = 1000$, the result is correct again:

$ ./builds/benchmark_dense 1000 1000 1000
Grid Dim: (16 11) Block Dim: (16 16)
My gemm Performance= 10464.73 GFlop/s, Time= 0.191 msec, Size= 2000000000 Ops,
CuBlas Performance= 15660.82 GFlop/s, Time= 0.128 msec, Size= 2000000000 Ops,
Result= PASS
ratio= 0.668211
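One common remedy for sizes that are not multiples of the tile sizes is to round the grid up and guard every access. The host-side emulation below sketches that idea under stated assumptions: row-major storage, C = A * B only, and hypothetical tile sizes passed as `BM`, `BN`, `BK` (not the repository's actual values). It is not a confirmed fix for the kernel in this issue.

```cpp
#include <cassert>
#include <vector>

// Round-up division: number of tiles of size `tile` needed to cover `n`.
constexpr int ceil_div(int n, int tile) { return (n + tile - 1) / tile; }

// Emulates a tiled GEMM (C = A * B, row-major) that stays correct when
// M, N, K are not multiples of the tile sizes BM, BN, BK.
void gemm_tiled_guarded(int M, int N, int K,
                        const std::vector<float>& A,
                        const std::vector<float>& B,
                        std::vector<float>& C,
                        int BM, int BN, int BK) {
    for (int bi = 0; bi < ceil_div(M, BM); ++bi) {        // block row
        for (int bj = 0; bj < ceil_div(N, BN); ++bj) {    // block column
            for (int ti = 0; ti < BM; ++ti) {             // "thread" row
                for (int tj = 0; tj < BN; ++tj) {         // "thread" column
                    const int row = bi * BM + ti;
                    const int col = bj * BN + tj;
                    if (row >= M || col >= N) continue;   // outside C: skip
                    float acc = 0.0f;
                    for (int k0 = 0; k0 < K; k0 += BK) {
                        for (int kk = 0; kk < BK; ++kk) {
                            const int k = k0 + kk;
                            if (k >= K) break;            // partial last K tile
                            acc += A[row * K + k] * B[k * N + col];
                        }
                    }
                    C[row * N + col] = acc;
                }
            }
        }
    }
}
```

In a CUDA kernel the same two guards would typically sit on the loads (substituting zero for out-of-range elements of A and B) and on the final store to C.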

wrong results with random inputs

Hello,

thanks for your great work! When I set the input matrices to random values, like

`// rand() and RAND_MAX come from <cstdlib>
for (int i = 0; i < M * K; i++) {
  h_A[i] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
}

for (int i = 0; i < K * N; i++) {
  h_B[i] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
}`

verification in benchmark_dense.cu fails. Do you have any idea about this?

Best regards
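When verification fails on random inputs, it can help to first measure how large the mismatch is, since two float GEMMs that sum in a different order rarely agree bit for bit. The helper below is a sketch for that purpose; the name `max_relative_error` and the `min_scale` default are assumptions, and nothing here is taken from benchmark_dense.cu.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest relative error between a result and a reference.
// `min_scale` keeps the ratio meaningful when a reference value is near
// zero; its default is an illustrative assumption.
double max_relative_error(const std::vector<float>& result,
                          const std::vector<float>& reference,
                          double min_scale = 1e-6) {
    assert(result.size() == reference.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < result.size(); ++i) {
        const double ref = static_cast<double>(reference[i]);
        const double diff = std::fabs(static_cast<double>(result[i]) - ref);
        const double scale = std::max(std::fabs(ref), min_scale);
        worst = std::max(worst, diff / scale);
    }
    return worst;
}
```

As a rough guide, a very small maximum relative error is what float rounding alone would produce, whereas an error of order 1 or more suggests an indexing or boundary problem in the kernel.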
