

This project forked from andylolu2/simplegemm


A simple but fast implementation of matrix multiplication in CUDA.

License: MIT License

Cuda 71.55% CMake 2.74% Jupyter Notebook 25.70%


simpleGEMM


This is an extremely minimalistic but fast implementation of matrix multiplication in CUDA. The source code is a single 200-line file, gemm.cuh, which implements half-precision tensor core matrix multiplication optimised for the Turing (SM75) architecture.

The implementation builds on top of CuTe from CUTLASS, a low-level interface for tensor manipulation in CUDA. The code is well-commented and is meant to be easily readable (minimal CUDA/C++ background knowledge required) and hackable.

Benchmark against standard implementations (see main.cu and reference.cu):

$ ./main
Usage: ./main M N K iters

$ ./main 4096 4096 4096 1000
Time elapse: 6043.59ms
TFLOPS: 22.7413

$ ./main 8192 8192 8192 100
Time elapse: 4819.51ms
TFLOPS: 22.8138

$ ./reference 4096 4096 4096 1000
Time elapse: 6040.42ms
TFLOPS: 22.7532

$ ./reference 8192 8192 8192 100
Time elapse: 4657.08ms
TFLOPS: 23.6095

The theoretical maximum for the hardware I used (RTX 2060) is 26 TFLOPS, so both implementations run at roughly 87-91% of peak.
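The TFLOPS figures above are consistent with the usual count of 2·M·N·K floating-point operations per matrix multiplication (one multiply and one add per inner-product term). The following is a minimal sketch of that arithmetic, not the code from main.cu; the function names `gemm_flops` and `achieved_tflops` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// FLOPs for one C = A * B with A of shape (M x K) and B of shape (K x N):
// each of the M*N outputs needs K multiplies and K adds.
// (Hypothetical helper, not part of the repository.)
inline double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Achieved throughput in TFLOPS for `iters` back-to-back GEMMs that took
// `elapsed_ms` milliseconds in total.
inline double achieved_tflops(std::int64_t m, std::int64_t n, std::int64_t k,
                              std::int64_t iters, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double seconds = elapsed_ms / 1e3;
    return gemm_flops(m, n, k) * static_cast<double>(iters) / seconds / 1e12;
}
```

For example, `achieved_tflops(4096, 4096, 4096, 1000, 6043.59)` reproduces the 22.74 TFLOPS reported by the first `./main` run.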

Quick start

Requires a working CUDA installation. See https://docs.nvidia.com/cuda/cuda-installation-guide-linux/ for instructions. If you don't have a compatible GPU, you can run this in Colab.

Compile the main.cu file (create the build/ directory first, as nvcc does not create the output directory for you):

nvcc \
    --include-path ./ \
    --include-path cutlass/include \
    --generate-code=arch=compute_75,code=[compute_75,sm_75] \
    --expt-relaxed-constexpr \
    -forward-unknown-to-host-compiler \
    -std=c++17 \
    -O3 \
    -o build/main \
    main.cu

And run!

$ ./build/main
Usage: ./main M N K iters

$ ./build/main 4096 4096 4096 1000
Time elapse: 6043.59ms
TFLOPS: 22.7413

You can also build with CMake (a better option for development):

$ mkdir build
$ cd build/
$ cmake ..
-- Configuring done
-- Generating done
-- Build files have been written to: /workspaces/simpleGEMM/build
$ make main 
Consolidate compiler generated dependencies of target main
[ 50%] Building CUDA object CMakeFiles/main.dir/main.cu.o
[100%] Linking CUDA executable main
[100%] Built target main
$ ./main
Usage: ./main M N K iters

What's missing

The code trades off generality for simplicity:

  • Only supports fp16 matmul out of the box. It should be quite easy to move to bf16, though.
  • Optimised for SM75 with tensor cores. This is likely sub-optimal on SM80+ (e.g. A100), though probably not terrible either.
  • Assumes (asserts) that the input dimensions are divisible by the block size.
  • Assumes the inputs are in row-major layout. (Though you probably only want to use a row-major layout anyway, as other combinations are 10-30% slower.)
  • Doesn't do software pipelining (interleaving the global memory load for the next tile with computation on the current one).
  • Is only optimal for "normal" problem sizes. For more exotic problem sizes, such as small M/N with large K, specialised implementations such as a split-K kernel are likely to perform better.
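One way to work around the divisibility assumption is to zero-pad the inputs up to the next multiple of the block size on the host and discard the extra output rows and columns afterwards. Zero-padding along K does not change the result, since the extra terms contribute zero to each inner product. The helper below is a rough sketch of the size calculation only; the tile sizes `kBlockM`, `kBlockN` and `kBlockK` are placeholder values chosen for illustration, not the ones used in gemm.cuh.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder tile sizes for illustration only (an assumption);
// check gemm.cuh for the values the kernel actually uses.
constexpr std::int64_t kBlockM = 128;
constexpr std::int64_t kBlockN = 128;
constexpr std::int64_t kBlockK = 64;

// Smallest multiple of `block` that is greater than or equal to `dim`.
constexpr std::int64_t round_up(std::int64_t dim, std::int64_t block) {
    return (dim + block - 1) / block * block;
}

// True if an (M x K) * (K x N) problem splits into whole tiles.
constexpr bool is_tileable(std::int64_t m, std::int64_t n, std::int64_t k) {
    return m % kBlockM == 0 && n % kBlockN == 0 && k % kBlockK == 0;
}

struct PaddedShape {
    std::int64_t m, n, k;
};

// Shape to allocate (and zero-fill) so that the problem becomes tileable.
constexpr PaddedShape padded_shape(std::int64_t m, std::int64_t n,
                                   std::int64_t k) {
    return PaddedShape{round_up(m, kBlockM), round_up(n, kBlockN),
                       round_up(k, kBlockK)};
}
```

With these placeholder tiles, a 4000 x 4000 x 4000 problem would be padded to 4096 x 4096 x 4032. Padding costs extra memory and wasted work, so it only makes sense when the dimensions are already close to a multiple of the tile size.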
