simpleGEMM



This is an extremely minimalistic but fast implementation of matrix multiplication in CUDA. The source code is a single, 200-line file, gemm.cuh, which implements half-precision tensor core matrix multiplication, optimised for the Turing (SM75) architecture.

The implementation builds on top of CuTe from CUTLASS, a low-level interface for tensor manipulation in CUDA. The code is well-commented and is meant to be easily readable (minimal CUDA/C++ background knowledge required) and hackable.

Benchmark against standard implementations (see main.cu and reference.cu):

$ ./main
Usage: ./main M N K iters

$ ./main 4096 4096 4096 1000
Time elapse: 6043.59ms
TFLOPS: 22.7413

$ ./main 8192 8192 8192 100
Time elapse: 4819.51ms
TFLOPS: 22.8138

$ ./reference 4096 4096 4096 1000
Time elapse: 6040.42ms
TFLOPS: 22.7532

$ ./reference 8192 8192 8192 100
Time elapse: 4657.08ms
TFLOPS: 23.6095

The theoretical maximum for the hardware I used (RTX 2060) is 26 TFLOPS.
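The TFLOPS figures above are consistent with the usual GEMM operation count of 2·M·N·K floating-point operations per multiplication (one multiply and one add per inner-product term). Below is a minimal sketch of that arithmetic, assuming this is how the benchmark derives its number. The function name `gemm_tflops` is hypothetical and is not taken from main.cu.

```cpp
#include <cstdint>

// Hypothetical helper (not from main.cu): achieved throughput in TFLOPS,
// counting 2*M*N*K floating-point operations per matrix multiplication.
double gemm_tflops(std::int64_t M, std::int64_t N, std::int64_t K,
                   std::int64_t iters, double elapsed_ms) {
    // Total work across all timed iterations.
    const double flops = 2.0 * static_cast<double>(M) * static_cast<double>(N) *
                         static_cast<double>(K) * static_cast<double>(iters);
    const double seconds = elapsed_ms / 1e3;
    return flops / seconds / 1e12;  // 1 TFLOPS = 1e12 floating-point ops per second
}
```

For the first run above, `gemm_tflops(4096, 4096, 4096, 1000, 6043.59)` evaluates to roughly 22.74, which is about 87% of the quoted 26 TFLOPS peak.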

Quick start

Requires a working CUDA installation. See https://docs.nvidia.com/cuda/cuda-installation-guide-linux/ for instructions. If you don't have a compatible GPU, you can run this in Colab.

Compile the main.cu file (the build/ output directory must already exist, e.g. via mkdir -p build):

nvcc \
    --include-path ./ \
    --include-path cutlass/include \
    --generate-code=arch=compute_75,code=[compute_75,sm_75] \
    --expt-relaxed-constexpr \
    -forward-unknown-to-host-compiler \
    -std=c++17 \
    -O3 \
    -o build/main \
    main.cu

And run!

$ ./build/main
Usage: ./main M N K iters

$ ./build/main 4096 4096 4096 1000
Time elapse: 6043.59ms
TFLOPS: 22.7413

You can also build with CMake (a better option for development):

$ mkdir build
$ cd build/
$ cmake ..
-- Configuring done
-- Generating done
-- Build files have been written to: /workspaces/simpleGEMM/build
$ make main 
Consolidate compiler generated dependencies of target main
[ 50%] Building CUDA object CMakeFiles/main.dir/main.cu.o
[100%] Linking CUDA executable main
[100%] Built target main
$ ./main
Usage: ./main M N K iters

What's missing

The code trades off generality for simplicity:

Only supports fp16 matmul out of the box. It should be quite easy to move to bf16, though.
Optimised for SM75 w/ tensor cores. This is probably sub-optimal for SM80+ (e.g. A100), but probably not terrible either.
Assumes (asserts) the inputs are divisible by the block size.
Assumes the inputs are in row-major layout. (Though you probably only want to use a row-major layout anyway, as other combinations are 10-30% slower.)
Doesn't do software pipelining (interleaving the global memory load for the next tile with computation on the current one).
Is only optimal for "normal" problem sizes. For more exotic problem sizes, such as small M/N with large K, a specialised implementation such as a split-K kernel is likely to perform better.
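The divisibility assumption in the list above can be checked on the host before launching the kernel, and non-conforming sizes can be rounded up to the next tile boundary. Below is a minimal sketch. The tile sizes `kBlockM`, `kBlockN` and `kBlockK` are placeholder values chosen for illustration and are not taken from gemm.cuh. Both helper names are hypothetical.

```cpp
#include <cstdint>

// Placeholder tile sizes for illustration only; the real values live in gemm.cuh.
constexpr std::int64_t kBlockM = 128;
constexpr std::int64_t kBlockN = 128;
constexpr std::int64_t kBlockK = 32;

// True if every dimension is a whole number of tiles.
constexpr bool is_tile_aligned(std::int64_t M, std::int64_t N, std::int64_t K) {
    return M % kBlockM == 0 && N % kBlockN == 0 && K % kBlockK == 0;
}

// Smallest multiple of `block` that is >= `dim` (assumes dim >= 0 and block > 0).
constexpr std::int64_t round_up(std::int64_t dim, std::int64_t block) {
    return (dim + block - 1) / block * block;
}
```

Zero-padding the inputs up to `round_up(M, kBlockM)`, `round_up(N, kBlockN)` and `round_up(K, kBlockK)` leaves the product unchanged in the original M×N region, at the cost of some wasted work on the padding.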
