I am following Jeremy's tutorial on writing a CUDA kernel for RGB-to-grayscale conversion and reproduced his notebook, but the build fails when I call module = load_cuda(cuda_src, cpp_src, ['rgb_to_grayscale'], verbose=True). The output is:
Using /hdd4/srinath2/.cache/torch_extensions/py312_cu121 as PyTorch extensions root...
Creating extension directory /hdd4/srinath2/.cache/torch_extensions/py312_cu121/inline_ext...
Detected CUDA files, patching ldflags
Emitting ninja build file /hdd4/srinath2/.cache/torch_extensions/py312_cu121/inline_ext/build.ninja...
Building extension module inline_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=inline_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /hdd4/srinath2/.conda/envs/llm_env/lib/python3.12/site-packages/torch/include -isystem /hdd4/srinath2/.conda/envs/llm_env/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /hdd4/srinath2/.conda/envs/llm_env/lib/python3.12/site-packages/torch/include/TH -isystem /hdd4/srinath2/.conda/envs/llm_env/lib/python3.12/site-packages/torch/include/THC -isystem /hdd4/srinath2/.conda/envs/llm_env/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /hdd4/srinath2/.cache/torch_extensions/py312_cu121/inline_ext/cuda.cu -o cuda.cuda.o
FAILED: cuda.cuda.o
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=inline_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /hdd4/srinath2/.conda/envs/llm_env/lib/python3.12/site-packages/torch/include -isystem /hdd4/srinath2/.conda/envs/llm_env/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /hdd4/srinath2/.conda/envs/llm_env/lib/python3.12/site-packages/torch/include/TH -isystem /hdd4/srinath2/.conda/envs/llm_env/lib/python3.12/site-packages/torch/include/THC -isystem /hdd4/srinath2/.conda/envs/llm_env/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /hdd4/srinath2/.cache/torch_extensions/py312_cu121/inline_ext/cuda.cu -o cuda.cuda.o
cc1plus: fatal error: cuda_runtime.h: No such file or directory
compilation terminated.
[2/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=inline_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /hdd4/srinath2/.conda/envs/llm_env/lib/python3.12/site-packages/torch/include -isystem /hdd4/srinath2/.conda/envs/llm_env/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /hdd4/srinath2/.conda/envs/llm_env/lib/python3.12/site-packages/torch/include/TH -isystem /hdd4/srinath2/.conda/envs/llm_env/lib/python3.12/site-packages/torch/include/THC -isystem /hdd4/srinath2/.conda/envs/llm_env/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /hdd4/srinath2/.cache/torch_extensions/py312_cu121/inline_ext/main.cpp -o main.o
ninja: build stopped: subcommand failed.
[Output truncated — it exceeded the notebook's output size limit; full traceback continues below.]
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
File ~/.conda/envs/llm_env/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2096, in _run_ninja_build(build_directory, verbose, error_prefix)
2095 stdout_fileno = 1
-> 2096 subprocess.run(
2097 command,
2098 stdout=stdout_fileno if verbose else subprocess.PIPE,
2099 stderr=subprocess.STDOUT,
2100 cwd=build_directory,
2101 check=True,
2102 env=env)
2103 except subprocess.CalledProcessError as e:
2104 # Python 2 and 3 compatible way of getting the error object.
File ~/.conda/envs/llm_env/lib/python3.12/subprocess.py:571, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
570 if check and retcode:
--> 571 raise CalledProcessError(retcode, process.args,
572 output=stdout, stderr=stderr)
573 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
...
2110 if hasattr(error, 'output') and error.output: # type: ignore[union-attr]
2111 message += f": {error.output.decode(*SUBPROCESS_DECODE_ARGS)}" # type: ignore[union-attr]
-> 2112 raise RuntimeError(message) from e
RuntimeError: Error building extension 'inline_ext'
The salient failure appears to be `cc1plus: fatal error: cuda_runtime.h: No such file or directory` during the nvcc step, which suggests the compiler cannot find the CUDA toolkit headers. Please let me know how to debug this or how to proceed further.