enccs / gpu-programming

Meta-GPU lesson covering general aspects of GPU programming as well as specific frameworks

Home Page: https://enccs.github.io/gpu-programming/

License: Creative Commons Attribution 4.0 International

Makefile 3.31% CSS 1.00% Python 6.24% Batchfile 0.28% C 8.80% Fortran 30.84% C++ 24.67% Cuda 17.27% Shell 4.22% Julia 3.36%

gpu-programming's People

al42and agoetz mahsanchez linguist89 heikonenj hichamagueny panadestein weilinscenccs acorbat begeospatial jobelund christopherwoodall aresbit dennispan edmondium goswamig rommeldb rayanral ro99

gpu-programming's Issues

core-omp.cpp missing

linked from 13-examples.rst

Choose/ provide the source for (reduction? -- TBD) example

Heat equation example(s) will probably come from current workshops, what about reduction one?

episode 4: GPU programming concepts

confusing point: why kernels? why no explicit loops?
drive home the logic with threadIds etc
memory: can't expect to have non-contiguous arrays and get performance
don't expect to get easy performance! First attempt might well be slower than CPU version
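A minimal sketch of the "why kernels, why no explicit loops" point, in plain Python so it runs anywhere. The names `saxpy_kernel` and `launch` are hypothetical and not taken from the lesson; `launch` only emulates a kernel launch by running the "threads" one after another. The point is that the loop body becomes the kernel, and the loop index is replaced by an index computed from the block and thread IDs.

```python
def saxpy_loop(a, x, y):
    """CPU style: one explicit loop walks over all elements."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out


def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    """GPU style: the body handles one element, chosen by the thread's index."""
    i = block_idx * block_dim + thread_idx  # global thread id
    if i < len(x):  # guard: the grid may be larger than the array
        out[i] = a * x[i] + y[i]


def launch(kernel, num_blocks, block_dim, *args):
    """Hypothetical stand-in for a kernel launch; runs the 'threads' serially."""
    for block_idx in range(num_blocks):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(x)
block_dim = 4
num_blocks = (len(x) + block_dim - 1) // block_dim  # round up to cover all elements
launch(saxpy_kernel, num_blocks, block_dim, 2.0, x, y, out)
```

On a real GPU the two loops inside `launch` do not exist in user code: the hardware schedules all (block, thread) pairs, which is why the bounds check inside the kernel is needed.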

Game of Life as a stencil demo example

The idea is to try and code an understandable, simple example for CPU / GPU parallelization from scratch, without carrying mental load from the earlier, already GPU-ported examples, and the suggestion is Game Of Life.
Sketching OpenMP-offload, CUDA or SYCL and Python or Julia variants should be enough to see if the concept presents itself well on different types of GPU frameworks.
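As a starting point for the discussion, a serial reference version of one Game of Life step could look like the sketch below. It is plain Python, assumes periodic boundaries (an assumption, the issue does not specify boundary handling), and `life_step` is a hypothetical name. Each cell update reads only its eight neighbours from the old grid, which is the stencil pattern a GPU variant would map to one thread per cell.

```python
def life_step(grid):
    """Return the next generation of a 2D grid of 0/1 cells (periodic edges)."""
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count live cells among the eight neighbours, wrapping at the edges.
            neighbours = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            # Birth with exactly 3 neighbours, survival with 2 or 3.
            if neighbours == 3 or (grid[r][c] == 1 and neighbours == 2):
                new[r][c] = 1
    return new


blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
after_one = life_step(blinker)
after_two = life_step(after_one)
```

The blinker pattern has period two, so `after_two` equals the starting grid, which makes it a convenient correctness check for any ported variant.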

for each long paragraph of text, include box with bullet point summary

using some new admonition box with distinct color. The box can be used while teaching rather than looking at a long paragraph of text

episode 5: GPU software environments

installation of dependencies
NVIDIA, ROCm, OneAPI, Metal
compilers

HIP-python with a bit more detail

The HIP-python part will complement the section "High-level language support". It will be addressed with a bit more detail. Some initial thoughts are:

- Accessing the GPU’s properties from HIP-python
- Managing/allocating memory from HIP-python
- Compiling kernels from HIP-python
- Launching kernels from HIP-python
- Creating Streams and Events from HIP-python
- Using HIP-python’s library e.g. hipBLAS
- ...

episode 2: Introduction to GPU hardware

typical image showing CPU vs GPU architecture
overview of the vendor landscape
generations (computing capabilities)

Examples Ep: timings and comparison / discussion

To be completed after running example codes on the host cluster.

One question so far is: is it OK/ enough to show successful GPU load and not necessarily observable speedup?
At least for OpenMP offload (and possibly Numba-CUDA), explicit data placement is also needed to achieve wall times that are better than CPU counterparts. Default data rules (=less extra code) work and engage GPU, but are slower.

episode 3: Which problems fit on a GPU?

why and when?
What scale is worth it?
What problems are worth it?

LICENSE file is a concatenation of several licenses

Apparently, one should have been chosen by the templating engine, but it did not do anything

https://github.com/ENCCS/gpu-programming/blob/bfbc1fd6c21788bcaebf877fa745c6a47e06ed64/LICENSE

{% if cookiecutter.lesson_license == 'CC-BY-4.0' -%}
Attribution 4.0 International
[....]

Creative Commons may be contacted at creativecommons.org.
{% elif cookiecutter.lesson_license == 'CC-BY-SA-4.0' %}
Attribution-ShareAlike 4.0 International
....
....

Creative Commons may be contacted at creativecommons.org.
{% endif %}

Same with LICENSE.code
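One way to catch this kind of problem is a small check for leftover template markers. The sketch below is only an illustration: `unrendered_markers` is a hypothetical helper, and the regular expression assumes Jinja-style `{% ... %}` and `{{ ... }}` delimiters as seen in the excerpt above.

```python
import re

# Assumption: unrendered templates show up as Jinja-style delimiters.
TEMPLATE_MARKER = re.compile(r"\{%.*?%\}|\{\{.*?\}\}")


def unrendered_markers(text):
    """Return all template markers that were left in the text."""
    return TEMPLATE_MARKER.findall(text)


sample = (
    "{% if cookiecutter.lesson_license == 'CC-BY-4.0' -%}\n"
    "Attribution 4.0 International\n"
    "{% endif %}\n"
)
leftovers = unrendered_markers(sample)
clean = unrendered_markers("Attribution 4.0 International\n")
```

An empty result means the file no longer contains template syntax; it does not prove that the right license was selected.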

Examples Ep: add CUDA variant

If someone is enthusiastic for Kokkos, that could be done as well, but so far there are

directive based (OpenMP offloading),
language based (Python-numba in progress),
"portable" kernel based (SYCL) implementations,

but no CUDA or HIP. So CUDA would be useful.

collected resources to reuse

Good general overview of concepts: https://enccs.github.io/HPDA-Python/GPU-computing/
new lesson with focus on python but also CUDA: https://carpentries-incubator.github.io/lesson-gpu-programming/
very good overview of frameworks: https://documentation.sigma2.no/tutorials/guides
Oak Ridge tutorials: https://github.com/olcf-tutorials/openmp_offloading

episode 6: Preparing code for GPU porting

modular code makes it easier

episodes 8-N: Diving into the frameworks

(prototype outline for each framework, each episode is one framework or group of frameworks)

introducing two example problems: heat equation and reduction
showing example problems in all different frameworks, with detailed explanation of steps
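For the reduction problem, a framework-neutral reference could be a pairwise (tree) sum like the sketch below. It is plain Python and `tree_reduce_sum` is a hypothetical name; the actual example source is still to be chosen. The structure mirrors what a GPU reduction typically does: in each pass, independent pairs are combined, and the number of active elements halves.

```python
def tree_reduce_sum(values):
    """Pairwise (tree) sum: each pass halves the number of active elements."""
    data = list(values)
    n = len(data)
    stride = 1
    while stride < n:
        # On a GPU, each iteration of this inner loop would be one thread,
        # with a barrier separating consecutive passes.
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2
    return data[0] if data else 0


total = tree_reduce_sum(range(1, 101))
odd_length = tree_reduce_sum([1, 2, 3, 4, 5])
```

The inner loop bound `n - stride` keeps `i + stride` inside the array, so the input length does not need to be a power of two.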

OpenSYCL/hipSYCL project name

The project had to change its name again (AdaptiveCpp/AdaptiveCpp#999).

This issue is a reminder to update the text once a new name is chosen.

In the meantime, "hipSYCL" is recommended.

Examples Ep: add Julia variant

Is there enough by now to "translate" to Julia with little complication?
(GPU-ifying Python variant may be needed first.)

move content from slides to lesson episode

for episode 8-multiple_gpu
if something in the slides is missing from the episode

also consider whether exercise code examples should be added into code boxes

code examples for julia programming

For mac M2 GPU, if the Float64 type does not work, switch it to Float32.
For the "write your own kernels" part, "grid" in one code block cell should be "groups".

episode 7: GPU programming options

low-level, directive-based, kernel-based, ... options.
low-level: CUDA, HIP
directive-based: OpenMP, OpenACC (incremental approach)
language-based: Python-numba, Julia-GPU, SYCL, TensorFlow/Pytorch

use new :abbr: directive to offer alternative definitions/terminology for terms

example usage is :abbr:thread in 1-gpu-history.rst - whose multiple definitions need to be fixed in conf.py

episode 1: What are GPUs and why are they used in computation?

history
what caused/motivated GPU-like solutions to certain initial (general programming) problems
why everyone has to get started with GPUs (>95% of supercomputing capacity in near future)

consider using different admonitions for non-exercise code examples

e.g. use typealong instead of challenge (or exercise), like in https://enccs.github.io/gpu-programming/8-multiple_gpu/

Issue in GPU concepts, NVIDIA architectures and branch divergence in kernel

On 4-gpu-concepts.rst:124:

On some architectures, all members of a :abbr:`warp` have to execute the 
same instruction, so-called "lock-step" execution. This is done to achieve 
higher performance, but there are some drawbacks. If a an **if** statement 
is present inside a warp will cause the warp to be executed more than once, 
one time for each branch. On architectures without lock-step execution, such 
as NVIDIA Volta (e.g., GeForce 16xx-series) or newer, warp divergence is less costly.

To my understanding, the GeForce 16xx-series is not an example of an NVIDIA Volta or newer architecture. This might need to be verified and potentially modified. I would also maybe clarify the claim about the if-statement; from what I've understood, there is branch divergence only if the if-statement is evaluated at runtime (not a templated branch) and multiple threads within a single warp actually execute different branches of the if-statement.
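The second point can be illustrated with a deliberately simplified model of lock-step execution, sketched in plain Python. `branch_passes` is a hypothetical helper, the warp width of 32 is an assumption (NVIDIA-style), and real hardware scheduling is more involved than this. The model counts how many passes each warp needs for one run-time if/else: one pass if all threads in the warp agree, two if they disagree.

```python
WARP_SIZE = 32  # assumption: NVIDIA-style warp width


def branch_passes(predicates):
    """Count serialized passes per warp for one if/else evaluated at run time."""
    passes = []
    for start in range(0, len(predicates), WARP_SIZE):
        warp = predicates[start:start + WARP_SIZE]
        # One pass per distinct branch actually taken inside this warp.
        passes.append(len(set(bool(p) for p in warp)))
    return passes


# The condition differs between warps, but is uniform within each warp.
uniform = branch_passes([tid < 32 for tid in range(64)])
# The condition alternates between neighbouring threads of the same warp.
divergent = branch_passes([tid % 2 == 0 for tid in range(64)])
```

In this model the mere presence of an `if` costs nothing extra; only the second case, where threads of one warp take different branches, doubles the number of passes.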

Kernel models eps.: organize examples to be (at least likely) runnable

The existing set of 4 examples for all 5 kernel-based models is very nice, but it could be transferred from .rst files to a subdirectory of content/examples/ with build/ run instructions where available (or some comments in the README.md).

each episode should have "see also" section

at the very least to point to other ENCCS/NRIS/CSC/etc tutorials which go into more depth

Update first exercise from Directives ep.

The "Exercise: Change the levels of parallelism" from the "Directive-based models" episode should be updated to state more clearly what the users might want to change, where, and what the expected output/ changes in output are, because many participants are confused here.

Examples Ep: adapt Numba-Python for GPU

Current code is accelerated by Numba JIT (almost to C++ performance), but is not offloaded to GPU yet.
Also, an alternative form of the main loop is in a comment, because one is faster with (CPU-)JIT, the other without it (using NumPy vectorization), and I'm not yet sure which form the GPU kernel will take.

lumi accounts

lesson learned not about the lesson material but about managing workshop projects on LUMI:
it'll probably be better to send individual Puhuri invite links directly to participants, because the automatic emails sent via Puhuri often get lost, are overlooked by participants, get stuck in spam filters, etc.

Homogenize source code used for stencil (heat equation) example

Multiple workshops contain source code for the heat equation example that ostensibly does the same thing (even with the same implementation of visualization, i.e., PNG writer) but very likely is already slightly different in each case.
Many of the variants have their roots in C, but some or all of these are actually interoperable with C++. So I'd like to get an overall feel for what makes the most sense: maintain base/ serial code in C (as it was written in 2014 or earlier), in C++ (which offers useful programming paradigms not available in C) or both?
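Whatever language is chosen for the base code, the core that all variants share is small. The sketch below shows one explicit finite-difference time step in plain Python; it is not taken from any of the workshop sources, and the names `heat_step`, `a`, `dt`, `dx` and `dy` are assumptions used for illustration. Boundary values are kept fixed.

```python
def heat_step(u, a, dt, dx, dy):
    """One explicit finite-difference step; boundary values stay fixed."""
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]  # copy, so boundaries carry over unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Five-point stencil approximation of the Laplacian.
            laplacian = (
                (u[i - 1][j] - 2.0 * u[i][j] + u[i + 1][j]) / dx**2
                + (u[i][j - 1] - 2.0 * u[i][j] + u[i][j + 1]) / dy**2
            )
            new[i][j] = u[i][j] + a * dt * laplacian
    return new


hot_spot = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
cooled = heat_step(hot_spot, a=1.0, dt=0.1, dx=1.0, dy=1.0)
flat = heat_step([[2.0] * 4 for _ in range(4)], a=1.0, dt=0.1, dx=1.0, dy=1.0)
```

A reference like this could serve as a common check for the C, C++ and GPU variants: a uniform field must stay unchanged, and a single hot cell must lose heat to its neighbours.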