packtpublishing / learn-cuda-programming

Learn CUDA Programming, published by Packt

License: MIT License

Languages: Makefile 8.15%, Cuda 37.81%, C++ 20.09%, C 6.81%, Python 23.78%, MATLAB 0.12%, R 0.06%, Shell 2.99%, Batchfile 0.20%

learn-cuda-programming's Introduction

Learn CUDA Programming


This is the code repository for Learn CUDA Programming, published by Packt.

A beginner's guide to GPU programming and parallel computing with CUDA 10.x and C/C++

What is this book about?

Compute Unified Device Architecture (CUDA) is NVIDIA's GPU computing platform and application programming interface. It's designed to work with programming languages such as C, C++, and Python. With CUDA, you can leverage a GPU's parallel computing power for a range of high-performance computing applications in the fields of science, healthcare, and deep learning.

This book covers the following exciting features:

  • Understand general GPU operations and programming patterns in CUDA
  • Uncover the difference between GPU programming and CPU programming
  • Analyze GPU application performance and implement optimization strategies
  • Explore GPU programming, profiling, and debugging tools
  • Grasp parallel programming algorithms and how to implement them
  • Scale GPU-accelerated applications with multi-GPU and multi-node setups
  • Delve into GPU programming platforms with accelerated libraries, Python, and OpenACC
  • Gain insights into deep learning accelerators in CNNs and RNNs using GPUs

If you feel this book is for you, get your copy today!

https://www.packtpub.com/

Instructions and Navigations

All of the code is organized into folders. For example, Chapter02.

The code will look like the following:

#include <stdio.h>
#include <stdlib.h>

__global__ void print_from_gpu(void) {
    printf("Hello World! from thread [%d,%d] from device\n",
           threadIdx.x, blockIdx.x);
}
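
For context, here is a minimal host-side main that launches this kernel (a sketch, not taken verbatim from the book; error checking is omitted):

    int main(void) {
        // Launch one block of four threads; each thread prints its indices.
        print_from_gpu<<<1, 4>>>();
        // Block until the kernel finishes so its printf output is flushed.
        cudaDeviceSynchronize();
        return 0;
    }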

Following is what you need for this book: This beginner-level book is for programmers who want to delve into parallel computing, become part of the high-performance computing community, and build modern applications. Basic C and C++ programming experience is assumed. For deep learning enthusiasts, this book covers Python interoperability, deep learning libraries, and practical examples of performance estimation.

With the following software and hardware list, you can run all of the code files present in the book (Chapters 1 to 10).

Software and Hardware List

Chapter | Software required          | OS required
--------|----------------------------|------------
All     | CUDA Toolkit 9.x/10.x      | Linux
8       | MATLAB (later than 2010a)  | Linux
9       | PGI Compilers 18.x/19.x    | Linux
10      | NGC                        | Linux

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. Click here to download it.

Related product

Hands-On GPU-Accelerated Computer Vision with OpenCV and CUDA [Packt] [Amazon]

Get to Know the Authors

Jaegeun Han is currently working as a solutions architect at NVIDIA, Korea. He has around nine years of experience, and he supports consumer internet companies with deep learning. Before NVIDIA, he worked on system software and parallel computing development, and on application development in the medical and surgical robotics field. He obtained a master's degree in CSE from Seoul National University.

Bharatkumar Sharma obtained a master's degree in information technology from the Indian Institute of Information Technology, Bangalore. He has around 10 years of development and research experience in the domains of software architecture and distributed and parallel computing. He is currently working with NVIDIA as a senior solutions architect, South Asia.

Suggestions and Feedback

Click here if you have any feedback or suggestions.

learn-cuda-programming's People

Contributors

bharatk-parallel, dleunji, haanjack, packt-itservice, poojaparvatkar, romydias, techkang


learn-cuda-programming's Issues

How to implement a complete LSTM network with cudnn?

This is a very good book!
I am building a semi-supervised learning system with GA + LSTM. I need to build a multi-layer LSTM network with cuDNN, which contains several LSTM layers and a softmax output layer. The LSTM network only needs inference, not training. In 10_deep_learning/03_rnn/rnn.cpp you show how to create LSTM layers, but how do you connect an LSTM layer to a softmax layer? There should be a dense layer in between: the output of the LSTM layer is 3D, while the input of the dense layer is 2D. How do I convert between them?
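
(Not an answer from the authors, just a common pattern.) Assuming cuDNN's usual time-major layout, the LSTM output y is [seq_len, batch, hidden], so the last time step is already a contiguous [batch, hidden] 2D matrix and can be fed straight into a cuBLAS GEMM acting as the dense layer, with no explicit reshape. A sketch, where every name and dimension variable is hypothetical:

    #include <cublas_v2.h>

    // Hypothetical dense layer applied to the last LSTM time step; all pointers are device memory.
    void dense_after_lstm(cublasHandle_t handle,
                          const float *d_y,   // LSTM output, seq_len x batch x hidden (time-major)
                          const float *d_W,   // weights, hidden x num_classes (row-major)
                          float *d_fc,        // output, batch x num_classes (row-major)
                          int seq_len, int batch, int hidden, int num_classes)
    {
        // The last time step is already a contiguous batch x hidden matrix.
        const float *d_last = d_y + (size_t)(seq_len - 1) * batch * hidden;

        // Row-major fc = last * W. cuBLAS is column-major, so compute fc^T = W^T * last^T.
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    num_classes, batch, hidden,
                    &alpha, d_W, num_classes,
                    d_last, hidden,
                    &beta, d_fc, num_classes);
        // d_fc can then be handed to cudnnSoftmaxForward as an n=batch, c=num_classes tensor.
    }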

wrong final reduction in all grid-stride loop examples

All of the chapter 3 examples for reduction with a grid-stride loop have the following calls:

    reduction_kernel<<<n_blocks, n_threads>>>(g_outPtr, g_inPtr, size);
    reduction_kernel<<< 1, n_threads, n_threads * sizeof(float), 0 >>>(g_outPtr, g_inPtr, n_blocks);

I believe the second one should have arguments (g_outPtr, g_outPtr, n_blocks) (g_outPtr twice). The author of this blog post https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/ seems to agree (note the double "out" in (out, out, blocks)):

  deviceReduceKernel<<<blocks, threads>>>(in, out, N);
  deviceReduceKernel<<<1, 1024>>>(out, out, blocks);
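
Spelling out the proposed fix (a sketch that keeps the book's kernel and launch configuration, changing only the second call's arguments):

    // First pass: reduce the input array into one partial sum per block (written to g_outPtr).
    reduction_kernel<<<n_blocks, n_threads>>>(g_outPtr, g_inPtr, size);
    // Second pass: reduce the n_blocks partial sums in g_outPtr, in place.
    reduction_kernel<<<1, n_threads, n_threads * sizeof(float), 0>>>(g_outPtr, g_outPtr, n_blocks);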

Does the "softmax_loss_kernel" function only run in the first thread?

At line 42 of Chapter10/10_deep_learning/01_ann/src/loss.cu:

    __global__ void
    softmax_loss_kernel(......)
    {
        int batch_idx = blockDim.x * blockIdx.x + threadIdx.x;
        ....
        if (batch_idx > 0)
            return;

        for (int c = 0; c < num_outputs; c++)
            loss += target[batch_idx * num_outputs + c] * logf(predict[batch_idx * num_outputs + c]);
        workspace[batch_idx] = -loss;
        ....
    }

Since batch_idx must be zero past the early return, why use batch_idx at all to index target, predict, and workspace?

sgemm.cu:6:10: fatal error: helper_functions.h: No such file or directory

Well, the error pretty well says it all. The code I want to compile is straight up missing the file helper_functions.h. Specifically, I am on page 39 of the book looking at the profiling example

nvcc -o sgemm sgemm.cu

and I can't compile the example code. This is the example under Chapter02/02_memory_overview/04_sgemm. I should also mention that I git cloned the repository: I am not running the Zip file code (it is really old).
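
For what it's worth, helper_functions.h ships with the CUDA samples rather than with the core toolkit headers, so adding the samples' include directory to the compile line may fix this (the path below assumes a default CUDA 10.x install):

    nvcc -I/usr/local/cuda/samples/common/inc -o sgemm sgemm.cu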

I have a bit of a dispute with this code

In Learn-CUDA-Programming/tree/master/Chapter02/02_memory_overview/01_sgemm/sgemm.cu, in sgemm_gpu_kernel:

    sum += A[i + row * K] * B[col + i * M];

I think it should be M.

Code Deprecation and missing method calls in Chapter02/02_memory_overview/03_image_scaling

This code example has two issues. In Chapter02/02_memory_overview/03_image_scaling, I tried compiling it with the standard nvcc -o <exec name> <cudafile>.cu syntax (nvcc -o runme image_scaling.cu), but I get deprecation warnings and then a link error.

image_scaling.cu(90): warning: conversion from a string literal to "char *" is deprecated

image_scaling.cu(90): warning: conversion from a string literal to "char *" is deprecated

image_scaling.cu: In function ‘int main(int, char**)’:
image_scaling.cu:82:67: warning: ‘cudaError_t cudaThreadSynchronize()’ is deprecated [-Wdeprecated-declarations]
         returnValue = (cudaError_t)(returnValue | cudaThreadSynchronize());
                                                                   ^
/usr/local/cuda-10.0/bin/../targets/x86_64-linux/include/cuda_runtime_api.h:947:46: note: declared here
 extern __CUDA_DEPRECATED __host__ cudaError_t CUDARTAPI cudaThreadSynchronize(void);
                                              ^~~~~~~~~~~~~~~~~~~~~
/tmp/tmpxft_000010a4_00000000-10_image_scaling.o: In function `main':
tmpxft_000010a4_00000000-5_image_scaling.cudafe1.cpp:(.text+0x159): undefined reference to `get_PgmPpmParams(char*, int*, int*)'
tmpxft_000010a4_00000000-5_image_scaling.cudafe1.cpp:(.text+0x1ba): undefined reference to `scr_read_pgm(char*, unsigned char*, int, int)'
tmpxft_000010a4_00000000-5_image_scaling.cudafe1.cpp:(.text+0x535): undefined reference to `scr_write_pgm(char*, unsigned char*, int, int, char*)'
collect2: error: ld returned 1 exit status

There are actually two problems with this code; the deprecation warnings were only the first.

I searched for a solution to the deprecation issue and found one here. Basically, the fix is to replace cudaThreadSynchronize with cudaDeviceSynchronize on line 82 of image_scaling.cu.

The second issue has to do with the PGM file parsing method calls. For some reason, the main file cannot find the definitions by looking through the header file, so the second line in image_scaling.cu needs to change from #include"scrImagePgmPpmPackage.h" to #include"scrImagePgmPpmPackage.cpp". Only then does it find the method definitions.

TL;DR
The code works after some edits, and I wanted to point out what those edits were.

I am not an expert on include directives, but I hypothesize that the original .h include syntax is correct, and that the nvcc compiler is looking for the header/implementation files in the wrong order.
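
An alternative workaround, consistent with the undefined-reference errors above being linker rather than preprocessor problems (an assumption, not verified against this repository): keep the original #include"scrImagePgmPpmPackage.h" and instead compile both translation units so the linker sees the definitions:

    nvcc -o runme image_scaling.cu scrImagePgmPpmPackage.cpp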

Unused texture reference?

So just like most people here, I am new to CUDA, and because of that I may be totally wrong. However the texture reference declared here:

texture<unsigned char, 2, cudaReadModeElementType> tex;

doesn't seem to be used; instead, we use the texture object created in main for texture fetching. So I was wondering whether this line is even necessary. I tried commenting it out and running, and it compiled and ran just fine, with the expected results. Then I did a tiny amount of research, and it looks like the texture reference would only be used if we were binding it to some device memory, which we are not. So it does seem to be unused and unneeded.

Again I may be wrong, correct me if I am.

Thanks for your time!
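
For reference, the texture-object path the sample does use looks roughly like this (a sketch, not copied from the book; cu_array is a hypothetical cudaArray already holding the image):

    cudaResourceDesc resDesc = {};              // describe the backing memory
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cu_array;

    cudaTextureDesc texDesc = {};               // describe how fetches behave
    texDesc.readMode = cudaReadModeElementType;
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;

    cudaTextureObject_t texObj = 0;
    cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);
    // ...launch the kernel with texObj and fetch via tex2D<unsigned char>(texObj, x, y)...
    cudaDestroyTextureObject(texObj);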

How to build ResNet network to train with cuda

Although deep learning frameworks (e.g., TF, PyTorch) are commonly used to train models, I still want to try building a ResNet model using the cuDNN/cuBLAS libraries for deep learning training.
Could you give me some suggestions on how to create a classic deep learning network (e.g., ResNet) with CUDA?
Thank you very much!
For ResNet-18, the Pad/AddV2/Reshape/Mean/BiasAdd ops still need to be implemented.
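
Not an official answer, but several of those ops map onto existing cuDNN calls: the residual AddV2 can be a tensor add, and the global Mean can be average pooling over the full spatial extent. A minimal sketch of the skip connection, where the handle, descriptors, and pointers are all hypothetical:

    #include <cudnn.h>

    // d_main = 1.0 * d_shortcut + 1.0 * d_main : accumulate the shortcut branch
    // into the main branch; x_desc and y_desc describe identical NCHW tensors.
    const float one = 1.0f;
    cudnnAddTensor(cudnn, &one, x_desc, d_shortcut, &one, y_desc, d_main);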

cannot compile chapter 10 on linux

So I was jumping around the repository to see if I could at least try to run the code, and I ran into another issue.
Specifically, I tried compiling the chapter 10 code by cd-ing to this location:

blah/Learn-CUDA-Programming/Chapter10/10_deep_learning/right here

and running make, which gives this error:

~/nvidiastuff/learncuda/Learn-CUDA-Programming/Chapter10/10_deep_learning$ make
make[1]: Entering directory '/home/thatrobotguy/nvidiastuff/learncuda/Learn-CUDA-Programming/Chapter10/10_deep_learning/01_ann'
/usr/local/cuda/bin/nvcc -ccbin g++ -I/usr/local/cuda/samples/common/inc -I/usr/local/cuda/include -m64 -g -std=c++11 -G --resource-usage -Xcompiler -rdynamic -Xcompiler -fopenmp -rdc=true -lnvToolsExt -I/usr/local/cuda/samples/common/inc -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcublas -lcudnn -lgomp -lcurand -gencode arch=compute_70,code=sm_70 -c train.cpp -o obj/train.o
nvcc warning : Resource usage is not shown as the final resource allocation is not done.
Assembler messages:
Fatal error: can't create obj/train.o: No such file or directory
Makefile:29: recipe for target 'obj/train.o' failed
make[1]: *** [obj/train.o] Error 1
make[1]: Leaving directory '/home/thatrobotguy/nvidiastuff/learncuda/Learn-CUDA-Programming/Chapter10/10_deep_learning/01_ann'
Makefile:7: recipe for target '01_ann/Makefile.ph_build' failed
make: *** [01_ann/Makefile.ph_build] Error 2

I know I have GCC set to version 7 since I am running CUDA 10.0 on Ubuntu 18.04, but I don't think my machine is the problem, since all of the software paths are known to the nvcc compiler (they are added in my ~/.bashrc file).
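
In case it helps (an inference from the assembler message, not a verified fix): "can't create obj/train.o: No such file or directory" suggests the Makefile writes objects into an obj/ directory that it never creates, so creating it by hand before building may unblock the compile:

    mkdir -p 01_ann/obj && make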

Porting the repository to a single unified CMake build structure

TL;DR

Please port the code to use CMake exclusively: code compilation is currently a bear to deal with.

Long story:

I have been poking around this code repository trying out samples, and I have run into multiple issues getting examples to compile. This repository would be a million times cleaner if all of the code were compiled by a single cmake command in the root directory. I should also mention that mixing CUDA and non-CUDA projects is often done through the CMake build tools (ROS is a big example), and this code would be more usable and portable (say, to Windows) if CMake handled all of the code compilation. I was going to convert it to CMake myself (which looks to be a monumental task, given all of the files), but there is so much broken and/or non-compiling code that I would not know where to start.

I should say that I bought the CMake Cookbook from Packt, and I think this book needs to use the principles mentioned in that book to make this repository usable. This would also allow a single folder to hold all of the compiled files so that the .gitignore file can ignore them, which would make it easier to create pull requests against this repository when changes need to be made (as GCC and Ubuntu get updated, etc.). I have made pull requests to other authors' repositories simply because the repository was so well organized that it took me 10 minutes to do, so I would recommend that this repo receive the same TLC.

I would normally call this a feature request, but the code is broken enough that it would solve a lot of problems if this repo used CMake and Make together rather than just Make by itself.
