
CUDA Programming

The goal of this document is only to make it easier for new developers to understand the overall CUDA paradigm and NVIDIA GPU features.

This document is just a high-level overview of CUDA features and the CUDA programming model. It is best read before installing the CUDA Toolkit and before diving into the CUDA C Programming Guide, the CUDA Runtime API reference and the code samples.

For a complete and in-depth description of the CUDA Toolkit, the CUDA Runtime API and the CUDA libraries, please consult the NVIDIA websites.


Table of contents

Annexes


Introduction

CPU versus GPU design

A GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and is therefore designed so that more transistors are devoted to data processing rather than data caching and flow control.

The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it employs a unique architecture called SIMT (Single-Instruction, Multiple-Thread). Instructions are pipelined to leverage instruction-level parallelism within a single thread, as well as thread-level parallelism through extensive simultaneous hardware multithreading. However, unlike on CPU cores, instructions are issued in order, and there is no branch prediction and no speculative execution.

Terminology

  • HOST: The CPU and its memory (Host Memory)

  • DEVICE: The GPU and its memory (Device Memory)

  • COMPUTE CAPABILITY: The compute capability of a device is represented by a version number, also sometimes called its "SM version". This version number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the present GPU.

    • The compute capability version comprises a major and a minor version number (x.y):
      • Devices with the same major revision number are of the same core architecture:
        • 1 = Tesla architecture
        • 2 = Fermi architecture
        • 3 = Kepler architecture
        • 5 = Maxwell architecture
      • The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features.

Notes:

  • The compute capability version of a particular GPU should not be confused with the CUDA version (e.g., CUDA 5.5, CUDA 6, CUDA 6.5, CUDA 7.0), which is the version of the CUDA software platform.
  • The Tesla architecture (Compute Capability 1.x) is no longer supported starting with CUDA 7.0.
Architecture specifications per compute capability:

| Compute capability (version) | 1.0 | 1.1 | 1.2 | 1.3 | 2.0 | 2.1 | 3.0 | 3.5 | 5.0 | 5.2 |
|---|---|---|---|---|---|---|---|---|---|---|
| Number of cores for integer and floating-point arithmetic operations | 8 | 8 | 8 | 8 | 32 | 48 | 192 | 192 | 128 | 128 |

Examples:

  • MacBook Pro 15": NVIDIA GeForce GT 750M, 384 cores with 2 GB of memory
  • NVIDIA Tesla GPU Accelerator K40: 2,880 cores with 12 GB of memory

CUDA Software Architecture


CUDA is composed of two APIs:

  • A low-level API called the CUDA Driver API,
  • A higher-level API called the CUDA Runtime API that is implemented on top of the CUDA driver API.

These APIs are mutually exclusive: An application should use either one or the other.

  • The CUDA runtime eases device code management by providing implicit initialization, context management, and module management.
  • In contrast, the CUDA driver API requires more code, is harder to program and debug, but offers a better level of control and is language-independent since it only deals with cubin objects. In particular, it is more difficult to configure and launch kernels using the CUDA driver API, since the execution configuration and kernel parameters must be specified with explicit function calls. Also, device emulation does not work with the CUDA driver API.

The easiest way to write an application that uses GPU acceleration is to use the existing CUDA libraries when they match your needs: cuBLAS, cuRAND, cuFFT, cuSOLVER, cuDNN, ...

If the existing libraries do not fit your needs, then you must write your own GPU functions in the C language (CUDA calls them kernels, and CUDA C programs have the ".cu" file extension) and use them in your application through the CUDA Runtime API, linking the CUDA Runtime library and your kernel functions into your application.

Versioning and Compatibility


  • The driver API is backward compatible, meaning that applications, plug-ins, and libraries (including the C runtime) compiled against a particular version of the driver API will continue to work on subsequent device driver releases.

  • The driver API is not forward compatible, which means that applications, plug-ins, and libraries (including the C runtime) compiled against a particular version of the driver API will not work on previous versions of the device driver.

  • Only one version of the CUDA device driver can be installed on a system.

  • All plug-ins and libraries used by an application must use the same version of:

    • any CUDA libraries (such as cuFFT, cuBLAS, ...)
    • the associated CUDA runtime.

Thread Hierarchy

CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.

The CUDA programming model assumes that the CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C program. This is the case, for example, when the kernels execute on a GPU and the rest of the C program executes on a CPU.

Calling a kernel function from the Host launches a grid of thread blocks on the Device:

Heterogeneous Programming

Kernel

A kernel is a special C function defined using the __global__ declaration specifier and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<...>>> execution configuration syntax. Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

The following sample code adds two vectors A and B of size 16 and stores the result into vector C:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C) {
  int i = threadIdx.x;
  C[i] = A[i] + B[i];
}

int main() {
   ...
   // Kernel invocation of 1 Block with 16 threads
   VecAdd<<< 1, 16 >>>(A, B, C);
   ...
}

Here, each of the 16 threads that execute VecAdd() performs one pair-wise addition:

Kernel invocation 1 Block of 16 Threads

Thread

For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional thread block. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.

The index of a thread and its thread ID relate to each other in a straightforward way:

  • For a one-dimensional block of size (Dx), the thread ID of a thread of index (x) is simply x.
  • For a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y*Dx).
  • For a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y*Dx + z*Dx*Dy).

Note: Even if the programmer uses 1, 2 or 3 dimensions for the data representation, memory is ultimately a one-dimensional array.
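As a purely illustrative sketch (this kernel is not part of the original samples), the linear thread ID formula above is exactly what one writes by hand when a 2D block is used to address a flat array:

// Illustrative only: a 2D thread index flattened into a 1D offset,
// since device memory is ultimately one-dimensional.
__global__ void MatAddFlat(float* A, float* B, float* C) {
  int x = threadIdx.x;              // column index inside the block
  int y = threadIdx.y;              // row index inside the block
  int i = x + y * blockDim.x;       // linear thread ID = x + y*Dx
  C[i] = A[i] + B[i];
}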

The following code adds two matrices A and B of size NxN and stores the result into matrix C:

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
  int i = threadIdx.x;
  int j = threadIdx.y;
  C[i][j] = A[i][j] + B[i][j];
}

int main() {
  ...
  // Kernel invocation with one block of N * N * 1 threads
  dim3 threadsPerBlock(N, N);
  MatAdd<<< 1, threadsPerBlock >>>(A, B, C);
  ...
}

Warp

The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology.

Warp

When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.

A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.

Thread Divergence

The threads of a warp that are on that warp's current execution path are called the active threads, whereas threads not on the current path are inactive (disabled). Threads can be inactive because they have exited earlier than other threads of their warp, or because they are on a different branch path than the branch path currently executed by the warp, or because they are the last threads of a block whose number of threads is not a multiple of the warp size.
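For illustration only (this kernel is hypothetical, not from the CUDA samples), the following data-dependent branch forces each warp to execute both paths serially, with half of its lanes inactive at a time:

// Illustrative divergent kernel: even and odd lanes of a warp take different paths.
__global__ void DivergentKernel(int* out) {
  int i = threadIdx.x;
  if (i % 2 == 0)
    out[i] = 2 * i;   // executed while the odd lanes are disabled
  else
    out[i] = 3 * i;   // executed while the even lanes are disabled
}

Branching on a per-warp quantity instead, for example if ((i / warpSize) % 2 == 0), keeps all 32 threads of a warp on the same path and avoids the divergence.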

Block

Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks.

The number of threads per block and the number of blocks per grid specified in the <<<...>>> syntax can be of type int or dim3. Two-dimensional blocks or grids can be specified as in the example above.

  • Each block within the grid can be identified by a one-dimensional, two-dimensional, or three-dimensional index accessible within the kernel through the built-in blockIdx variable.
  • The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.
// Kernel definition
__global__ void fooKernel( ... )
{
  unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
}

int main()
{
  ...
  // Kernel invocation with 3x2 blocks of 4x3 threads
  dim3 dimGrid(3, 2, 1);
  dim3 dimBlock(4, 3, 1);
  fooKernel<<< dimGrid, dimBlock  >>>( ... );
  ...
}

Consequence of fooKernel<<< dimGrid, dimBlock >>> invocation is the launch of a Grid consisting of 3x2 Blocks with 4x3 Threads in each block:

Thread Hierarchy

Grid

There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.

However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.

The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.
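A common host-side pattern (shown here as a sketch with illustrative names such as VecAddN, d_A, d_B, d_C and numElements, not code from the original text) is to pick a fixed block size and round the number of blocks up so that every element of the data gets a thread:

// Kernel using a global index built from blockIdx, blockDim and threadIdx.
__global__ void VecAddN(const float* A, const float* B, float* C, int numElements) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < numElements)                 // guard: the last block may have spare threads
    C[i] = A[i] + B[i];
}

// Host side: the grid size is dictated by the data size, rounded up.
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
VecAddN<<< blocksPerGrid, threadsPerBlock >>>(d_A, d_B, d_C, numElements);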

Automatic Scalability

Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores, enabling programmers to write code that scales with the number of cores.

A GPU is built around an array of Streaming Multiprocessors (SMs). A Grid is partitioned into blocks of threads that execute independently from each other, so that a GPU with more multiprocessors will automatically execute the Grid in less time than a GPU with fewer multiprocessors.

Automatic Scalability


| Technical Specifications per Compute Capability | 1.1 | 1.2 - 1.3 | 2.x | 3.0 - 3.2 | 3.5 | 3.7 | 5.0 | 5.2 |
|---|---|---|---|---|---|---|---|---|
| Grid of thread blocks | | | | | | | | |
| Maximum dimensionality of a grid | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 |
| Maximum x-dimension of a grid | 65535 | 65535 | 65535 | 2^31-1 | 2^31-1 | 2^31-1 | 2^31-1 | 2^31-1 |
| Maximum y- or z-dimension of a grid | 65535 | 65535 | 65535 | 65535 | 65535 | 65535 | 65535 | 65535 |
| Thread Block | | | | | | | | |
| Maximum dimensionality of a block | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Maximum x- or y-dimension of a block | 512 | 512 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 |
| Maximum z-dimension of a block | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
| Maximum number of threads per block | 512 | 512 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 |
| Per Multiprocessor | | | | | | | | |
| Warp size | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| Maximum number of resident blocks | 8 | 8 | 8 | 16 | 16 | 16 | 32 | 32 |
| Maximum number of resident warps | 24 | 32 | 48 | 64 | 64 | 64 | 64 | 64 |
| Maximum number of resident threads | 768 | 1024 | 1536 | 2048 | 2048 | 2048 | 2048 | 2048 |

Memory

The GPU card contains its own DRAM memory, separate from the CPU's DRAM.

CUDA Compute Job

  1. A typical CUDA compute job begins by copying data to the GPU, usually to global memory.
  2. The GPU then asynchronously runs the compute task it has been assigned.
  3. When the host makes a call to copy memory back to host memory, the call will block until all threads have completed and the data is available, at which point results are sent back.
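A minimal sketch of these three steps, assuming the illustrative VecAddN kernel from the previous section, omitting error checking (d_ prefixes denote device pointers, and h_A, h_B, h_C are assumed host buffers of the same size):

int numElements = 1 << 20;
size_t size = numElements * sizeof(float);

// 1. Allocate device global memory and copy the input data to it
float *d_A, *d_B, *d_C;
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

// 2. Launch the kernel (asynchronous from the host's point of view)
int threadsPerBlock = 256;
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
VecAddN<<< blocksPerGrid, threadsPerBlock >>>(d_A, d_B, d_C, numElements);

// 3. Copy the results back; this call blocks until the kernel has completed
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);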

Host Memory

  • Page-Locked Host Memory (Pinned Memory)

The CUDA runtime provides functions to allow the use of page-locked (also known as pinned) host memory, as opposed to regular pageable host memory.

Using page-locked host memory has several benefits:

  • Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices.
  • On some devices, page-locked host memory can be mapped into the address space of the device, eliminating the need to copy it to or from device memory.
  • On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked and even higher if in addition it is allocated as write-combining.
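A small sketch (illustrative names, not from the original text; size, d_A and stream are assumed to exist as in the earlier sketches and the Streams section below): pinned host memory is allocated with cudaMallocHost() or cudaHostAlloc() and freed with cudaFreeHost(), and can then be used for asynchronous copies that overlap with kernel execution:

float *h_pinned;
cudaMallocHost((void**)&h_pinned, size);            // page-locked host allocation
// ... fill h_pinned on the CPU ...
cudaMemcpyAsync(d_A, h_pinned, size, cudaMemcpyHostToDevice, stream);
// ... launch kernels in the same stream ...
cudaFreeHost(h_pinned);                             // release the pinned buffer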

Pinned Memory

Device Memory

CUDA threads may access data from multiple memory spaces during their execution:

  • Local memory
    • Each thread has private local memory (per-thread local memory).
    • Local memory is so named because its scope is local to the thread, not because of its physical location. In fact, local memory is off-chip (=DRAM). Hence, access to local memory is as expensive as access to global memory. In other words, the term local in the name does not imply faster access.
  • Shared memory
    • Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block (per-block shared memory).
    • Because it is on-chip, shared memory has much higher bandwidth and lower latency than local and global memory.
  • Global memory
    • All threads have access to the same global memory (=DRAM).

Memory Hierarchy


| Technical Specifications per Compute Capability | 1.1 | 1.2 - 1.3 | 2.x | 3.0 - 3.2 | 3.5 | 3.7 | 5.0 | 5.2 |
|---|---|---|---|---|---|---|---|---|
| Registers | | | | | | | | |
| Number of 32-bit registers per multiprocessor | 8k | 16k | 32k | 64k | 64k | 128k | 64k | 64k |
| Maximum number of 32-bit registers per thread block | | | 32k | 64k | 64k | 64k | 64k | 64k |
| Maximum number of 32-bit registers per thread | 128 | 128 | 63 | 63 | 255 | 255 | 255 | 255 |
| Local Memory | | | | | | | | |
| Maximum amount of local memory per thread | 16 KB | 16 KB | 512 KB | 512 KB | 512 KB | 512 KB | 512 KB | 512 KB |
| Shared Memory | | | | | | | | |
| Maximum amount of shared memory per multiprocessor | 16 KB | 16 KB | 48 KB | 48 KB | 48 KB | 112 KB | 64 KB | 96 KB |
| Maximum amount of shared memory per thread block | 16 KB | 16 KB | 48 KB | 48 KB | 48 KB | 48 KB | 48 KB | 48 KB |

Unified Memory

Data that is shared between the CPU and GPU must be allocated in both memories, and explicitly copied between them by the program. This adds a lot of complexity to CUDA programs.

Unified Memory creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using a single pointer. The key is that the system automatically migrates data allocated in Unified Memory between host and device so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU.

The CUDA runtime hides all the complexity, automatically migrating data to the place where it is accessed.
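As a sketch of how this looks in code (again with illustrative names, reusing the VecAddN kernel and launch configuration from earlier sketches; cudaMallocManaged() is the Unified Memory allocation call in the CUDA runtime):

float *A, *B, *C;
cudaMallocManaged((void**)&A, size);   // a single pointer valid on host and device
cudaMallocManaged((void**)&B, size);
cudaMallocManaged((void**)&C, size);
// ... fill A and B directly from the CPU, no explicit copies ...
VecAddN<<< blocksPerGrid, threadsPerBlock >>>(A, B, C, numElements);
cudaDeviceSynchronize();               // wait for the GPU before reading C on the CPU
// ... read C directly from the CPU ...
cudaFree(A); cudaFree(B); cudaFree(C);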

Unified Memory

An important point is that a carefully tuned CUDA program that uses streams and cudaMemcpyAsync to efficiently overlap execution with data transfers may very well perform better than a CUDA program that only uses Unified Memory. Understandably so: the CUDA runtime never has as much information as the programmer does about where data is needed and when! CUDA programmers still have access to explicit device memory allocation and asynchronous memory copies to optimize data management and CPU-GPU concurrency.

Unified Memory is first and foremost a productivity feature that provides a smoother on-ramp to parallel computing, without taking away any of CUDA’s features for power users.


Asynchronous Concurrency Execution

Concurrent Data Access

Thread Synchronization

Threads can access each other's results through shared and global memory, so they can work together. But what if a thread reads a result before another thread has written it? Threads need to synchronize.

Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores, enabling programmers to write code that scales with the number of cores.

But threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function. __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed.

The __syncthreads() call is a block-level synchronization barrier: it is safe to use only when all threads in a block reach the barrier.
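For illustration (a hypothetical kernel, not from the original text), the classic pattern is to stage data in shared memory, call __syncthreads(), and only then read elements written by other threads of the block:

__global__ void ReverseBlock(int* data) {
  __shared__ int tile[256];                // per-block shared memory (block size <= 256 here)
  int i = threadIdx.x;
  tile[i] = data[i];                       // each thread writes its own element
  __syncthreads();                         // barrier: all the writes above are now visible
  data[i] = tile[blockDim.x - 1 - i];      // safe to read an element written by another thread
}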

Atomic Functions

An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory. For example, atomicAdd() reads a word at some address in global or shared memory, adds a number to it, and writes the result back to the same address. The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads: no other thread can access this address until the operation is complete.

  • Atomic functions can only be used in device functions.
  • Atomic functions operating on mapped page-locked memory (Mapped Memory) are not atomic from the point of view of the host or other devices.
| Arithmetic Functions | Bitwise Functions |
|---|---|
| atomicAdd | atomicAnd |
| atomicSub | atomicOr |
| atomicExch | atomicXor |
| atomicMin | |
| atomicMax | |
| atomicInc | |
| atomicDec | |
| atomicCAS | |
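A minimal sketch of the typical use case (hypothetical kernel): building a histogram, where several threads may increment the same bin at the same time, so a plain read-modify-write would lose updates:

// bins must point to 256 unsigned int counters, initialized to zero.
__global__ void Histogram(const unsigned char* input, int n, unsigned int* bins) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    atomicAdd(&bins[input[i]], 1u);   // the read-modify-write is one atomic transaction
}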

Concurrent Execution between Host and Device

In order to facilitate concurrent execution between host and device, some function calls are asynchronous:

  • Kernel launches
  • Memory copies between two addresses to the same device memory
  • Memory copies from host to device of a memory block of 64 KB or less
  • Memory copies performed by functions that are suffixed with Async
  • Memory set function calls

Control is returned to the host thread before the device has completed the requested task.

Concurrent Data Transfers

Devices with Compute Capability >= 2.0 can perform a copy from page-locked host memory to device memory concurrently with a copy from device memory to page-locked host memory.

Applications may query this capability by checking the asyncEngineCount device property, which is equal to 2 for devices that support it.

Concurrent Kernel Execution

Devices with compute capability >= 2.0 can execute multiple kernels concurrently. Applications may query this capability by checking the concurrentKernels device property, which is equal to 1 for devices that support it.
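Both capability flags mentioned above can be read with cudaGetDeviceProperties(), as in this small sketch (printf requires stdio.h):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                            // properties of device 0
printf("asyncEngineCount  = %d\n", prop.asyncEngineCount);    // 2 = concurrent copies in both directions
printf("concurrentKernels = %d\n", prop.concurrentKernels);   // 1 = concurrent kernel execution supported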

The maximum number of kernel launches that a device can execute concurrently is:

| Compute Capability | Maximum number of concurrent kernels per GPU |
|---|---|
| 1.x | not supported |
| 2.x | 16 |
| 3.0, 3.2 | 16 |
| 3.5, 3.7 | 32 |
| 5.0, 5.2 | 32 |

A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.

Overlap of Data Transfer and Kernel Execution

Some devices can perform copies between page-locked host memory and device memory concurrently with kernel execution.

Streams

Applications manage concurrency through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently.

A stream is defined by creating a stream object and specifying it as the stream parameter to a sequence of kernel launches and host <-> device memory copies.

The amount of execution overlap between two streams depends on the order in which the commands are issued to each stream and whether or not the device supports overlap of data transfer and kernel execution, concurrent kernel execution, and/or concurrent data transfers.
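A sketch of the usual pattern (MyKernel, the d_in/d_out device buffers and the h_in/h_out page-locked host buffers are illustrative names): each stream gets its own copy-kernel-copy sequence, and sequences from different streams may overlap on capable devices:

cudaStream_t stream[2];
for (int s = 0; s < 2; ++s)
  cudaStreamCreate(&stream[s]);

for (int s = 0; s < 2; ++s) {
  cudaMemcpyAsync(d_in[s], h_in[s], size, cudaMemcpyHostToDevice, stream[s]);
  MyKernel<<< blocks, threads, 0, stream[s] >>>(d_in[s], d_out[s]);
  cudaMemcpyAsync(h_out[s], d_out[s], size, cudaMemcpyDeviceToHost, stream[s]);
}

for (int s = 0; s < 2; ++s)
  cudaStreamDestroy(stream[s]);

The third launch parameter (0 here) is the amount of dynamically allocated shared memory; the fourth is the stream. The host buffers should be page-locked for the copies to be truly asynchronous.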

Default Stream

Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order in the default stream.

The relative priorities of streams can be specified at creation time. At runtime, as blocks in low-priority streams finish, waiting blocks in higher-priority streams are scheduled in their place.

Explicit Synchronization

There are various ways to explicitly synchronize streams with each other.

| Synchronization function | Description |
|---|---|
| cudaDeviceSynchronize() | Waits until all preceding commands in all streams of all host threads have completed. |
| cudaStreamSynchronize() | Takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device. |
| cudaStreamWaitEvent() | Takes a stream and an event as parameters and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed. The stream can be 0, in which case all the commands added to any stream after the call to cudaStreamWaitEvent() wait on the event. |
| cudaStreamQuery() | Provides applications with a way to know if all preceding commands in a stream have completed. |

To avoid unnecessary slowdowns, all these synchronization functions are usually best used for timing purposes or to isolate a launch or memory copy that is failing.

Implicit Synchronization

Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:

  • a page-locked host memory allocation,
  • a device memory allocation,
  • a device memory set,
  • a memory copy between two addresses to the same device memory,
  • any CUDA command to the NULL stream,
  • a switch between the L1/shared memory configurations ( Compute Capability 2.x and 3.x).

For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:

  • Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;
  • Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.

Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream.

Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:

  • All independent operations should be issued before dependent operations,
  • Synchronization of any kind should be delayed as long as possible.

Callbacks

The runtime provides a way to insert a callback at any point into a stream via cudaStreamAddCallback(). A callback is a function that is executed on the host once all commands issued to the stream before the callback have completed. Callbacks in stream 0 are executed once all preceding tasks and commands issued in all streams before the callback have completed.

The commands that are issued in a stream (or all commands issued to any stream if the callback is issued to stream 0) after a callback do not start executing before the callback has completed.

A callback must not make CUDA API calls (directly or indirectly), as it might end up waiting on itself, leading to a deadlock.
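A small sketch of the API (the callback body, MyKernel and stream are illustrative): the host function is registered with cudaStreamAddCallback() and runs once all preceding work in the stream has completed:

// Host function executed by the CUDA runtime once the stream reaches this point.
void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status, void* userData) {
  printf("stream work done, status = %d\n", (int)status);   // must NOT call the CUDA API
}

...
MyKernel<<< blocks, threads, 0, stream >>>( ... );
cudaStreamAddCallback(stream, MyCallback, NULL, 0);          // the flags argument must currently be 0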

Events

The runtime also provides a way to closely monitor the device's progress, as well as perform accurate timing, by letting the application asynchronously record events at any point in the program and query when these events are completed. An event has completed when all tasks - or optionally, all commands in a given stream - preceding the event have completed. Events in stream zero are completed after all preceding tasks and commands in all streams are completed.
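For example (a minimal timing sketch using the default stream; MyKernel and the launch configuration are illustrative names):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // record in the default stream
MyKernel<<< blocks, threads >>>( ... );
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                // wait until the event has completed

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);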

Dynamic Parallelism

Dynamic Parallelism enables a CUDA kernel to create and synchronize new nested work, using the CUDA runtime API to launch other kernels, optionally synchronize on kernel completion, perform device memory management, and create and use streams and events, all without CPU involvement.

Note: Requires Compute Capability >= 3.5

Dynamic Parallelism

Dynamic Parallelism dynamically spawns new threads by adapting to the data without going back to the CPU, greatly simplifying GPU programming and accelerating algorithms.
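A minimal sketch (hypothetical kernels; such code must be compiled with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true ... -lcudadevrt):

__global__ void ChildKernel(float* data) {
  data[threadIdx.x] *= 2.0f;
}

__global__ void ParentKernel(float* data, int n) {
  if (threadIdx.x == 0) {
    ChildKernel<<< 1, n >>>(data);   // launched from the GPU, without CPU involvement
    cudaDeviceSynchronize();         // device-side wait for the child grid to finish
  }
}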

Dynamic Parallelism

Hyper-Q

Hyper-Q enables multiple CPU threads or processes to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and slashing CPU idle times. This feature increases the total number of “connections” between the host and GPU by allowing 32 simultaneous, hardware-managed connections, compared to the single connection available with GPUs without Hyper-Q. Hyper-Q is a flexible solution that allows connections for both CUDA streams and Message Passing Interface (MPI) processes, or even threads from within a process. Existing applications that were previously limited by false dependencies can see a dramatic performance increase without changing any existing code.

Note: Requires Compute Capability >= 3.5

Hyper-Q


Multi-Device System

A host system can have multiple GPU Devices.

Stream and Event Behavior

  • A kernel launch will fail if it is issued to a stream that is not associated with the current device.
  • A memory copy will succeed even if it is issued to a stream that is not associated with the current device.
  • Each device has its own default stream, so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.
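A small sketch of issuing work to every device in the system (MyKernel and the launch configuration are illustrative; each device would also need its own memory allocations):

int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);

for (int dev = 0; dev < deviceCount; ++dev) {
  cudaSetDevice(dev);                 // make this device current
  cudaStream_t s;
  cudaStreamCreate(&s);               // the stream is associated with device dev
  MyKernel<<< blocks, threads, 0, s >>>( ... );
  cudaStreamDestroy(s);
}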

GPU Direct

  • Accelerated communication with network and storage devices: Network and GPU device drivers can share "pinned" (page-locked) buffers, eliminating the need to make a redundant copy in CUDA host memory.

  • Peer-to-Peer Transfers between GPUs: Use high-speed DMA transfers to copy data between the memories of two GPUs on the same system/PCIe bus.

  • Peer-to-Peer memory access: Optimize communication between GPUs using NUMA-style access to memory on other GPUs from within CUDA kernels.

  • RDMA: Eliminate CPU bandwidth and latency bottlenecks using remote direct memory access (RDMA) transfers between GPUs and other PCIe devices, resulting in significantly improved MPI send/receive efficiency between GPUs and other nodes.

  • GPUDirect for Video: Optimized pipeline for frame-based devices such as frame grabbers, video switchers, HD-SDI capture, and CameraLink devices.

  • GPUDirect version 1 supported accelerated communication with network and storage devices. It was supported by InfiniBand solutions available from Mellanox and others.
  • GPUDirect version 2 added support for peer-to-peer communication between GPUs on the same shared memory server.
  • GPUDirect RDMA enables RDMA transfers across an InfiniBand network between GPUs in different cluster nodes, bypassing CPU host memory altogether.

Using GPUDirect, 3rd party network adapters, solid-state drives (SSDs) and other devices can directly read and write CUDA host and device memory. GPUDirect eliminates unnecessary system memory copies, dramatically lowers CPU overhead, and reduces latency, resulting in significant performance improvements in data transfer times for applications running on NVIDIA products.


ANNEXES

CUDA Libraries

cuBLAS (Basic Linear Algebra Subroutines)

The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library is a GPU-accelerated version of the complete standard BLAS library.

  • Complete support for all 152 standard BLAS routines
  • Single, double, complex, and double complex data types
  • Support for CUDA streams
  • Fortran bindings
  • Support for multiple GPUs and concurrent kernels
  • Batched GEMM API
  • Device API that can be called from CUDA kernels
  • Batched LU factorization API
  • Batched matrix inverse API
  • New implementation of TRSV (Triangular solve), up to 7x faster than previous implementation.

cublas

To learn more about cuBLAS: https://developer.nvidia.com/cuBLAS
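As a small sketch of the cuBLAS v2 host API (assuming cublas_v2.h and linking with -lcublas; d_x and d_y are assumed device vectors of n floats already filled): a single-precision AXPY, y = alpha*x + y:

cublasHandle_t handle;
cublasCreate(&handle);

float alpha = 2.0f;
cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // y = alpha*x + y, computed on the device

cublasDestroy(handle);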

cuSPARSE (Sparse Matrix)

The NVIDIA CUDA Sparse Matrix library (cuSPARSE) provides a collection of basic linear algebra subroutines used for sparse matrices that delivers up to 8x faster performance than the latest MKL. The latest release includes a sparse triangular solver.

  • Supports dense, COO, CSR, CSC, ELL/HYB and Blocked CSR sparse matrix formats
  • Level 1 routines for sparse vector x dense vector operations
  • Level 2 routines for sparse matrix x dense vector operations
  • Level 3 routines for sparse matrix x multiple dense vectors (tall matrix)
  • Routines for sparse matrix by sparse matrix addition and multiplication
  • Conversion routines that allow conversion between different matrix formats
  • Sparse Triangular Solve
  • Tri-diagonal solver
  • Incomplete factorization preconditioners ilu0 and ic0

cusparse

To learn more about cuSPARSE: https://developer.nvidia.com/cuSPARSE

cuSOLVER (Solver)

The NVIDIA cuSOLVER library provides a collection of dense and sparse direct solvers which deliver significant acceleration for Computer Vision, CFD, Computational Chemistry, and Linear Optimization applications.

  • cusolverDN: Key LAPACK dense solvers 3-6x faster than MKL.
    • Dense Cholesky, LU, SVD, QR
    • Applications include: optimization, Computer Vision, CFD
  • cusolverSP
    • Sparse direct solvers & Eigensolvers
    • Applications include: Newton's method, Chemical Kinetics
  • cusolverRF
    • Sparse refactorization solver
    • Applications include: Chemistry, ODEs, Circuit simulation

To learn more about cuSOLVER: https://developer.nvidia.com/cuSOLVER

cuFFT (Fast Fourier Transformation)

The FFT is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets. It is one of the most important and widely used numerical algorithms in computational physics and general signal processing.

cufft

To learn more about cuFFT: https://developer.nvidia.com/cuFFT

cuRAND (Random Number)

The cuRAND library provides facilities that focus on the simple and efficient generation of high-quality pseudorandom and quasirandom numbers. A pseudorandom sequence of numbers satisfies most of the statistical properties of a truly random sequence but is generated by a deterministic algorithm. A quasirandom sequence of n -dimensional points is generated by a deterministic algorithm designed to fill an n-dimensional space evenly.

Random numbers can be generated on the device or on the host CPU.

cuRAND

To learn more about cuRAND: https://developer.nvidia.com/cuRAND
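As a small sketch of the cuRAND host API (assuming curand.h and linking with -lcurand; d_data is an assumed device buffer of n floats):

curandGenerator_t gen;
curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);   // default pseudorandom generator
curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
curandGenerateUniform(gen, d_data, n);                    // n uniform floats generated on the device
curandDestroyGenerator(gen);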

cuDNN (Deep Neural Network)

cuDNN

cuDNN is a GPU-accelerated library of primitives for deep neural networks. It provides highly tuned implementations of routines arising frequently in DNN applications:

  • Convolution forward and backward, including cross-correlation
  • Pooling forward and backward
  • Softmax forward and backward
  • Neuron activations forward and backward:
    • Rectified linear (ReLU)
    • Sigmoid
    • Hyperbolic tangent (TANH)
  • Tensor transformation functions

To learn more about cuDNN: https://developer.nvidia.com/cuDNN

NPP (NVIDIA Performance Primitive)

NVIDIA Performance Primitive (NPP) is a library of functions for performing CUDA accelerated processing. The initial set of functionality in the library focuses on imaging and video processing and is widely applicable for developers in these areas. NPP will evolve over time to encompass more of the compute heavy tasks in a variety of problem domains.

  • Eliminates unnecessary copying of data to/from CPU memory
    • Process data that is already in GPU memory
    • Leave results in GPU memory so they are ready for subsequent processing
  • Data Exchange and Initialization
    • Set, Convert, Copy, CopyConstBorder, Transpose, SwapChannels
  • Arithmetic and Logical Operations
    • Add, Sub, Mul, Div, AbsDiff, Threshold, Compare
  • Color Conversion
    • RGBToYCbCr, YcbCrToRGB, YCbCrToYCbCr, ColorTwist, LUT_Linear
  • Filter Functions
    • FilterBox, Filter, FilterRow, FilterColumn, FilterMax, FilterMin, Dilate, Erode, SumWindowColumn, SumWindowRow
  • JPEG
    • DCTQuantInv, DCTQuantFwd, QuantizationTableJPEG
  • Geometry Transforms
    • Mirror, WarpAffine, WarpAffineBack, WarpAffineQuad, WarpPerspective, WarpPerspectiveBack , WarpPerspectiveQuad, Resize
  • Statistics Functions
    • Mean_StdDev, NormDiff, Sum, MinMax, HistogramEven, RectStdDev

To learn more about NPP: https://developer.nvidia.com/NPP

Atomic Functions

Arithmetic

atomicAdd

    int atomicAdd(int* address, int val)
    unsigned int atomicAdd(unsigned int* address, unsigned int val)
    unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val)
    float atomicAdd(float* address, float val)

Reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes old + val, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old. The floating-point version of atomicAdd() is only supported by devices of compute capability 2.x and higher.

atomicSub

    int atomicSub(int* address, int val)
    unsigned int atomicSub(unsigned int* address, unsigned int val)

Reads the 32-bit word old located at the address address in global or shared memory, computes old - val, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

atomicExch

    int atomicExch(int* address, int val)
    unsigned int atomicExch(unsigned int* address, unsigned int val)
    unsigned long long int atomicExch(unsigned long long int* address, unsigned long long int val)
    float atomicExch(float* address, float val)

Reads the 32-bit or 64-bit word old located at the address address in global or shared memory and stores val back to memory at the same address. These two operations are performed in one atomic transaction. The function returns old.

atomicMin

    int atomicMin(int* address, int val)
    unsigned int atomicMin(unsigned int* address, unsigned int val)
    unsigned long long int atomicMin(unsigned long long int* address, unsigned long long int val)

Reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes the minimum of old and val, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old. The 64-bit version of atomicMin() is only supported by devices of compute capability 3.5 and higher.

atomicMax

    int atomicMax(int* address, int val)
    unsigned int atomicMax(unsigned int* address, unsigned int val)
    unsigned long long int atomicMax(unsigned long long int* address, unsigned long long int val)

Reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes the maximum of old and val, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old. The 64-bit version of atomicMax() is only supported by devices of compute capability 3.5 and higher.

atomicInc

    unsigned int atomicInc(unsigned int* address, unsigned int val)

Reads the 32-bit word old located at the address address in global or shared memory, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

atomicDec

    unsigned int atomicDec(unsigned int* address, unsigned int val)

Reads the 32-bit word old located at the address address in global or shared memory, computes (((old == 0) || (old > val)) ? val : (old-1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

atomicCAS

    int atomicCAS(int* address, int compare, int val)
    unsigned int atomicCAS(unsigned int* address, unsigned int compare, unsigned int val)
    unsigned long long int atomicCAS(unsigned long long int* address, unsigned long long int compare, unsigned long long int val)

Reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old == compare ? val : old), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old (Compare And Swap).

Bitwise

atomicAnd

    int atomicAnd(int* address, int val)

Reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes old & val, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

atomicOr

    int atomicOr(int* address, int val)

Reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes old | val, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

atomicXor

    int atomicXor(int* address, int val)

Reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes old ^ val, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
