CUDA Samples

Samples for CUDA developers that demonstrate features in the CUDA Toolkit. This version supports CUDA Toolkit 12.4.

Release Notes

This section describes the release notes for the CUDA Samples on GitHub only.

CUDA 12.4

  • Hopper Confidential Computing Modes do not support Video samples, nor do they support host-pinned memory due to the restrictions created by CPU IOMMUs. The following Samples are affected:
    • convolutionTexture
    • cudaNvSci
    • dct8x8
    • lineOfSight
    • simpleCubemapTexture
    • simpleIPC
    • simpleLayeredTexture
    • simplePitchLinearTexture
    • simpleStream
    • simpleTexture
    • simpleTextureDrv
    • watershedSegmentationNPP

Getting Started

Prerequisites

Download and install the CUDA Toolkit 12.4 for your platform. For system requirements and installation instructions for the CUDA Toolkit, refer to the Linux Installation Guide and the Windows Installation Guide.
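Once installed, a quick sanity check (assuming nvcc is on your PATH, e.g. via /usr/local/cuda/bin) is:

```shell
nvcc --version   # should report the toolkit release, e.g. "release 12.4"
nvidia-smi       # shows the driver version and the highest CUDA version it supports
```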

Getting the CUDA Samples

Using git, clone the CUDA Samples repository with the command below.

git clone https://github.com/NVIDIA/cuda-samples.git

Without using git, the easiest way to get these samples is to download the zip file containing the current version by clicking the "Download ZIP" button on the repo page. You can then unzip the entire archive and use the samples.
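The default branch tracks the newest toolkit; if you are on an older toolkit, checking out the matching release tag (the tag name below is only an example) avoids build errors caused by newer compiler flags:

```shell
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
git tag              # list the available release tags
git checkout v11.6   # example: pick the tag matching your installed toolkit
```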

Building CUDA Samples

Windows

The Windows samples are built using the Visual Studio IDE. Solution files (.sln) are provided for each supported version of Visual Studio, using the format:

*_vs<version>.sln - for Visual Studio <version>

Complete solution files, covering all samples, exist at the parent directory of the repo.

Each individual sample has its own set of solution files at: <CUDA_SAMPLES_REPO>\Samples\<sample_dir>\

To build/examine all the samples at once, the complete solution files should be used. To build/examine a single sample, the individual sample solution files should be used.

Linux

The Linux samples are built using makefiles. To use the makefiles, change the current directory to the sample directory you wish to build, and run make:

$ cd <sample_dir>
$ make

The samples makefiles can take advantage of certain options:

  • TARGET_ARCH= - cross-compile targeting a specific architecture. Allowed architectures are x86_64, ppc64le, armv7l, aarch64. By default, TARGET_ARCH is set to HOST_ARCH. On an x86_64 machine, not setting TARGET_ARCH is the equivalent of setting TARGET_ARCH=x86_64.
    $ make TARGET_ARCH=x86_64
    $ make TARGET_ARCH=ppc64le
    $ make TARGET_ARCH=armv7l
    $ make TARGET_ARCH=aarch64
    See here for more details on cross-platform compilation of CUDA samples.

  • dbg=1 - build with debug symbols

    $ make dbg=1
    
  • SMS="A B ..." - override the SM architectures for which the sample will be built, where "A B ..." is a space-delimited list of SM architectures. For example, to generate SASS for SM 50 and SM 60, use SMS="50 60".

    $ make SMS="50 60"
    
  • HOST_COMPILER=<host_compiler> - override the default g++ host compiler. See the Linux Installation Guide for a list of supported host compilers.

    $ make HOST_COMPILER=g++
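The options above can be combined in a single invocation; for example, a debug build for one SM architecture with an explicitly chosen host compiler might look like this (the architecture and compiler shown are illustrative):

```shell
cd <sample_dir>
make dbg=1 SMS="86" HOST_COMPILER=g++
```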
    

Samples list

Basic CUDA samples for beginners that illustrate key concepts in using CUDA and the CUDA runtime APIs.

Utility samples that demonstrate how to query device capabilities and measure GPU/CPU bandwidth.

Samples that demonstrate CUDA related concepts and common problem solving techniques.

Samples that demonstrate CUDA features (Cooperative Groups, CUDA Dynamic Parallelism, CUDA Graphs, etc.).

Samples that demonstrate how to use CUDA platform libraries (NPP, NVJPEG, NVGRAPH, cuBLAS, cuFFT, cuSPARSE, cuSOLVER and cuRAND).

Samples that are domain-specific (Graphics, Finance, Image Processing).

Samples that demonstrate performance optimization.

Samples that demonstrate the use of libNVVM and NVVM IR.

Dependencies

Some CUDA Samples rely on third-party applications and/or libraries, or features provided by the CUDA Toolkit and Driver, to either build or execute. These dependencies are listed below.

If a sample has a third-party dependency that is not installed on the system, the sample will waive itself at build time.

Each sample's dependencies are listed in its README's Dependencies section.

Third-Party Dependencies

These third-party dependencies are required by some CUDA samples. Where available, these dependencies are either installed on your system automatically or installable via your system's package manager (Linux) or a third-party website.

FreeImage

FreeImage is an open source imaging library. FreeImage can usually be installed on Linux using your distribution's package manager system. FreeImage can also be downloaded from the FreeImage website.
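On Debian/Ubuntu-based systems, for example, the development package is typically installed as follows (the package name may differ on other distributions):

```shell
sudo apt-get install libfreeimage-dev
```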

To set up FreeImage on a Windows system, extract the FreeImage DLL distribution into the folder ../../../Common/FreeImage/Dist/x64 such that it contains the .h and .lib files. Copy the .dll file to the root-level bin/win64/Debug and bin/win64/Release folders.

Message Passing Interface

MPI (Message Passing Interface) is an API for communicating data between distributed processes. An MPI compiler can be installed using your Linux distribution's package manager. It is also available from online resources, such as Open MPI. On Windows, to build and run MPI-CUDA applications, install the MS-MPI SDK.
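On Debian/Ubuntu-based systems, for example, Open MPI can typically be installed as follows (package names vary by distribution):

```shell
sudo apt-get install openmpi-bin libopenmpi-dev
mpicc --version   # confirm the MPI compiler wrapper is on PATH
```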

Only 64-Bit

Some samples can only be run on a 64-bit operating system.

DirectX

DirectX is a collection of APIs designed to allow development of multimedia applications on Microsoft platforms. For Microsoft platforms, NVIDIA's CUDA Driver supports DirectX. Several CUDA samples for Windows demonstrate CUDA-DirectX interoperability; to build them, install Microsoft Visual Studio 2012 or higher, which provides the Microsoft Windows SDK for Windows 8.

DirectX12

DirectX 12 is a collection of advanced low-level programming APIs which can reduce driver overhead, designed to allow development of multimedia applications on Microsoft platforms starting with Windows 10. For Microsoft platforms, NVIDIA's CUDA Driver supports DirectX. A few CUDA samples for Windows demonstrate CUDA-DirectX 12 interoperability; to build them, install the Windows 10 SDK or higher with VS 2015 or VS 2017.

OpenGL

OpenGL is a graphics library used for 2D and 3D rendering. On systems which support OpenGL, NVIDIA's OpenGL implementation is provided with the CUDA Driver.
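On Linux, building the OpenGL interop samples also requires the GL and X11 development headers; on Debian/Ubuntu-based systems these are typically provided by packages along these lines (names vary by distribution):

```shell
sudo apt-get install freeglut3-dev libglu1-mesa-dev mesa-common-dev libxi-dev libxmu-dev
```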

OpenGL ES

OpenGL ES is an embedded systems graphics library used for 2D and 3D rendering. On systems which support OpenGL ES, NVIDIA's OpenGL ES implementation is provided with the CUDA Driver.

Vulkan

Vulkan is a low-overhead, cross-platform 3D graphics and compute API. Vulkan targets high-performance real-time 3D graphics applications such as video games and interactive media across all platforms. On systems which support Vulkan, NVIDIA's Vulkan implementation is provided with the CUDA Driver. For building and running Vulkan applications one needs to install the Vulkan SDK.

OpenMP

OpenMP is an API for multiprocessing programming. OpenMP can be installed using your Linux distribution's package manager system. It usually comes preinstalled with GCC. It can also be found at the OpenMP website.

Screen

Screen is a windowing system found on the QNX operating system. Screen is usually found as part of the root filesystem.

X11

X11 is a windowing system commonly found on *-nix style operating systems. X11 can be installed using your Linux distribution's package manager, and comes preinstalled on Mac OS X systems.

EGL

EGL is an interface between Khronos rendering APIs (such as OpenGL, OpenGL ES or OpenVG) and the underlying native platform windowing system.

EGLOutput

EGLOutput is a set of EGL extensions which allow EGL to render directly to the display.

EGLSync

EGLSync is a set of EGL extensions which provide sync objects: synchronization primitives representing events whose completion can be tested or waited upon.

NVSCI

NvSci is a set of communication interface libraries; CUDA interoperates with NvSciBuf and NvSciSync. NvSciBuf allows applications to allocate and exchange buffers in memory. NvSciSync allows applications to manage synchronization objects which coordinate when sequences of operations begin and end.

NvMedia

NvMedia provides powerful processing of multimedia data for true hardware acceleration across NVIDIA Tegra devices. Applications leverage the NvMedia Application Programming Interface (API) to process the image and video data.

CUDA Features

These CUDA features are needed by some CUDA samples. They are provided by either the CUDA Toolkit or CUDA Driver. Some features may not be available on your system.

CUFFT Callback Routines

CUFFT Callback Routines are user-supplied kernel routines that CUFFT will call when loading or storing data. These callback routines are only available on Linux x86_64 and ppc64le systems.

CUDA Dynamic Parallelism

CDP (CUDA Dynamic Parallelism) allows kernels to be launched from threads running on the GPU. CDP is only available on GPUs with SM architecture of 3.5 or above.
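Building code that uses CDP requires relocatable device code and the device runtime library; a minimal sketch of the nvcc invocation (file and output names are illustrative) is:

```shell
# CDP needs separate compilation (-rdc=true) and the device runtime (-lcudadevrt).
nvcc -arch=sm_35 -rdc=true cdpExample.cu -o cdpExample -lcudadevrt
```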

Multi-block Cooperative Groups

Multi-Block Cooperative Groups (MBCG) extends Cooperative Groups and the CUDA programming model to express inter-thread-block synchronization. MBCG is available on GPUs with Pascal and higher architecture.

Multi-Device Cooperative Groups

Multi Device Cooperative Groups extends Cooperative Groups and the CUDA programming model enabling thread blocks executing on multiple GPUs to cooperate and synchronize as they execute. This feature is available on GPUs with Pascal and higher architecture.

CUBLAS

CUBLAS (CUDA Basic Linear Algebra Subroutines) is a GPU-accelerated version of the BLAS library.

CUDA Interprocess Communication

IPC (Interprocess Communication) allows processes to share device pointers.

CUFFT

CUFFT (CUDA Fast Fourier Transform) is a GPU-accelerated FFT library.

CURAND

CURAND (CUDA Random Number Generation) is a GPU-accelerated RNG library.

CUSPARSE

CUSPARSE (CUDA Sparse Matrix) provides linear algebra subroutines used for sparse matrix calculations.

CUSOLVER

The CUSOLVER library is a high-level package based on the CUBLAS and CUSPARSE libraries. It combines three separate libraries under a single umbrella, each of which can be used independently or in concert with other toolkit libraries. The intent of CUSOLVER is to provide useful LAPACK-like features, such as common matrix factorization and triangular solve routines for dense matrices, a sparse least-squares solver, and an eigenvalue solver. In addition, cuSOLVER provides a refactorization library useful for solving sequences of matrices with a shared sparsity pattern.

NPP

NPP (NVIDIA Performance Primitives) provides GPU-accelerated image, video, and signal processing functions.

NVGRAPH

NVGRAPH is a GPU-accelerated graph analytics library.

NVJPEG

NVJPEG library provides high-performance, GPU accelerated JPEG decoding functionality for image formats commonly used in deep learning and hyperscale multimedia applications.

NVRTC

NVRTC (CUDA RunTime Compilation) is a runtime compilation library for CUDA C++.

Stream Priorities

Stream Priorities allows the creation of streams with specified priorities. Stream Priorities is only available on GPUs with SM architecture of 3.5 or above.

Unified Virtual Memory

UVM (Unified Virtual Memory) enables memory that can be accessed by both the CPU and GPU without explicit copying between the two. UVM is only available on Linux and Windows systems.

16-bit Floating Point

FP16 is a 16-bit floating-point format. One bit is used for the sign, five bits for the exponent, and ten bits for the mantissa.

C++11 CUDA

NVCC support of C++11 features.

CMake

The libNVVM samples are built using CMake 3.10 or later.
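A typical out-of-source build for those samples (directory names are illustrative) looks like:

```shell
cd <libnvvm_sample_dir>
mkdir build && cd build
cmake ..          # configure; requires CMake 3.10 or later
cmake --build .   # compile
```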

Contributors Guide

We welcome your input on issues and suggestions for samples. At this time we are not accepting contributions from the public; check back here as we evolve our contribution model.

We use the Google C++ Style Guide for all sources: https://google.github.io/styleguide/cppguide.html

Frequently Asked Questions

Answers to frequently asked questions about CUDA can be found at http://developer.nvidia.com/cuda-faq and in the CUDA Toolkit Release Notes.

References

Attributions

cuda-samples's Issues

p2p samples crash GPUs

Using different configurations of 3 1080TIs, with and without SLI, I'm able to reproduce a crash with the p2p samples.

Tested:

./0_Simple/simpleP2P/simpleP2P
./1_Utilities/p2pBandwidthLatencyTest

GPUS:

  • EVGA 1080TI FTW3

NVIDIA Driver(s):

  • several - kernel confirmed up to 455.23.04
  • cuda 10 and 11

Motherboard(s):

  • Gigabyte X399 AORUS Gaming 7
  • ASUS X399 ROG Zenith Extreme

Though X becomes hosed, no errors are logged in syslog and the host remains responsive. A running mpv process for example will continue outputting audio.

Attempts to reset the GPUs and/or unload the driver fail as does killing X (display goes black).

No issues with individual hardware, though the driver and/or userland utilities like nvidia-settings and nvidia-smi really do not play well with multiple GPUs on Linux.

I'm less concerned with the functionality and more that regular users are able to trigger it.

make file error

When I run sudo make I get this error:
nvcc fatal : Unsupported gpu architecture 'compute_80'
Makefile:311: recipe for target 'bandwidthTest.o' failed
make[1]: *** [bandwidthTest.o] Error 1
make[1]: Leaving directory '/home/parth/cuda-samples/Samples/bandwidthTest'
Makefile:45: recipe for target 'Samples/bandwidthTest/Makefile.ph_build' failed
make: *** [Samples/bandwidthTest/Makefile.ph_build] Error 2

please help

vulkan samples not building

What

Vulkan samples do not build

Where

Debian buster (10.5) X86_64

Why

findvulkan.mk has hardcoded paths for each distro. Unfortunately, Debian is not one of the included distros, and since the makefile fails to match a case, it declares libX11.so as missing.

Solution

/usr/sbin/ldconfig -p | grep -i libx11 is a more reliable way to search for libraries on any modern Linux system that CUDA supports.

nvJPEG does not support 4-channel JPEG compression

I need to compress an image with RGBA four-channel data and send the compressed data to another PC over the network. libjpeg-turbo supports the CMYK format for compressing four-channel data, but I have no idea how to use nvjpeg to compress an image with RGBA four-channel data. Can someone help me?

calculation time and method used in npp box filter

Hi,
I am starting to use CUDA for some image-processing tasks. I am interested in the method implemented in the NPP box filter, because I found it takes too long (3 ms, 1024*1024, radius 6), whereas the separable filter method provided in _Imaging/boxFilter only takes 0.7 ms on average to do the same thing.
Is the NPP box filter not a separable filter? Or does the texture usage lead to the improvement?
Looking forward to an answer! Thanks.

nvJPEG_encode fails with an image of size 8500*5667 on CUDA 10.2

I built the CUDA samples installed by the CUDA installer, and nvJPEG_encode built successfully on my platform (Driver 440.82, CUDA Version 10.2). It runs successfully with:

    ./nvJPEG_encoder -i src/ -o out/ -q 100 -s 420 -fmt rgbi

I tested several images successfully, then tested an image of size [8500*5667] with the same call, and it aborted. I then modified the code to print memory alloc/free messages:

    int dev_malloc(void **p, size_t s) {
      int ret = (int)cudaMalloc(p, s);
      printf("cudaMalloc %zu ret=%d\n", s, ret);  /* %zu for size_t */
      return ret;
    }

    int dev_free(void *p) {
      int ret = (int)cudaFree(p);
      printf("cudaFree %p ret=%d\n", p, ret);
      return ret;
    }

I then rebuilt the program and ran it, with this output:

[root@localhost nvJPEG_encoder]# ./nvJPEG_encoder -i scale/ -o out/ -q 100 -s 420 -fmt rgbi
GPU Device 0: "Pascal" with compute capability 6.1

Using GPU 0 (GeForce GTX 1080, 20 SMs, 2048 th/SM max, CC 6.1, ECC off)
cudaMalloc 16777216 ret=0
cudaMalloc 16384 ret=0
cudaMalloc 1024 ret=0
cudaMalloc 1024 ret=0
Processing file: scale/yl1-scale.jpg
Image is 3 channels.
Channel #0 size: 8501 x 5668
Channel #1 size: 8501 x 5668
Channel #2 size: 8501 x 5668
YUV 4:4:4 chroma subsampling
cudaMalloc 435338240 ret=0
cudaMalloc 145044480 ret=0
cudaMalloc 73431040 ret=0
cudaMalloc 885701632 ret=0
cudaFree 0x7ff55e800000 ret=0
cudaMalloc 24094720 ret=0
CUDA error at nvJPEG_encoder.cpp:258 code=6(NVJPEG_STATUS_EXECUTION_FAILED) "nvjpegEncodeRetrieveBitstream( nvjpeg_handle, encoder_state, obuffer.data(), &length, NULL)"
[root@localhost nvJPEG_encoder]# ./nvJPEG_encoder -i scale/ -o out/ -q 100 -s 420
GPU Device 0: "Pascal" with compute capability 6.1

Using GPU 0 (GeForce GTX 1080, 20 SMs, 2048 th/SM max, CC 6.1, ECC off)
cudaMalloc 16777216 ret=0
cudaMalloc 16384 ret=0
cudaMalloc 1024 ret=0
cudaMalloc 1024 ret=0
Processing file: scale/yl1-scale.jpg
Image is 3 channels.
Channel #0 size: 8501 x 5668
Channel #1 size: 8501 x 5668
Channel #2 size: 8501 x 5668
YUV 4:4:4 chroma subsampling
cudaMalloc 435338240 ret=0
cudaMalloc 145044480 ret=0
cudaMalloc 73431040 ret=0
cudaMalloc 885701632 ret=0
cudaFree 0x7fca96800000 ret=716
CUDA error at nvJPEG_encoder.cpp:230 code=5(NVJPEG_STATUS_ALLOCATOR_FAILURE) "nvjpegEncodeYUV(nvjpeg_handle, encoder_state, encode_params, &imgdesc, subsampling, widths[0], heights[0], NULL)"
[root@localhost nvJPEG_encoder]#

I can see that when encoding YUV, cudaFree returns 716, which means cudaErrorMisalignedAddress; but when encoding rgbi, it aborts at nvjpegEncodeRetrieveBitstream.
How can I resolve this problem? Please help.

This CUDA Sample cannot be built if the OpenMP compiler is not set up correctly.

I was trying to build the CUDA samples to test my CUDA installation on a 2080 Ti.

I was on Ubuntu 19.10, and CUDA 10.2 is not supported with gcc & g++ 9, so I downgraded to version 7 using the commands below:

    sudo apt-get install gcc-7 g++-7
    sudo update-alternatives --remove-all gcc
    sudo update-alternatives --remove-all g++
    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 50
    sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-7 50
    sudo update-alternatives --config gcc
    sudo update-alternatives --config g++

After that I was able to install the CUDA toolkit. But when I run make in the samples directory I get this error:

abhimanyuaryan@brainy:/usr/local/cuda-10.2/samples$ make
make[1]: Entering directory '/usr/local/cuda-10.2/samples/0_Simple/UnifiedMemoryStreams'
/bin/sh: 1: cannot create test.c: Permission denied
/bin/sh: 1: cannot create test.c: Permission denied
g++: error: test.c: No such file or directory
g++: fatal error: no input files
compilation terminated.
-----------------------------------------------------------------------------------------------
WARNING - OpenMP is unable to compile
-----------------------------------------------------------------------------------------------
This CUDA Sample cannot be built if the OpenMP compiler is not set up correctly.
This will be a dry-run of the Makefile.
For more information on how to set up your environment to build and run this 
sample, please refer the CUDA Samples documentation and release notes
-----------------------------------------------------------------------------------------------
[@] /usr/local/cuda/bin/nvcc -ccbin g++ -I../../common/inc -m64 -Xcompiler -fopenmp -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o UnifiedMemoryStreams.o -c UnifiedMemoryStreams.cu
[@] /usr/local/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o UnifiedMemoryStreams UnifiedMemoryStreams.o -lgomp -lcublas
[@] mkdir -p ../../bin/x86_64/linux/release
[@] cp UnifiedMemoryStreams ../../bin/x86_64/linux/release
make[1]: Leaving directory '/usr/local/cuda-10.2/samples/0_Simple/UnifiedMemoryStreams'
make[1]: Entering directory '/usr/local/cuda-10.2/samples/0_Simple/cdpSimpleQuicksort'
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -dc -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o cdpSimpleQuicksort.o -c cdpSimpleQuicksort.cu
Assembler messages:
Fatal error: can't create cdpSimpleQuicksort.o: Permission denied
make[1]: *** [Makefile:306: cdpSimpleQuicksort.o] Error 1
make[1]: Leaving directory '/usr/local/cuda-10.2/samples/0_Simple/cdpSimpleQuicksort'
make: *** [Makefile:51: 0_Simple/cdpSimpleQuicksort/Makefile.ph_build] Error 2

What's wrong? Is it something serious? Will it cause problems when I use TensorFlow or PyTorch? If yes, how do I fix it?

Missing samples

Are there any plans to add the rest of the samples that come with CUDA? NPP-related samples have a few bugs I wanted to send PRs for.

cudaMalloc error

Here is a problem that confuses me a lot.
At first I used cudaSetDevice to select device 1; then, I don't know why, it suddenly changed to GPU device 0. Then I cudaMalloc'd space to store GPU data and got the following error (CUDA error code 70). Anyone have an idea?
Also, if I set device 0, cudaMalloc is normal and no error occurs, but I really want to know why this happens!

========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000490 in void gemv2T_kernel_val<float, float, float, int=128, int=16, int=2, int=2, bool=0>(int, int, float, float const *, int, float const *, int, float, float*, int)
=========     by thread (95,0,0) in block (80,0,0)
=========     Address 0x7fe1d9f4283c is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x22b40d]```

Import External Memory fail

My OS is Windows 10 x64.
CUDA version: v10.2.89
GPU: GeForce GTX 1080

I want to copy CEF shared texture data and send it to another PC, just like remote desktop. First I must get the shared texture data; I use the cudaImportExternalMemory function, code below:

    cudaError_t CudaD3D11Resource2BGR_A(void *shareHandle, int Size) {
      if (!bInit) {
        cudaSetDevice(0);
        bInit = true;
      }
      unsigned char *BGRAbuffer = NULL;
      cudaHostAlloc(&BGRAbuffer, Size, 0);

      cudaExternalMemoryHandleDesc ExternalMemoryHandleDesc;
      cudaExternalMemory_t ExternalMemory = NULL;
      void *CudaDevVertptr = NULL;
      cudaError_t Error = cudaSuccess;
      memset(&ExternalMemoryHandleDesc, 0, sizeof(ExternalMemoryHandleDesc));

      ExternalMemoryHandleDesc.type = cudaExternalMemoryHandleTypeD3D11ResourceKmt;
      ExternalMemoryHandleDesc.handle.win32.handle = shareHandle;
      ExternalMemoryHandleDesc.size = Size;
      ExternalMemoryHandleDesc.flags = cudaExternalMemoryDedicated;
      Error = cudaImportExternalMemory(&ExternalMemory, &ExternalMemoryHandleDesc);
      if (Error) {
        WCHAR out[500];
        swprintf_s(out, 500, L"error:%hs\n", cudaGetErrorName(Error));
        OutputDebugString(out);
        return Error;
      }
      return cudaSuccess;
    }

The CudaD3D11Resource2BGR_A call returns an error (code 999, cudaErrorUnknown) when I invoke it like below:

    ID3D11Texture2D* tex = nullptr;
    auto hr = device_->OpenSharedResource(shareHandle, __uuidof(ID3D11Texture2D),
                                          (void**)(&tex));
    if (FAILED(hr)) {
      return nullptr;
    }

    D3D11_TEXTURE2D_DESC td;
    tex->GetDesc(&td);

    CudaD3D11Resource2BGR_A(tex, td.Width * td.Height * 4);

The shareHandle is provided by CEF (Chromium Embedded Framework) in:

    OnAcceleratedPaint(
        CefRefPtr browser,
        CefRenderHandler::PaintElementType type,
        const CefRenderHandler::RectList& dirtyRects,
        void* share_handle)

I think the shareHandle is OK, and I can use it to render web data. I don't know what is wrong; can someone help me?

compilation error for target 'bodysystemcuda.o'

After I installed cuda-10.0, I tried to test the example "nbody". However, it showed the following:

/usr/local/cuda-10.0/bin/nvcc -ccbin g++ -I../../common/inc -m64 -ftz=true -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o bodysystemcuda.o -c bodysystemcuda.cu
In file included from ../../common/inc/GL/freeglut.h:17:0,
from bodysystemcuda.cu:19:
../../common/inc/GL/freeglut_std.h:84:10: fatal error: GL/gl.h: No such file or directory
#include <GL/gl.h>
^~~~~~~~~
compilation terminated.
Makefile:304: recipe for target 'bodysystemcuda.o' failed
make: *** [bodysystemcuda.o] Error 1

I am sure that I have already installed freeglut and the related Mesa packages, and reinstalling them did not work. How can I solve this?

Is there a working CUDA toolkit version for cuda-samples/Samples other than 10.0?

I'm trying to execute cuda-samples using CUDA 9.0, but it says the samples' prerequisite is CUDA 10.0.

Is it impossible to make it work with CUDA 9.0? The thing is, my work involves running on gpgpu-sim, which provides CUDA 9.0. However, I get this error for every application in cuda-samples: version libcudart.so.9.0 not found. But the library does exist at the location where it says it was not found. Below shows its existence at that location.
image

The below is results from running samples.
image
image

ANY kinds of help would be helpful.

Thank you
Sungwoo Ahn

Errors running make SMS="61" on Ubuntu 20.04

Errors I'm getting on Ubuntu 20.04, GTX 1080

make SMS="61"

make[1]: Entering directory '/home/nate/cuda-samples-11.1/Samples/cudaNvSci'

WARNING - libnvscibuf.so not found, please install libnvscibuf.so <<<
WARNING - libnvscisync.so not found, please install libnvscisync.so <<<
WARNING - nvscibuf.h not found, please install nvscibuf.h <<<
WARNING - nvscisync.h not found, please install nvscisync.h <<<

I installed everything, here's my output for nvidia-smi

Fri Dec 11 09:35:27 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:01:00.0 On | N/A |
| 0% 49C P0 35W / 215W | 640MiB / 8116MiB | 11% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1380 G /usr/lib/xorg/Xorg 53MiB |
| 0 N/A N/A 3036 G /usr/lib/xorg/Xorg 309MiB |
| 0 N/A N/A 3163 G /usr/bin/gnome-shell 64MiB |
| 0 N/A N/A 3847 G ...AAAAAAAAA= --shared-files 16MiB |
| 0 N/A N/A 4960 G ...AAAAAAAA== --shared-files 82MiB |
| 0 N/A N/A 8021 G ...AAAAAAAAA= --shared-files 9MiB |
| 0 N/A N/A 8509 G ...AAAAAAAA== --shared-files 34MiB |
| 0 N/A N/A 11138 G gnome-control-center 1MiB |
| 0 N/A N/A 11459 G /opt/zoom/zoom 51MiB |
+-----------------------------------------------------------------------------+

CUDA_SEARCH_PATH=/usr/lib/x86_64-linux-gnu

Not finding a good answer on this anywhere for modern times in 2020.

unable to run cuda samples

install cuda

make[1]: Entering directory '/home/ahmed/cuda-samples/Samples/dmmaTensorCoreGemm'

GCC Version is greater or equal to 5.0.0 <<<
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../Common -m64 --std=c++11 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80 -o dmmaTensorCoreGemm.o -c dmmaTensorCoreGemm.cu
make[1]: /usr/local/cuda/bin/nvcc: Command not found
make[1]: *** [Makefile:348: dmmaTensorCoreGemm.o] Error 127
make[1]: Leaving directory '/home/ahmed/cuda-samples/Samples/dmmaTensorCoreGemm'
make: *** [Makefile:45: Samples/dmmaTensorCoreGemm/Makefile.ph_build] Error 2

CUDA tested by nbody with errors

I've installed CUDA 10.1 for a GeForce RTX 2080. But when I finished installing and tried to use nbody to test, there is an error, cudaErrorOperatingSystem:

CUDA error at bodysystemcuda_impl.h:183 code=304(cudaErrorOperatingSystem) "cudaGraphicsGLRegisterBuffer(&m_pGRes[i], m_pbo[i], cudaGraphicsMapFlagsNone)"

Can anyone help me solve this problem? Thank you!

Writing R algorithm in CUDA kernel

I am working on a data science project which needs CUDA at the back end. I want to know whether I can run an R script inside a CUDA kernel.

I have a restriction of not changing the pre-defined algorithm written in R. Can I utilize the GPU without changing the R code (algorithm)?

Here is my code that needs GPU acceleration using R:
    # Loading necessary library packages
    library(minet)
    library(igraph)
    library(d3r)
    library(jsonlite)
    library(stringr)

    # Operation on main data - ARACNE
    data = read.table(file = read_file_path, sep = '\t', header = TRUE)
    col_name_ori = colnames(data[,-1])
    edg_count = 0
    mim_main <- build.mim(data, estimator = "spearman", disc = "none", nbins = sqrt(NROW(dataset)))
    result_aracne_main <- aracne(mim_main, eps = 0)

    result_aracne_main <- result_aracne_main[,-1][-1,]
    for (row in 1:nrow(result_aracne_main)) {
      for (col in 1:ncol(result_aracne_main)) {
        if (result_aracne_main[row, col] > 0) {
          result_aracne_main[row, col] <- 1
          edg_count <- edg_count + 1
        } else {
          result_aracne_main[row, col] <- 0
        }
      }
    }

make error: nvcc fatal : Unknown option '--threads'

I was following the instructions in this post: https://weblogs.asp.net/dixin/setup-and-use-cuda-and-tensorflow-in-windows-subsystem-for-linux-2 and got to the step of downloading and building this repository; I used the master branch.
Error:

/usr/local/cuda/bin/nvcc -ccbin g++ -I../../Common  -m64    --threads 0 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o concurrentKernels.o -c concurrentKernels.cu
nvcc fatal   : Unknown option '--threads'
make: *** [Makefile:323: concurrentKernels.o] Error 1

Checking out Tag v11.2 had the same result, checking out Tag v11.1 looks like it worked with some warnings:

/usr/local/cuda/bin/nvcc -ccbin g++ -I../../Common  -m64    -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o concurrentKernels.o -c concurrentKernels.cu
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/usr/local/cuda/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o concurrentKernels concurrentKernels.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
mkdir -p ../../bin/x86_64/linux/release
cp concurrentKernels ../../bin/x86_64/linux/release

I just wanted to make people aware of this, and of how to get something to build in case they're trying something similar (and so that it can be fixed if it's a bug). Thanks!
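
The `--threads` flag was added to nvcc in CUDA 11.2, so newer samples tags fail against an older compiler. A hedged sketch of matching the checked-out tag to the installed toolkit (the version-parsing pattern is an assumption about `nvcc --version` output):

```shell
# Hypothetical: check out the samples tag matching the installed toolkit,
# since newer tags pass nvcc flags (e.g. --threads, CUDA 11.2+) that
# older compilers reject.
if command -v nvcc >/dev/null 2>&1; then
  CUDA_VER=$(nvcc --version | sed -n 's/.*release \([0-9][0-9]*\.[0-9]*\).*/\1/p')
  git checkout "v${CUDA_VER}"   # e.g. v11.1 for a CUDA 11.1 install
fi
```

Note the tag-naming caveat reported elsewhere in this tracker: the 10.1 releases were tagged without the leading `v`.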

[Samples/conjugateGradientCudaGraphs/Makefile.ph_build] Error 2

$ make TARGET_ARCH=x86_64

make[1]: Entering directory '/media/suryadi/DATA/learn/cuda-samples/Samples/bandwidthTest'
make[1]: Nothing to be done for 'all'.
make[1]: Leaving directory '/media/suryadi/DATA/learn/cuda-samples/Samples/bandwidthTest'
make[1]: Entering directory '/media/suryadi/DATA/learn/cuda-samples/Samples/conjugateGradientCudaGraphs'
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../Common  -m64    -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o conjugateGradientCudaGraphs.o -c conjugateGradientCudaGraphs.cu
conjugateGradientCudaGraphs.cu(326): error: identifier "cudaStreamCaptureModeGlobal" is undefined

conjugateGradientCudaGraphs.cu(326): error: too many arguments in function call

2 errors detected in the compilation of "/tmp/tmpxft_000032aa_00000000-14_conjugateGradientCudaGraphs.compute_75.cpp1.ii".
Makefile:292: recipe for target 'conjugateGradientCudaGraphs.o' failed
make[1]: *** [conjugateGradientCudaGraphs.o] Error 1
make[1]: Leaving directory '/media/suryadi/DATA/learn/cuda-samples/Samples/conjugateGradientCudaGraphs'
Makefile:45: recipe for target 'Samples/conjugateGradientCudaGraphs/Makefile.ph_build' failed
make: *** [Samples/conjugateGradientCudaGraphs/Makefile.ph_build] Error 2
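
This error typically means the samples checkout is newer than the installed toolkit: the capture-mode parameter only exists from CUDA 10.1 onward. A hedged sketch of the call the compiler is rejecting (assuming a `cudaStream_t stream` as in the sample):

```cuda
// CUDA 10.1+ added the capture-mode parameter; CUDA 10.0 only has the
// one-argument cudaStreamBeginCapture(stream), hence both "identifier
// undefined" and "too many arguments" with an older toolkit.
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
```

Building a samples tag that matches the installed toolkit version avoids the mismatch.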

My system:
Linux xtal 4.18.0-20-generic #21~18.04.1-Ubuntu SMP Wed May 8 08:43:37 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

How to compile demos correctly? Thank you very much in advance.

Warmest Regards,
Suryadi

nvjpegEncodeYUV NVJPEG_STATUS_INVALID_PARAMETER

Hi,
I use the sample named nvJPEG_encoder, and I use NVJPEG_CSS_GRAY as subsampling format to encode an image size(4x16), will get below error:
CUDA error at d:\cppworkspace\cuda-samples\samples\nvjpeg_encoder\nvjpeg_encoder.cpp:200 code=2(NVJPEG_STATUS_INVALID_PARAMETER) "nvjpegEncodeYUV(nvjpeg_handle, encoder_state, encode_params, &imgdesc, subsampling, widths[0], heights[0], NULL)"
I found if use the image width smaller than 8, use NVJPEG_CSS_GRAY encode an image will fail. Is there some limits when encode an image?

Support for tiled_partition() of greater tile size?

I am only able to get tiled_partition()s of size 32 or smaller. This seems unnecessarily limiting. Why aren't larger tiles supported? Also, I don't understand how syncing across thread groups smaller than 32 makes sense, since a warp of 32 threads executes in lockstep. Shouldn't thread groups come in multiples of 32?

Is there a way of partitioning a grid of blocks into different groups?

Out-of-bounds read in reduction_kernel.cu variant reduce6 for array size 1

Hi,
your reduce6 kernel from the reduction example does an out-of-bounds read for the trivial input size 1, here:

if (nIsPow2 || i + blockSize < n) mySum += g_idata[i + blockSize];

This is caused by size 1 also being classified as a power of 2 here:

extern "C" bool isPow2(unsigned int x) { return ((x & (x - 1)) == 0); }

which lets the kernel incorrectly assume it can read ahead by blockSize (== 1 here) even though i + blockSize is past the end of the array.

Best,
Marcin

Helper_math.h header file missing from cuda-samples

Problem description

I am currently trying to build [OpenKinect/libfreenect2](https://github.com/OpenKinect/libfreenect2) inside my `FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04` singularity image. While building, I however get the following error message.

/opt/libfreenect2/src/cuda_kde_depth_packet_processor.cu:39:25: fatal error: helper_math.h: No such file or directory
compilation terminated.
CMake Error at cuda_compile_generated_cuda_kde_depth_packet_processor.cu.o.cmake:207 (message):
  Error generating
  /opt/libfreenect2/build/CMakeFiles/cuda_compile.dir/src/./cuda_compile_generated_cuda_kde_depth_packet_processor.cu.o


CMakeFiles/freenect2.dir/build.make:80: recipe for target 'CMakeFiles/cuda_compile.dir/src/cuda_compile_generated_cuda_kde_depth_packet_processor.cu.o' failed
make[2]: *** [CMakeFiles/cuda_compile.dir/src/cuda_compile_generated_cuda_kde_depth_packet_processor.cu.o] Error 1
CMakeFiles/Makefile2:136: recipe for target 'CMakeFiles/freenect2.dir/all' failed
make[1]: *** [CMakeFiles/freenect2.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

I assume this is caused by the fact that the Docker image does not install the CUDA samples by default. I tried using this repository to resolve the error, but the helper_math.h header file appears not to be present in it. I therefore currently use zchee's repository instead. Would it be possible to add these header files to this repository so I can use the official one?


findvulkan.mk can't find vulkan in CentOS7

When building cuda-samples release 10.1.2 in CentOS7 x86-64, 'findvulkan.mk' doesn't automatically locate 'libvulkan.so' because it is looking in '${VULKAN_SDK_PATH}/lib' instead of '${VULKAN_SDK_PATH}/lib64' where the default repository's 'vulkan-devel' package deploys said library. This results in the vulkan related samples not building.

$ git log -1
commit c8483e0798cc90b49c2799ef6e8605ff0bfbfdf7
...
$ make
>>> WARNING - libvulkan.so not found, please install libvulkan.so <<<
...
$ locate libvulkan.so
/usr/lib64/libvulkan.so
...
$ yum provides /usr/lib64/libvulkan.so
...
vulkan-devel-1.1.97.0-1.el7.x86_64 : Vulkan development package
Repo        : base
Matched from:
Filename    : /usr/lib64/libvulkan.so
...
$ echo $VULKAN_SDK_PATH
/usr

The attached patch resolves the issue.
vulkan_el7.txt
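
Until the patch lands, a hypothetical workaround is to build a shim SDK layout whose `lib/` points at the distro's `lib64/` copy (the `~/vulkan-shim` path below is an arbitrary choice, not part of the samples):

```shell
# Hypothetical workaround until findvulkan.mk also searches lib64/:
# expose the distro's lib64 copy through a shim SDK layout.
mkdir -p "$HOME/vulkan-shim/lib"
if [ -e /usr/lib64/libvulkan.so ]; then
  ln -sf /usr/lib64/libvulkan.so "$HOME/vulkan-shim/lib/libvulkan.so"
fi
export VULKAN_SDK_PATH="$HOME/vulkan-shim"
```

With `VULKAN_SDK_PATH` pointing at the shim, findvulkan.mk's `${VULKAN_SDK_PATH}/lib` lookup resolves.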

10.1 releases tagged differently

Version 10.1 releases are tagged without the 'v' in their name, but all other versions look like "v${CUDA_VERSION}". Did this happen on purpose? I use these samples to verify my automatic CUDA installation, and this exception in the tag naming made my tests fail. If you can, please fix the names, or at least be consistent in the future.

Multiple two large numbers as strings and return result

//First attempt. Feel free to speed it up.

#include <stdio.h>
#include <string.h>
#define array_size 200

__global__
void collatz(char *mult1, char *mult2, char *cresult)
{
    int sizea, sizeb, i, j;
    char a[array_size] = {0};
    char b[array_size] = {0};
    int result[array_size] = {0};

    sizea = 0; while (*(mult1 + sizea) != '\0') { a[sizea] = *(mult1 + sizea); sizea++; }
    sizeb = 0; while (*(mult2 + sizeb) != '\0') { b[sizeb] = *(mult2 + sizeb); sizeb++; } // sizea and sizeb now hold the string lengths

    for (i = sizea - 1; i >= 0; i--) {
        for (j = sizeb - 1; j >= 0; j--) {
            // thanks: https://www.quora.com/How-do-I-multiply-two-large-numbers-stored-as-strings-in-the-C-C++-language
            result[i + j + 1] += (b[j] - '0') * (a[i] - '0');
        }
    }

    // carry propagation; stop at i > 0 so result[i - 1] never underflows
    for (i = sizea + sizeb; i > 0; i--) {
        if (result[i] >= 10) {
            result[i - 1] += result[i] / 10;
            result[i] %= 10;
        }
    }

    for (i = 0; i < (sizea + sizeb); i++) {
        cresult[i] = result[i] + '0';
    }
    cresult[sizea + sizeb] = '\0';  // terminate the output string

    return;
}

int main(void)
{
    char *x, *y, *z;
    cudaMallocManaged(&x, array_size * sizeof(char));
    cudaMallocManaged(&y, array_size * sizeof(char));
    cudaMallocManaged(&z, array_size * sizeof(char));

    strcpy(x, "92937842347899823492793849823489");
    strcpy(y, "3239478293345989359873457898489");
    strcpy(z, "0");

    collatz<<<1, 1>>>(x, y, z);

    cudaDeviceSynchronize();

    printf("%s times %s = %s", x, y, z);

    // Free memory
    cudaFree(x);
    cudaFree(y);
    cudaFree(z);

    return 0;
}

ERROR: failed checking for nvcc.

System information (version)
  • OpenCV == 4.2.0.34
  • Operating System / Platform == ubuntu 18.04
  • Miniconda-python 3.7
  • Cuda 11.0
Detailed description

I followed this instruction to install ffmpeg, but it fails when checking for nvcc.

Wrong results when executing cuda-samples/Samples/cudaTensorCoreGemm

When testing the code with a different configuration ("M=32, N=16, K=64, M_TILES = N_TILES = K_TILES = 1") and the CPU_DEBUG option turned on, the result of simple_wmma_gemm mismatches.

Line 460 should be:

if (aRow < m_ld && aCol < k_ld && bRow < n_ld && bCol < k_ld) 

Line 464 should be:

wmma::load_matrix_sync(a_frag, a + aCol + aRow * k_ld, k_ld);

Compile error

Dell desktop with a 1070 Ti.
Running `make` in the cuda-samples dir fails as below:

make[1]: Entering directory '/home/mengzhibin/CLionProjects/cuda-samples/Samples/p2pBandwidthLatencyTest'
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../Common -m64 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o p2pBandwidthLatencyTest.o -c p2pBandwidthLatencyTest.cu
Segmentation fault (core dumped)
Makefile:290: recipe for target 'p2pBandwidthLatencyTest.o' failed
make[1]: *** [p2pBandwidthLatencyTest.o] Error 139
make[1]: Leaving directory '/home/mengzhibin/CLionProjects/cuda-samples/Samples/p2pBandwidthLatencyTest'
Makefile:45: recipe for target 'Samples/p2pBandwidthLatencyTest/Makefile.ph_build' failed
make: *** [Samples/p2pBandwidthLatencyTest/Makefile.ph_build] Error 2

Can't compile cuda samples

Hi there

I have troubles compiling the cuda samples in version 10.2 on a fresh ubuntu 18.04.
here's the issue:

In file included from cudaNvSci.cpp:12:0: cudaNvSci.h:14:10: fatal error: nvscibuf.h: No such file or directory #include <nvscibuf.h> ^~~~~~~~~~~~ compilation terminated. Makefile:394: recipe for target 'cudaNvSci.o' failed make[1]: *** [cudaNvSci.o] Error 1 make[1]: Leaving directory '/home/federico/NVIDIA_CUDA-10.2_Samples/0_Simple/cudaNvSci' Makefile:51: recipe for target '0_Simple/cudaNvSci/Makefile.ph_build' failed make: *** [0_Simple/cudaNvSci/Makefile.ph_build] Error 2

any idea?

NVIDIA CUDA 11.0 RC CUDNN 8 Docker Tensorflow

I'm not sure which forum is appropriate for this. I downloaded the latest Docker image from NVIDIA NGC for NVIDIA/CUDA because it has the latest CUDA 11.0 with cuDNN 8.0. I also downloaded the latest tensorflow-20.03-tf3-.. from NGC.
My question is: how do I configure this so that the TensorFlow container looks for CUDA 11.0 and cuDNN 8.0? Right now it is looking for CUDA 10.2 and cuDNN 7. I thought maybe they had created a compatible image, because they updated it last week.

Thanks,

nvcc fatal : Unsupported gpu architecture 'compute_80'

Hi, I tried to run the sample code in ~/cuda-samples/Samples/p2pBandwidthLatencyTest on AWS g3 and p3 instances, with CUDA v10.0.130.

When I run make, it reports the following errors:

ubuntu@ip-172-31-21-246:~/cuda-samples/Samples/p2pBandwidthLatencyTest$ make
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../Common  -m64    -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80 -o p2pBandwidthLatencyTest.o -c p2pBandwidthLatencyTest.cu
nvcc fatal   : Unsupported gpu architecture 'compute_80'
Makefile:311: recipe for target 'p2pBandwidthLatencyTest.o' failed
make: *** [p2pBandwidthLatencyTest.o] Error 1
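
CUDA 10.0 predates the Ampere `compute_80` target that the current samples Makefiles request. The samples Makefiles accept an `SMS` override; a hedged sketch restricting code generation to Volta (sm_70, covering the p3's V100):

```shell
# Hypothetical: build only for architectures the installed toolkit knows
# about; CUDA 10.0 cannot generate compute_80 code.
if [ -f Makefile ]; then
  make SMS="70"
fi
```

Alternatively, checking out a samples tag that matches the installed toolkit avoids the mismatch entirely.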

Cuda-8.0 sample aarch64 platform error

Hello, I am unable to build the samples folder for the aarch64 platform using the command
make TARGET_ARCH=aarch64
Errors I encounter:

aarch64-linux-gnu-g++ -o simpleAssert_nvrtc simpleAssert.o -lcuda -lnvrtc
/usr/lib/gcc-cross/aarch64-linux-gnu/5/../../../../aarch64-linux-gnu/bin/ld: cannot find -lcuda
/usr/lib/gcc-cross/aarch64-linux-gnu/5/../../../../aarch64-linux-gnu/bin/ld: cannot find -lnvrtc

An illegal memory access was encountered (CUDA error no.=700)

I am trying to convert an RGB image to grayscale with CUDA on the GPU. The first rgb2gray conversion step works. But when I try to produce smoothImage from grayImage by applying a filter, the program compiles but fails at runtime with "an illegal memory access was encountered (CUDA error no.=700)" on line 160, CHECKCUDAERROR(cudaMemcpy(smoothImage, d3, size, cudaMemcpyDeviceToHost)).

#include <Timer.hpp>
#include <iostream>
#include <iomanip>
#include <cstring>
#include <sys/time.h>   // needed for gettimeofday() in realtime()
#include "CImg.h"

#define CHECKCUDAERROR(err)     {if (cudaSuccess != err) {fprintf(stderr, "CUDA ERROR: %s(CUDA error no.=%d). Line no. %d in file %s\n", cudaGetErrorString(err), err,  __LINE__, __FILE__); exit(EXIT_FAILURE); }}

using LOFAR::NSTimer;
using std::cout;
using std::cerr;
using std::endl;
using std::fixed;
using std::setprecision;
using cimg_library::CImg;

// Constants
const bool displayImages = false;
const bool saveAllImages = false;
const unsigned int HISTOGRAM_SIZE = 256;
const unsigned int BAR_WIDTH = 4;
const unsigned int CONTRAST_THRESHOLD = 80;
const float filter[] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 1.0f, 1.0f, 2.0f, 3.0f, 2.0f, 1.0f, 1.0f, 2.0f, 2.0f, 2.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};

unsigned char *d1 , *d2, *d3;

double realtime()
{
	struct timeval tp;
	struct timezone tzp;
	gettimeofday(&tp, &tzp);
	return tp.tv_sec + tp.tv_usec * 1e-6;
}
__global__ void rgb2gray(unsigned char *inputImage, unsigned char *grayImage, const int width, 
const int height) 
{

	int col = threadIdx.x + blockIdx.x * blockDim.x;
	int row = threadIdx.y + blockIdx.y * blockDim.y;

	if (col < width && row < height)
	{
	
			float grayPix = 0.0f;
			float r = static_cast< float >(inputImage[(row * width) + col]);
			float g = static_cast< float >(inputImage[(width * height) + (row * width) + col]);
			float b = static_cast< float >(inputImage[(2 * width * height) + (row * width) + col]);

			grayPix = (0.3f * r) + (0.59f * g) + (0.11f * b);

			grayImage[(row * width) + col] = static_cast< unsigned char >(grayPix);
		}
	}

__global__ void triangularSmooth(unsigned char *grayImage, unsigned char *smoothImage, 
					  const int width, const int height, const float *filter) 
{

	int col = threadIdx.x + blockIdx.x * blockDim.x;
	int row = threadIdx.y + blockIdx.y * blockDim.y;

	if (col < width && row < height)
		{
			unsigned int filterItem = 0;
			float filterSum = 0.0f;
			float smoothPix = 0.0f;

			for ( int fy = row - 2; fy < row + 3; fy++ ) {
				for ( int fx = col - 2; fx < col + 3; fx++ ) {
					if ( ((fy < 0) || (fy >= height)) || ((fx < 0) || (fx >= width)) ) {
						filterItem++;
						continue;
					}

					smoothPix += grayImage[(fy * width) + fx] * filter[filterItem];
					filterSum += filter[filterItem];
					filterItem++;
				}
			}

			smoothPix /= filterSum;
			smoothImage[(row * width) + col] = static_cast< unsigned char >(smoothPix);
		
		}
	
}
void rgb2gray_pl(unsigned char *inputImage, unsigned char *grayImage, const int width, const int height) {

    // Initialize device pointers.
    size_t size = width * height * sizeof(unsigned char);

    double cudamalloc_time = realtime();

    // Allocate device memory.
    CHECKCUDAERROR(cudaMalloc(&d1, 3*size));
	CHECKCUDAERROR(cudaMalloc(&d2, size));

	cout << fixed << setprecision(6);
	cout << "cudaMalloc: \t\t" << realtime() - cudamalloc_time << " seconds." << endl;

	double cudamemcpy_d1 = realtime();

    // Transfer from host to device.
    CHECKCUDAERROR(cudaMemcpy(d1, inputImage, 3*size, cudaMemcpyHostToDevice));

   	cout << fixed << setprecision(6);
	cout << "cudaMemcpy_host_to_device: \t\t" << realtime() - cudamemcpy_d1 << " seconds." << endl;

    double kernel_time = realtime();

	//define block and grid dimensions
	const dim3 dimGrid((int)ceil(((width +16) /16)), (int)ceil(((height + 16) /16)));
	const dim3 dimBlock(16, 16);
	
	//execute cuda kernel
	rgb2gray<<<dimGrid, dimBlock>>>(d1, d2, width, height);
	CHECKCUDAERROR(cudaPeekAtLastError());
    cout << fixed << setprecision(6);
	cout << "kernel: \t\t" << realtime() - kernel_time << " seconds." << endl;

	double cudamemcpy_d2 = realtime();

	//copy computed gray data array from device to host
	CHECKCUDAERROR(cudaMemcpy(grayImage, d2, size, cudaMemcpyDeviceToHost));

	cout << fixed << setprecision(6);
	cout << "cudaMemcpy_device_to_host: \t\t" << realtime() - cudamemcpy_d2 << " seconds." << endl;



}

void smooth_pl(unsigned char *grayImage, unsigned char *smoothImage, const int width, const int height) {

	size_t size = width * height * sizeof(unsigned char);

	double malloc_time = realtime();

	CHECKCUDAERROR(cudaMalloc(&d3, size));

	 cout << fixed << setprecision(6);
	cout << "triangular_smooth_malloc: \t\t" << realtime() - malloc_time << " seconds." << endl;

	double t_s_kernel_time = realtime();

	
	//execute cuda kernel
	const dim3 dimGrid((int)ceil(((width +16) /16)), (int)ceil(((height + 16) /16)));
	const dim3 dimBlock(16, 16);
	
	triangularSmooth<<<dimGrid, dimBlock>>>(d2, d3, width, height, filter);
	CHECKCUDAERROR(cudaPeekAtLastError());
    cout << fixed << setprecision(6);
	cout << "triangular_smooth_kernel: \t\t" << realtime() - t_s_kernel_time << " seconds." << endl;


	//copy computed smooth data array from device to host
	CHECKCUDAERROR(cudaMemcpy(smoothImage, d3, size, cudaMemcpyDeviceToHost));


	double cuda_free = realtime();

	CHECKCUDAERROR(cudaFree(d1));
	CHECKCUDAERROR(cudaFree(d2));
	CHECKCUDAERROR(cudaFree(d3));


	cout << fixed << setprecision(6);
	cout << "cudaFree: \t\t" << realtime() - cuda_free << " seconds." << endl;
}

int main(int argc, char *argv[]) 
{
	//NSTimer total = NSTimer("total", false, false);
        //double prev_time;

	if ( argc != 2 ) {
		cerr << "Usage: " << argv[0] << " <filename>" << endl;        
		cout << fixed << setprecision(6);
		return 1;
	}

	// Load the input image
	CImg< unsigned char > inputImage = CImg< unsigned char >(argv[1]);
	if ( displayImages ) {
		inputImage.display("Input Image");
	}
	if ( inputImage.spectrum() != 3 ) {
		//cerr << "The input must be a color image." << endl;
		//return 1;
	}
	double total_time = realtime();


	CImg<unsigned char> grayImage = CImg<unsigned char>(inputImage.width(), inputImage.height(), 1, 1);
	CImg< unsigned char > smoothImage = CImg< unsigned char >(grayImage.width(), grayImage.height(), 1, 1);

	rgb2gray_pl(inputImage.data(), grayImage.data(), inputImage.width(), inputImage.height());

    	cout << fixed << setprecision(6);
	cout << "Total: \t\t" << realtime() - total_time << " seconds." << endl;
	grayImage.save("./grayscale.bmp");

	smooth_pl(grayImage.data(),smoothImage.data(), grayImage.width(),  grayImage.height());

	smoothImage.save("./smooth.bmp");


	//allocate and initialize memory on device
	
	return 0;
}
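
One likely cause (an editor's hedged guess, not a confirmed diagnosis): the global `filter` array is a host-side constant, so `triangularSmooth` dereferences a host pointer on the device. A minimal sketch of launching the kernel with a device copy instead, inside `smooth_pl`:

```cuda
// Hypothetical fix: copy the 5x5 filter weights to device memory before
// the launch, and pass the device pointer to the kernel.
float *d_filter;
CHECKCUDAERROR(cudaMalloc(&d_filter, 25 * sizeof(float)));
CHECKCUDAERROR(cudaMemcpy(d_filter, filter, 25 * sizeof(float),
                          cudaMemcpyHostToDevice));
triangularSmooth<<<dimGrid, dimBlock>>>(d2, d3, width, height, d_filter);
CHECKCUDAERROR(cudaPeekAtLastError());
CHECKCUDAERROR(cudaFree(d_filter));
```

Declaring the filter as `__constant__` device memory and using `cudaMemcpyToSymbol` would be an alternative design for a small read-only table like this.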

NVJPEG_STATUS_EXECUTION_FAILED

Hi,
I use nvjpeg, and found that if the image size is smaller than 2*256, like 1x1, the nvJPEG_encoder sample receives the following error:
CUDA error at d:\cppworkspace\cuda-samples\samples\nvjpeg_encoder\nvjpeg_encoder.cpp:200 code=6(NVJPEG_STATUS_EXECUTION_FAILED) "nvjpegEncodeYUV(nvjpeg_handle, encoder_state, encode_params, &imgdesc, subsampling, widths[0], heights[0], NULL)"

The strange thing is that the bug is only triggered if the 1x1 image is the first one compressed; if it is compressed after a big image like 480x640, the bug disappears.
Can someone help me?

cuda-samples/Samples/cudaTensorCoreGemm

I get the following error for cuda-samples/Samples/cudaTensorCoreGemm:

Initializing...
GPU Device 0: "Tesla V100-SXM2-16GB" with compute capability 7.0

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 68 Kb
Computing...
CUDA error at cudaTensorCoreGemm.cu:474 code=77(cudaErrorIllegalAddress) "cudaEventSynchronize(stop)"

The error goes away if I decrease M_TILES, N_TILES and K_TILES to 168 (from 256).

Any ideas about this?

Thanks.
