Hi, I'm getting the following error with Haskell cuda

Driver failure on AWS p3.2xlarge instance about cuda HOT 18 CLOSED

tmcdonell commented on July 22, 2024

Driver failure on AWS p3.2xlarge instance

from cuda.

Comments (18)

tmcdonell commented on July 22, 2024

I haven't tried running on aws, so I'm not sure what might be going on, and the error reported is not so informative ):

Just as a sanity check, can you try compiling and running the deviceQueryDrv program from the CUDA samples?

NickHu commented on July 22, 2024

It compiles fine but gives the same error:

root@7d05fbb8d8d5:/opt/accelerate-llvm# git clone https://github.com/tmcdonell/cuda.git
Cloning into 'cuda'...
remote: Counting objects: 4345, done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 4345 (delta 1), reused 6 (delta 1), pack-reused 4335
Receiving objects: 100% (4345/4345), 1.79 MiB | 1.20 MiB/s, done.
Resolving deltas: 100% (2384/2384), done.
Checking connectivity... done.
root@7d05fbb8d8d5:/opt/accelerate-llvm# stack ghc cuda/examples/src/deviceQueryDrv/DeviceQuery.hs
[1 of 1] Compiling Main             ( cuda/examples/src/deviceQueryDrv/DeviceQuery.hs, cuda/examples/src/deviceQueryDrv/DeviceQuery.o )
Linking cuda/examples/src/deviceQueryDrv/DeviceQuery ...
root@7d05fbb8d8d5:/opt/accelerate-llvm# cuda/examples/src/deviceQueryDrv/DeviceQuery
DeviceQuery: Status.toEnum: Cannot match -1
CallStack (from HasCallStack):
  error, called at src/Foreign/CUDA/Driver/Error.chs:372:22 in cuda-0.10.0.0-Lq313TS76CJ6ufZOzm0zPz:Foreign.CUDA.Driver.Error

tmcdonell commented on July 22, 2024

Oh, I meant, the one which ships from NVIDIA as part of the CUDA toolkit. It probably lives in /usr/local/cuda/samples/1_Utilities/deviceQueryDrv ?

NickHu commented on July 22, 2024

I can't find the sample you mentioned anywhere on the filesystem or in https://github.com/NVIDIA/cuda-samples.git.
That repository does have deviceQuery, but it fails when I run it:

root@7d05fbb8d8d5:/tmp/cuda-samples# bin/x86_64/linux/release/deviceQuery
bin/x86_64/linux/release/deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

NickHu commented on July 22, 2024

I found a version in https://github.com/zchee/cuda-sample.git, and it compiles with some minor tweaking (commenting out line 202: GENCODE_FLAGS += -gencode arch=compute_20,code=compute_20), but its output is also a failure

root@7d05fbb8d8d5:/tmp/cuda-sample/1_Utilities/deviceQueryDrv# ./deviceQueryDrv
./deviceQueryDrv Starting...
CUDA Device Query (Driver API) statically linked version
cuInit(0) returned -1
-> (null)
Result = FAIL

tmcdonell commented on July 22, 2024

Okay, I think that CUDA is not installed correctly on this system. It looks like somebody installed a new version of the CUDA toolkit but did not update the device driver to match. Try reinstalling or updating the driver?

NickHu commented on July 22, 2024

@tmcdonell I can confirm this. I managed to fix this by adding the nvidia driver ppa, and upgrading my nvidia driver. I suspect what happened was that when I installed CUDA it installed its own driver, which apparently causes issues. See NVIDIA/nvidia-docker#802. The instructions for nvidia-docker AWS provisioning are not up to date, and I didn't realise before installing nvidia-docker version 1, so I had to remove it and install version 2, which probably left some cruft on my system. Closing this now.

NickHu commented on July 22, 2024

Actually, I only got it to work with the nvidia/cuda image, which is on CUDA version 9.0.176, but I still get the error on tmcdonell/accelerate-llvm which is on CUDA version 9.2.148.

NickHu commented on July 22, 2024

For clarity, on the host machine I have version 396.54 of the nvidia driver, and no cuda-toolkit installed (this is how nvidia-docker recommends the machine is set up). I suspect perhaps I just need to install the version of the nvidia driver that the cuda-toolkit in the tmcdonell/accelerate-llvm image expects, but I don't know how to ascertain that.
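One rough way to check whether a host driver is new enough for a given toolkit is to compare version strings. The sketch below is only illustrative: `version_ge` and `min_driver_for_cuda` are hypothetical helper names, and the minimum driver versions are recalled from NVIDIA's CUDA release notes, so they should be verified there before relying on them.

```shell
# Succeeds if dotted version $1 is greater than or equal to $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum Linux driver for a CUDA toolkit release.
# Assumption: values recalled from NVIDIA's release notes; verify them there.
min_driver_for_cuda() {
    case "$1" in
        9.0.*)   echo 384.81 ;;
        9.1.*)   echo 390.46 ;;
        9.2.88)  echo 396.26 ;;
        9.2.148) echo 396.37 ;;
        *)       return 1 ;;   # unknown release: look it up in the release notes
    esac
}
```

For example, `version_ge 396.54 "$(min_driver_for_cuda 9.2.148)"` succeeds, which would suggest that the 396.54 driver is not too old for CUDA 9.2.148 and that the failure has another cause. On a live system the installed driver version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.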

NickHu commented on July 22, 2024

Funnily enough, it also works fine on nvidia/cuda:9.2-devel-ubuntu16.04 - perhaps this image was generated more recently than tmcdonell/accelerate-llvm, and is compatible with the nvidia 396.54 driver. I'll try rebuilding tmcdonell/accelerate-llvm and see if it works, but I wonder why this seems so unstable.
EDIT: This didn't seem to fix anything.

tmcdonell commented on July 22, 2024

Hm, interesting. I haven't played with the docker images in a while. If you manage to fix it and could send a patch that would be awesome. Otherwise, I'll see about setting up an aws account and trying it out on a p3.2xlarge.

NickHu commented on July 22, 2024

https://github.com/tmcdonell/accelerate-llvm/blob/6400e4fc20f2091c3a928eb5678a4e3f8166a4c5/Dockerfile#L14 seems to be the culprit; when this line is removed, I can compile and run deviceQueryDrv fine

NickHu commented on July 22, 2024

Unfortunately, even though that works now, when I try to run my accelerate program, I get the following error

*** Warning: Unknown CUDA device compute capability: 7.0
*** Please submit a bug report at https://github.com/tmcdonell/cuda/issues

tmcdonell commented on July 22, 2024

Ah I guess that the missing libcuda.so.1 link has been fixed with the newer cuda release (or not needed anymore?)

Oh, I only recently added the device properties for compute 7x to the cuda package, but looks like I did not update the stack files for accelerate-llvm to point to it yet.

I pushed patches containing both these changes just now, the new image is building and should be done soon...

NickHu commented on July 22, 2024

Hmm, looks like removing that line causes the Docker build to fail at the stage of compiling accelerate - what I did was I removed it in the existing image, and allowed the linker to find

libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f37643c3000)

in my compiled binaries.

I don't know where each of these shared libraries comes from, but they are different:

root@36d65cc14b9d:/opt/accelerate-llvm/kmeans# md5sum /usr/local/cuda/lib64/stubs/libcuda.so
1725f80d0ef5e44dc61c8d81f02da761  /usr/local/cuda/lib64/stubs/libcuda.so
root@36d65cc14b9d:/opt/accelerate-llvm/kmeans# md5sum /usr/lib/x86_64-linux-gnu/libcuda.so.1
0161af92fdca2cec1bd72b0ade604f05  /usr/lib/x86_64-linux-gnu/libcuda.so.1

Maybe the easy fix is to symlink to the second one instead?

NickHu commented on July 22, 2024

I have confirmed that with the latest image, if I simply rm /usr/local/cuda/lib64/libcuda.so.1, the linker manages to find libcuda.so.1 as above and my accelerate programs run correctly.
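For anyone hitting the same thing, a quick check of whether a binary is picking up the stub library rather than the real driver library might look like the sketch below. `is_cuda_stub` and `resolved_libcuda` are hypothetical helper names, and the check assumes that stub libraries live under a directory named `stubs`, as in the paths above.

```shell
# Hypothetical helper: succeeds if the given libcuda path, after following
# symlinks, lives in a "stubs" directory (a link-time placeholder rather than
# the real driver library).
is_cuda_stub() {
    resolved=$(readlink -f -- "$1") || return 1
    case "$resolved" in
        */stubs/*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Hypothetical helper: print the path the dynamic linker chooses for
# libcuda.so.1 in a binary, read from ldd's "name => path (address)" lines.
resolved_libcuda() {
    ldd "$1" | awk '$1 == "libcuda.so.1" && $2 == "=>" { print $3 }'
}
```

Running `is_cuda_stub "$(resolved_libcuda ./my-program)"` (with `./my-program` standing in for a compiled binary) should then succeed only when the program would load the stub at run time.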

tmcdonell commented on July 22, 2024

It is strange that building accelerate-llvm-ptx does not automatically find the library at /usr/lib/x86_64-linux-gnu/libcuda.so.1 like it does when running the programs. Maybe you are right, and the correct solution is to create the symlink to that location instead, rather than to the one in /usr/local/cuda... as I did.

tmcdonell commented on July 22, 2024

Quoting NickHu: "@tmcdonell I can confirm this. I managed to fix this by adding the nvidia driver ppa, and upgrading my nvidia driver."

@NickHu to clarify, did you install the nvidia-396 package inside the docker image, or on your host machine?
