Comments (18)

tmcdonell commented on July 22, 2024

I haven't tried running on AWS, so I'm not sure what might be going on, and the reported error is not very informative ):

Just as a sanity check, can you try compiling and running the deviceQueryDrv program from the CUDA samples?

from cuda.

NickHu commented on July 22, 2024

It compiles fine but gives the same error:

root@7d05fbb8d8d5:/opt/accelerate-llvm# git clone https://github.com/tmcdonell/cuda.git
Cloning into 'cuda'...
remote: Counting objects: 4345, done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 4345 (delta 1), reused 6 (delta 1), pack-reused 4335
Receiving objects: 100% (4345/4345), 1.79 MiB | 1.20 MiB/s, done.
Resolving deltas: 100% (2384/2384), done.
Checking connectivity... done.
root@7d05fbb8d8d5:/opt/accelerate-llvm# stack ghc cuda/examples/src/deviceQueryDrv/DeviceQuery.hs
[1 of 1] Compiling Main             ( cuda/examples/src/deviceQueryDrv/DeviceQuery.hs, cuda/examples/src/deviceQueryDrv/DeviceQuery.o )
Linking cuda/examples/src/deviceQueryDrv/DeviceQuery ...
root@7d05fbb8d8d5:/opt/accelerate-llvm# cuda/examples/src/deviceQueryDrv/DeviceQuery
DeviceQuery: Status.toEnum: Cannot match -1
CallStack (from HasCallStack):
  error, called at src/Foreign/CUDA/Driver/Error.chs:372:22 in cuda-0.10.0.0-Lq313TS76CJ6ufZOzm0zPz:Foreign.CUDA.Driver.Error

tmcdonell commented on July 22, 2024

Oh, I meant, the one which ships from NVIDIA as part of the CUDA toolkit. It probably lives in /usr/local/cuda/samples/1_Utilities/deviceQueryDrv ?

NickHu commented on July 22, 2024

I can't find the sample you mentioned anywhere on the filesystem or in https://github.com/NVIDIA/cuda-samples.git.
That repository does have deviceQuery, but it fails when I run it:

root@7d05fbb8d8d5:/tmp/cuda-samples# bin/x86_64/linux/release/deviceQuery
bin/x86_64/linux/release/deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

NickHu commented on July 22, 2024

I found a version in https://github.com/zchee/cuda-sample.git, and it compiles with some minor tweaking (commenting out line 202: GENCODE_FLAGS += -gencode arch=compute_20,code=compute_20), but it also fails:

root@7d05fbb8d8d5:/tmp/cuda-sample/1_Utilities/deviceQueryDrv# ./deviceQueryDrv
./deviceQueryDrv Starting...
CUDA Device Query (Driver API) statically linked version
cuInit(0) returned -1
-> (null)
Result = FAIL

tmcdonell commented on July 22, 2024

Okay, I think that CUDA is not installed correctly on this system. It looks like somebody installed a new version of the CUDA toolkit but did not update the device driver at the same time to match. Try reinstalling / updating the driver?

NickHu commented on July 22, 2024

@tmcdonell I can confirm this. I managed to fix this by adding the nvidia driver ppa, and upgrading my nvidia driver. I suspect what happened was that when I installed CUDA it installed its own driver, which apparently causes issues. See NVIDIA/nvidia-docker#802. The instructions for nvidia-docker AWS provisioning are not up to date, and I didn't realise before installing nvidia-docker version 1, so I had to remove it and install version 2, which probably left some cruft on my system. Closing this now.

NickHu commented on July 22, 2024

Actually, I only got it to work with the nvidia/cuda image, which is on CUDA version 9.0.176, but I still get the error on tmcdonell/accelerate-llvm which is on CUDA version 9.2.148.

NickHu commented on July 22, 2024

For clarity, on the host machine I have version 396.54 of the nvidia driver, and no cuda-toolkit installed (this is how nvidia-docker recommends the machine is set up). I suspect perhaps I just need to install the version of the nvidia driver that the cuda-toolkit in the tmcdonell/accelerate-llvm image expects, but I don't know how to ascertain that.
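
One way to ascertain this is to compare the installed driver version against the minimum that the toolkit documents. The sketch below is a rough check, not something provided by nvidia-docker or the cuda package. The helper names driver_version and version_ge and the REQUIRED_DRIVER variable are hypothetical, the line format of /proc/driver/nvidia/version is assumed from typical Linux driver installs, sort -V assumes GNU coreutils, and the default minimum of 396.37 for CUDA 9.2.148 is what NVIDIA's release notes list as far as I know, so it should be confirmed against the notes for the toolkit actually in the image.

```shell
#!/bin/sh
# Extract the driver version from text shaped like /proc/driver/nvidia/version.
# Assumed line format (not verified for every driver release):
#   NVRM version: NVIDIA UNIX x86_64 Kernel Module  396.54  <build date>
driver_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
    }'
}

# Succeed when version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort), which is a GNU coreutils extension.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed minimum for CUDA 9.2.148; override with REQUIRED_DRIVER if needed.
required="${REQUIRED_DRIVER:-396.37}"

if [ -r /proc/driver/nvidia/version ]; then
    have="$(driver_version < /proc/driver/nvidia/version)"
    if version_ge "$have" "$required"; then
        echo "driver $have satisfies the assumed minimum $required"
    else
        echo "driver '$have' is older than the assumed minimum $required"
    fi
else
    echo "no NVIDIA kernel driver is visible from here"
fi
```

With the 396.54 driver mentioned above and a 396.37 minimum, this check passes, which would point away from a plain driver/toolkit version mismatch.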

NickHu commented on July 22, 2024

Funnily enough, it also works fine on nvidia/cuda:9.2-devel-ubuntu16.04 - perhaps this image was generated more recently than tmcdonell/accelerate-llvm, and is compatible with the nvidia 396.54 driver. I'll try rebuilding tmcdonell/accelerate-llvm and see if it works, but I wonder why this seems so unstable.
EDIT: This didn't seem to fix anything.

tmcdonell commented on July 22, 2024

Hm, interesting. I haven't played with the docker images in a while. If you manage to fix it and could send a patch that would be awesome. Otherwise, I'll see about setting up an aws account and trying it out on a p3.2xlarge.

NickHu commented on July 22, 2024

https://github.com/tmcdonell/accelerate-llvm/blob/6400e4fc20f2091c3a928eb5678a4e3f8166a4c5/Dockerfile#L14 seems to be the culprit; when this line is removed, I can compile and run deviceQueryDrv fine

NickHu commented on July 22, 2024

Unfortunately, even though that works now, when I try to run my accelerate program, I get the following error

*** Warning: Unknown CUDA device compute capability: 7.0
*** Please submit a bug report at https://github.com/tmcdonell/cuda/issues

tmcdonell commented on July 22, 2024

Ah I guess that the missing libcuda.so.1 link has been fixed with the newer cuda release (or not needed anymore?)

Oh, I only recently added the device properties for compute 7x to the cuda package, but looks like I did not update the stack files for accelerate-llvm to point to it yet.

I pushed patches containing both these changes just now, the new image is building and should be done soon...

NickHu commented on July 22, 2024

Hmm, looks like removing that line causes the Docker build to fail at the stage of compiling accelerate - what I did was I removed it in the existing image, and allowed the linker to find

libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f37643c3000)

in my compiled binaries.

I don't know where each of these shared libraries comes from, but they are different:

root@36d65cc14b9d:/opt/accelerate-llvm/kmeans# md5sum /usr/local/cuda/lib64/stubs/libcuda.so
1725f80d0ef5e44dc61c8d81f02da761  /usr/local/cuda/lib64/stubs/libcuda.so
root@36d65cc14b9d:/opt/accelerate-llvm/kmeans# md5sum /usr/lib/x86_64-linux-gnu/libcuda.so.1
0161af92fdca2cec1bd72b0ade604f05  /usr/lib/x86_64-linux-gnu/libcuda.so.1

Maybe the easy fix is to symlink to the second one instead?
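
The two files differ because the copy under stubs/ is a link-time placeholder shipped with the CUDA toolkit, while the one under /usr/lib/x86_64-linux-gnu is provided by the driver. A quick way to see which one a compiled binary picks up is to read the ldd output and follow any symlink. The following is a minimal sketch under stated assumptions: resolved_libcuda, is_stub_path and check_libcuda are hypothetical helper names, readlink -f assumes GNU coreutils, and treating any path containing /stubs/ as the stub is a heuristic based on the toolkit layout shown above.

```shell
#!/bin/sh
# Print the absolute path that libcuda.so.1 resolves to, given ldd output on stdin.
# Prints nothing when the library is not referenced or is reported as "not found".
resolved_libcuda() {
    awk '$1 == "libcuda.so.1" && $2 == "=>" && $3 ~ /^\// { print $3; exit }'
}

# Succeed when the path lies inside a toolkit "stubs" directory (heuristic).
is_stub_path() {
    case "$1" in
        */stubs/*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Report which libcuda.so.1 the given binary would load.
# Returns 0 for a driver library, 1 for a stub, 2 when nothing was resolved.
check_libcuda() {
    path="$(ldd "$1" | resolved_libcuda)"
    if [ -z "$path" ]; then
        echo "$1: libcuda.so.1 is not linked or was not found"
        return 2
    fi
    real="$(readlink -f "$path")"   # follow symlinks such as libcuda.so.1 -> stubs/libcuda.so
    if is_stub_path "$real"; then
        echo "$1: resolves to the stub $real"
        return 1
    fi
    echo "$1: resolves to the driver library $real"
}
```

For example, check_libcuda ./deviceQueryDrv would be expected to report the /usr/lib/x86_64-linux-gnu path once no libcuda.so.1 link into stubs/ shadows it.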

NickHu commented on July 22, 2024

I have confirmed that with the latest image, if I simply rm /usr/local/cuda/lib64/libcuda.so.1, the linker manages to find libcuda.so.1 as above and my accelerate programs run correctly.

tmcdonell commented on July 22, 2024

It is strange that building accelerate-llvm-ptx does not automatically find the library at /usr/lib/x86_64-linux-gnu/libcuda.so.1 like it does when running the programs. Maybe you are right, and the correct solution is to instead create the symlink to that point, rather than the one in /usr/local/cuda... as I did.

tmcdonell commented on July 22, 2024

> @tmcdonell I can confirm this. I managed to fix this by adding the nvidia driver ppa, and upgrading my nvidia driver.

@NickHu to clarify, did you install the nvidia-396 package inside the docker image, or on your host machine?
