Comments (6)

Lyken17 commented on May 20, 2024

Which command did you run? Evaluation or training?

from once-for-all.

abhiagwl4262 commented on May 20, 2024

@Lyken17 It was during training.

abhiagwl4262 commented on May 20, 2024

@Lyken17 Now I am trying to set up Horovod with Docker, using the following image:

horovod/horovod:0.19.3-tf2.1.0-torch-mxnet1.6.0-py3.6-gpu

When I run python3 train_ofa_net.py, it gives me:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from

When I run horovodrun -np 32 -H localhost:1 python train_ofa_net.py, I get:

2020-07-17 05:50:18.482988: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-07-17 05:50:18.483070: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-07-17 05:50:18.483081: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

There are not enough slots available in the system to satisfy the 32 slots
that were requested by the application:
python

Either request fewer slots for your application, or make more slots available
for use.
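
The slot error above follows from the host specification: `-H localhost:1` declares one slot on localhost, while `-np 32` asks for 32 processes. A minimal sketch of that arithmetic, assuming the comma-separated `host:slots` format that `horovodrun -H` accepts. The helper names are hypothetical and not part of Horovod:

```python
def total_slots(host_spec: str) -> int:
    """Sum the slot counts in a 'host1:slots,host2:slots' string."""
    total = 0
    for entry in host_spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Assumption for this sketch: a bare hostname counts as one slot.
        _, _, slots = entry.partition(":")
        total += int(slots) if slots else 1
    return total


def check_request(num_proc: int, host_spec: str) -> str:
    """Compare the requested process count with the declared slots."""
    available = total_slots(host_spec)
    if num_proc > available:
        return f"requested {num_proc} processes but only {available} slot(s) declared"
    return "ok"


print(check_request(32, "localhost:1"))
print(check_request(4, "localhost:4"))
```

On a single machine the process count usually matches the number of GPUs, for example `horovodrun -np 4 -H localhost:4 python train_ofa_net.py` on a four-GPU server.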

I also tried to build the Docker image from the Dockerfile mentioned in your repo, using sudo docker build -t horovod:latest horovod-docker-gpu:

Sending build context to Docker daemon 5.12kB
Step 1/23 : FROM nvidia/cuda:10.1-devel-ubuntu18.04
---> 9e47e9dfcb9a
Step 2/23 : ENV TENSORFLOW_VERSION 2.1.0
---> Using cache
---> 6854f7c30a93
Step 3/23 : ENV PYTORCH_VERSION 1.4.0
---> Using cache
---> da2ca2208dfe
Step 4/23 : ENV TORCHVISION_VERSION 0.5.0
---> Using cache
---> b74dc652c42f
Step 5/23 : ENV CUDNN_VERSION 7.6.5.32-1+cuda10.1
---> Using cache
---> a7922a029f57
Step 6/23 : ENV NCCL_VERSION 2.4.8-1+cuda10.1
---> Using cache
---> 2103056392cc
Step 7/23 : ENV MXNET_VERSION 1.6.0
---> Using cache
---> 6830832cbd47
Step 8/23 : ARG python=3.6
---> Using cache
---> c143f1e18d0f
Step 9/23 : ENV PYTHON_VERSION ${python}
---> Using cache
---> e3278577073e
Step 10/23 : SHELL /bin/bash -cu
---> Using cache
---> f9737a939837
Step 11/23 : RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends build-essential cmake g++-4.8 git curl vim wget ca-certificates libcudnn7=${CUDNN_VERSION} libnccl2=${NCCL_VERSION} libnccl-dev=${NCCL_VERSION} libjpeg-dev libpng-dev python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python${PYTHON_VERSION}-distutils librdmacm1 libibverbs1 ibverbs-providers
---> Running in ac8a7c99eadb
Err:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease
Temporary failure resolving 'developer.download.nvidia.com'
Err:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:3 http://security.ubuntu.com/ubuntu bionic-security InRelease
Temporary failure resolving 'security.ubuntu.com'
Err:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 InRelease
Temporary failure resolving 'developer.download.nvidia.com'
Err:5 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:6 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
Temporary failure resolving 'archive.ubuntu.com'
Reading package lists...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-backports/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/bionic-security/InRelease Temporary failure resolving 'security.ubuntu.com'
W: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/InRelease Temporary failure resolving 'developer.download.nvidia.com'
W: Failed to fetch https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/InRelease Temporary failure resolving 'developer.download.nvidia.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
Reading package lists...
Building dependency tree...
Reading state information...
Package git is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

Package cmake is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'cmake' has no installation candidate
E: Unable to locate package g++-4.8
E: Couldn't find any package by glob 'g++-4.8'
E: Couldn't find any package by regex 'g++-4.8'
E: Package 'git' has no installation candidate
E: Unable to locate package curl
E: Unable to locate package vim
E: Unable to locate package wget
E: Unable to locate package libcudnn7
E: Unable to locate package libjpeg-dev
E: Unable to locate package libpng-dev
E: Unable to locate package python3.6
E: Couldn't find any package by glob 'python3.6'
E: Couldn't find any package by regex 'python3.6'
E: Unable to locate package python3.6-dev
E: Couldn't find any package by glob 'python3.6-dev'
E: Couldn't find any package by regex 'python3.6-dev'
E: Unable to locate package python3.6-distutils
E: Couldn't find any package by glob 'python3.6-distutils'
E: Couldn't find any package by regex 'python3.6-distutils'
E: Unable to locate package librdmacm1
E: Unable to locate package libibverbs1
E: Unable to locate package ibverbs-providers
The command '/bin/bash -cu apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends build-essential cmake g++-4.8 git curl vim wget ca-certificates libcudnn7=${CUDNN_VERSION} libnccl2=${NCCL_VERSION} libnccl-dev=${NCCL_VERSION} libjpeg-dev libpng-dev python${PYTHON_VERSION} python${PYTHON_VERSION}-de
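
Every `Temporary failure resolving` line in the log above is a DNS failure inside the build container, so `apt-get update` fetched no package indexes and each package after it is reported as missing. A minimal sketch for checking name resolution from the affected environment, using only the standard library. The function name and the injectable `resolve` parameter are illustrative choices, not part of any tool mentioned in this thread:

```python
import socket

# Hosts taken from the failing apt-get log above.
APT_HOSTS = [
    "archive.ubuntu.com",
    "security.ubuntu.com",
    "developer.download.nvidia.com",
]


def unresolved_hosts(hosts, resolve=socket.getaddrinfo):
    """Return the subset of hosts whose names cannot be resolved."""
    failed = []
    for host in hosts:
        try:
            resolve(host, 443)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

Calling `unresolved_hosts(APT_HOSTS)` inside the container returns the hosts that fail to resolve. An empty list means DNS is working and the build failure lies elsewhere.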

Lyken17 commented on May 20, 2024

It seems like an error in your environment rather than in our code.

abhiagwl4262 commented on May 20, 2024

@Lyken17 It looks that way to me too. Could you release the Dockerfile for the environment setup, or perhaps instructions for setting up the environment?

Lyken17 commented on May 20, 2024

We didn't use Docker. We ran our experiments on servers with CUDA + PyTorch + Horovod installed. You may want to ask the Docker community for help, as it looks like the NVIDIA driver is not enabled in your container.

I'm going to close this issue as it is not related to the OFA codebase.
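
One way to check whether a container can see the NVIDIA driver is to ask `nvidia-smi` to list the GPUs. A minimal sketch, assuming `nvidia-smi` is installed wherever the driver is exposed. The function name and the injectable `run` parameter are hypothetical:

```python
import subprocess


def list_visible_gpus(run=subprocess.run):
    """Return the 'GPU N: ...' lines printed by `nvidia-smi -L`, or [] if none."""
    try:
        result = run(["nvidia-smi", "-L"], capture_output=True, text=True, check=False)
    except FileNotFoundError:
        # nvidia-smi is not on PATH, so the driver is likely not exposed here.
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.startswith("GPU ")]
```

An empty list inside the container, with a non-empty list on the host, suggests the container was started without GPU access. With Docker 19.03 or later and the NVIDIA Container Toolkit, that is usually granted with `docker run --gpus all`.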
