Comments (6)
Which command did you run? Evaluation or training?
from once-for-all.
@Lyken17 It was during training.
@Lyken17 I am now trying to set up Horovod with Docker using the following image:
horovod/horovod:0.19.3-tf2.1.0-torch-mxnet1.6.0-py3.6-gpu
When I run python3 train_ofa_net.py, it gives me:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
When I run horovodrun -np 32 -H localhost:1 python train_ofa_net.py, I get:
2020-07-17 05:50:18.482988: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-07-17 05:50:18.483070: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-07-17 05:50:18.483081: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
There are not enough slots available in the system to satisfy the 32 slots
that were requested by the application:
python
Either request fewer slots for your application, or make more slots available
for use.
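The slot error above is consistent with the command itself: -H localhost:1 declares a single slot on localhost, while -np 32 asks for 32 processes. Below is a minimal sketch of that arithmetic, assuming the host:slots format shown in the command. The helper names parse_hosts and slots_ok are hypothetical and not part of Horovod, and treating a missing slot count as 1 is an assumption of this sketch.

```python
from typing import Dict


def parse_hosts(spec: str) -> Dict[str, int]:
    """Parse a spec such as 'host1:2,host2:4' into {'host1': 2, 'host2': 4}."""
    hosts = {}
    for entry in spec.split(","):
        name, _, slots = entry.strip().partition(":")
        # Assumption of this sketch: a host listed without ':N' counts as one slot.
        hosts[name] = int(slots) if slots else 1
    return hosts


def slots_ok(num_proc: int, spec: str) -> bool:
    """Return True when the requested process count fits in the declared slots."""
    return num_proc <= sum(parse_hosts(spec).values())


# The command from this thread: 32 processes requested, 1 slot declared.
print(slots_ok(32, "localhost:1"))  # False
# One process on one slot fits.
print(slots_ok(1, "localhost:1"))   # True
```

Under this reading, either lowering -np or raising the slot count after localhost: would make the two numbers agree, ideally matching the number of GPUs actually available.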
I also tried to build the Docker image from the Dockerfile mentioned in your repo with sudo docker build -t horovod:latest horovod-docker-gpu:
Sending build context to Docker daemon 5.12kB
Step 1/23 : FROM nvidia/cuda:10.1-devel-ubuntu18.04
---> 9e47e9dfcb9a
Step 2/23 : ENV TENSORFLOW_VERSION 2.1.0
---> Using cache
---> 6854f7c30a93
Step 3/23 : ENV PYTORCH_VERSION 1.4.0
---> Using cache
---> da2ca2208dfe
Step 4/23 : ENV TORCHVISION_VERSION 0.5.0
---> Using cache
---> b74dc652c42f
Step 5/23 : ENV CUDNN_VERSION 7.6.5.32-1+cuda10.1
---> Using cache
---> a7922a029f57
Step 6/23 : ENV NCCL_VERSION 2.4.8-1+cuda10.1
---> Using cache
---> 2103056392cc
Step 7/23 : ENV MXNET_VERSION 1.6.0
---> Using cache
---> 6830832cbd47
Step 8/23 : ARG python=3.6
---> Using cache
---> c143f1e18d0f
Step 9/23 : ENV PYTHON_VERSION ${python}
---> Using cache
---> e3278577073e
Step 10/23 : SHELL /bin/bash -cu
---> Using cache
---> f9737a939837
Step 11/23 : RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends build-essential cmake g++-4.8 git curl vim wget ca-certificates libcudnn7=${CUDNN_VERSION} libnccl2=${NCCL_VERSION} libnccl-dev=${NCCL_VERSION} libjpeg-dev libpng-dev python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python${PYTHON_VERSION}-distutils librdmacm1 libibverbs1 ibverbs-providers
---> Running in ac8a7c99eadb
Err:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease
Temporary failure resolving 'developer.download.nvidia.com'
Err:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:3 http://security.ubuntu.com/ubuntu bionic-security InRelease
Temporary failure resolving 'security.ubuntu.com'
Err:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 InRelease
Temporary failure resolving 'developer.download.nvidia.com'
Err:5 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:6 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
Temporary failure resolving 'archive.ubuntu.com'
Reading package lists...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-backports/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/bionic-security/InRelease Temporary failure resolving 'security.ubuntu.com'
W: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/InRelease Temporary failure resolving 'developer.download.nvidia.com'
W: Failed to fetch https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/InRelease Temporary failure resolving 'developer.download.nvidia.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
Reading package lists...
Building dependency tree...
Reading state information...
Package git is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
Package cmake is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package 'cmake' has no installation candidate
E: Unable to locate package g++-4.8
E: Couldn't find any package by glob 'g++-4.8'
E: Couldn't find any package by regex 'g++-4.8'
E: Package 'git' has no installation candidate
E: Unable to locate package curl
E: Unable to locate package vim
E: Unable to locate package wget
E: Unable to locate package libcudnn7
E: Unable to locate package libjpeg-dev
E: Unable to locate package libpng-dev
E: Unable to locate package python3.6
E: Couldn't find any package by glob 'python3.6'
E: Couldn't find any package by regex 'python3.6'
E: Unable to locate package python3.6-dev
E: Couldn't find any package by glob 'python3.6-dev'
E: Couldn't find any package by regex 'python3.6-dev'
E: Unable to locate package python3.6-distutils
E: Couldn't find any package by glob 'python3.6-distutils'
E: Couldn't find any package by regex 'python3.6-distutils'
E: Unable to locate package librdmacm1
E: Unable to locate package libibverbs1
E: Unable to locate package ibverbs-providers
The command '/bin/bash -cu apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends build-essential cmake g++-4.8 git curl vim wget ca-certificates libcudnn7=${CUDNN_VERSION} libnccl2=${NCCL_VERSION} libnccl-dev=${NCCL_VERSION} libjpeg-dev libpng-dev python${PYTHON_VERSION} python${PYTHON_VERSION}-de
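The repeated "Temporary failure resolving" lines in the build log suggest that name resolution failed inside the build container, which would explain why every package afterwards could not be located. Below is a minimal sketch for checking whether the same hostnames resolve from a given environment. It uses only the Python standard library; can_resolve and report are hypothetical helper names introduced here for illustration.

```python
import socket

# Hostnames copied from the failing apt-get output in the build log.
HOSTS = [
    "archive.ubuntu.com",
    "security.ubuntu.com",
    "developer.download.nvidia.com",
]


def can_resolve(host, resolver=socket.getaddrinfo):
    """Return True if the resolver yields at least one address for host."""
    try:
        return len(resolver(host, None)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved (for example, no working DNS).
        return False


def report(hosts):
    """Print one line per host saying whether it resolves."""
    for host in hosts:
        status = "resolves" if can_resolve(host) else "does NOT resolve"
        print(host, status)
```

Calling report(HOSTS) on the host machine and again inside a container started from the same base image shows whether the problem is specific to the container's DNS configuration.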
It seems like an error from your environment rather than from our code.
@Lyken17 It looks the same to me. Could you release the Dockerfile for the environment setup, or maybe release instructions for setting up the environment?
We didn't use Docker. We ran our experiments on servers with CUDA + PyTorch + Horovod installed. You may want to ask the Docker community for help, as it looks like the NVIDIA driver is not enabled in your container.
I'm going to close this issue, as it is not related to the OFA codebase.
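One way to check whether the NVIDIA driver is visible inside a container is to see whether nvidia-smi runs successfully there. Below is a minimal sketch using only the Python standard library. The function name nvidia_driver_visible is hypothetical, and the check assumes nvidia-smi would be on PATH whenever the driver is exposed to the container.

```python
import shutil
import subprocess


def nvidia_driver_visible() -> bool:
    """Return True if nvidia-smi is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No nvidia-smi binary found, so the driver is likely not exposed here.
        return False
    try:
        # 'nvidia-smi -L' lists the GPUs that the driver can see.
        result = subprocess.run(
            [exe, "-L"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

If this returns False inside the container but True on the host, the container was probably started without GPU access. On Docker 19.03 or later with the NVIDIA Container Toolkit installed, GPU access is typically requested with docker run --gpus all.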