elderresearch / gpu_docker
Dockerfiles for our DC office GPU development machine.
Atom Hydrogen only has options to pass a session token or cookie when connecting to a remote kernel. The `IPython.passwd()` function adds a random salt to the SHA-1 hash, making it non-repeatable, so the Hydrogen config params would have to change with each new session. A workaround is to create our own token by hashing the password with a predetermined salt:
```python
import hashlib
import os

# In jupyter_notebook_config.py: `c` is the Jupyter config object and
# `password` is assumed to hold the user's plaintext password.
salt = str(os.getenv('PORT', 8888))
h = hashlib.new('sha1')
h.update(password.encode() + salt.encode())
c.NotebookApp.token = h.hexdigest()
```
To duplicate the issue:

This is more a system issue than a GPU issue, but this feels like the best place to keep track of it. Proof: `docker exec` into any running session and check `groups`.
The creator of YOLO has a C-based, GPU-capable deep neural network framework called Darknet, available at https://pjreddie.com/darknet/. It would be nice to have it on the GPU box to run object detection in addition to PyTorch, Keras, etc. This would likely be directly applicable to Project Iris.
Open Neural Network Exchange (ONNX) is an important standardization effort that allows models to be exchanged (imported/exported) between frameworks. As far as I can tell, most framework developers are going in this direction and support the idea. I believe all but TensorFlow support ONNX.
package description here
I'm not sure that this is strictly necessary. I will be investigating independently today to form an opinion. I'd welcome anyone else's experience using TF Hub.
The end goal is to have a button each user clicks to create a persistent container.
We need to include nltk, spaCy, and gensim in the Python environments. Justification: most DNN frameworks do not include fast text-mining data-prep tools, and their developers recommend using one or a combination of the above libraries.
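A minimal sketch of the image change (assuming the images install Python packages via pip3, as the other issues here suggest):

```dockerfile
# text-mining data-prep libraries recommended alongside the DNN frameworks
RUN pip3 install nltk spacy gensim
```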
astewart observed: a running docker image for `eri_dev` was not appearing in the list of currently active sessions in `launch.py` or on the launch app. A quick review indicated this occurred because the `active_eri_images` function in `launch.py` was looking for `eri_dev:latest`, but the image itself was launched (from the CLI) as just `eri_dev` -- the `:latest` was implicit. Easy fix.
I've put together bare-minimum tests for each image (basically asserting that Python packages are installed); we should figure out a way to run these automatically, either on image creation or afterwards as part of a CI process.
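As a sketch of what these bare-minimum tests could look like (the package list and paths are illustrative, not what's actually in the repo):

```python
# test_packages.py -- run inside a container, e.g.:
#   docker run --rm eri_dev:latest python3 -m pytest /tests/test_packages.py
import importlib

import pytest

# illustrative package list; each image would pin its own
PACKAGES = ["numpy", "pandas", "keras", "torch", "spacy", "gensim"]

@pytest.mark.parametrize("name", PACKAGES)
def test_package_importable(name):
    importlib.import_module(name)
```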
This is needed so that we can write to the Client Database faster. The user gets a GCC error:
```
tbradberry@6016a8c58149:~/repos/ltv$ pip install --user pyodbc
Collecting pyodbc
Using cached https://files.pythonhosted.org/packages/0f/aa/733a4326bfdef7deff954aa109ded6acf29d802a91fd87eedf6fc46fd91c/pyodbc-4.0.25.tar.gz
Building wheels for collected packages: pyodbc
Building wheel for pyodbc (setup.py) ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-_c1h_l1r/pyodbc/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-rnpp1ziq --python-tag cp35:
running bdist_wheel
running build
running build_ext
building 'pyodbc' extension
creating build
creating build/temp.linux-x86_64-3.5
creating build/temp.linux-x86_64-3.5/src
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPYODBC_VERSION=4.0.25 -I/usr/include/python3.5m -c src/buffer.cpp -o build/temp.linux-x86_64-3.5/src/buffer.o -Wno-write-strings
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
In file included from src/buffer.cpp:12:0:
src/pyodbc.h:56:17: fatal error: sql.h: No such file or directory
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Failed building wheel for pyodbc
Running setup.py clean for pyodbc
Failed to build pyodbc
Installing collected packages: pyodbc
Running setup.py install for pyodbc ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-_c1h_l1r/pyodbc/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-nl__x2ob/install-record.txt --single-version-externally-managed --compile --user --prefix=:
running install
running build
running build_ext
building 'pyodbc' extension
creating build
creating build/temp.linux-x86_64-3.5
creating build/temp.linux-x86_64-3.5/src
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPYODBC_VERSION=4.0.25 -I/usr/include/python3.5m -c src/buffer.cpp -o build/temp.linux-x86_64-3.5/src/buffer.o -Wno-write-strings
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
In file included from src/buffer.cpp:12:0:
src/pyodbc.h:56:17: fatal error: sql.h: No such file or directory
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;file='/tmp/pip-install-_c1h_l1r/pyodbc/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-record-nl__x2ob/install-record.txt
For the time being this is a placeholder for setting up tensorboard sessions for users with running jupyter notebook containers. The simplest thing to do is just set up a similar port scenario (e.g. 6006 for prod, 6007 for dev, and then 6008-6017 for the non-gpu sessions). I don't know if it's overkill or not, and we definitely have to contend with the GPU resource issues. Perhaps we pre-expose the port so that power users can start up their own tensorboard session from the CLI.
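For example (a sketch; the port pairing and log directory are hypothetical):

```sh
# launch with the notebook port and a tensorboard port pre-exposed
nvidia-docker run -p 8888:8888 -p 6006:6006 eri_dev:latest
# then, from the cli inside the container:
tensorboard --logdir /cache/tensorboard --port 6006
```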
In order to install SMAC3 (a state-of-the-art hyperparameter optimizer), I need build-essential and swig to be added to the docker image. They have to be installed with apt-get, not pip. See https://automl.github.io/SMAC3/stable/installation.html.
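The image change would be something like this (the pip package name `smac` is per SMAC3's install docs; treat it as an assumption):

```dockerfile
# SMAC3 needs a compiler toolchain and swig, which pip cannot provide
RUN apt-get update && apt-get install -y build-essential swig
RUN pip3 install smac
```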
Perhaps `launch.py` can ask for the location of a `requirements.txt` for personal packages that are installed via `pip install --user -r ~/path/to/requirements.txt`? However, this will place some of the onus on users to maintain their own `requirements.txt` file for their project.
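One possible shape for this in `launch.py` (a sketch; the function and its wiring are hypothetical):

```python
import subprocess

def install_user_requirements(container_name, req_path):
    """Install a user's personal packages inside their running container."""
    subprocess.run(
        ["docker", "exec", container_name,
         "pip3", "install", "--user", "-r", req_path],
        check=True,
    )
```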
rsample
recipes
caret
keras
h2o
broom
devtools
usethis
psych
ranger
lubridate
reticulate
ModelMetrics
tensorflow
jsonlite
rvest
rmarkdown
knitr
`numba` works fine for CPU. When trying to run an example GPU command, I receive the following:

```
NvvmSupportError: libNVVM cannot be found. Do `conda install cudatoolkit`: library nvvm not found
```
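Once `cudatoolkit` is installed, a quick smoke test (a sketch using numba's CUDA JIT) should confirm whether the GPU path works:

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1.0

a = np.zeros(32, dtype=np.float32)
d_a = cuda.to_device(a)
add_one[1, 32](d_a)            # 1 block of 32 threads
print(d_a.copy_to_host()[:5])  # expect [1. 1. 1. 1. 1.]
```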
Getting an error while trying to load the basic English model (module):

```python
nlp = spacy.load('en')
```
Output:
```
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-21-76abff010c5f> in <module>()
----> 1 nlp = spacy.load('en')
/usr/local/lib/python3.5/dist-packages/spacy/__init__.py in load(name, **overrides)
13 if depr_path not in (True, False, None):
14 deprecation_warning(Warnings.W001.format(path=depr_path))
---> 15 return util.load_model(name, **overrides)
16
17
/usr/local/lib/python3.5/dist-packages/spacy/util.py in load_model(name, **overrides)
117 elif hasattr(name, 'exists'): # Path or Path-like to model data
118 return load_model_from_path(name, **overrides)
--> 119 raise IOError(Errors.E050.format(name=name))
120
121
OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
```
Seems like `python -m spacy download en` will solve the problem. See explosion/spaCy#1721 or similar github issues.
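If that fix holds, it could be baked into the image so users don't hit this (a one-line assumption on top of the existing spaCy install):

```dockerfile
# download the small English model at build time so spacy.load('en') works
RUN python3 -m spacy download en
```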
In order to get rid of all this `screen` business, I propose configuring the launch app to be served by apache. Because of docker permissions, apache will need to be configured such that the process runs as an appropriately privileged user.
Because the images for dev, nogpu_dev, and prod are all identical, the script that collects the particular type of image via simple docker commands cannot tell which type was actually selected to launch that image. For example, if you launched an image of type `eri_nogpu_dev`, the tags associated with that image are `['eri_dev:latest', 'eri_nogpu_dev:latest', 'eri_prod:latest']`. To make the list usable, we need a way to disambiguate them; currently we will always see only `eri_dev:latest` as the tag (and hence an `eri_gpu` type).
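One way to disambiguate (a sketch using docker-py; the label name is made up) is to record the chosen type as a container label at launch time and read that back instead of inspecting image tags:

```python
import docker

client = docker.from_env()

# at launch: record which image type the user actually picked
container = client.containers.run(
    "eri_nogpu_dev:latest",
    detach=True,
    labels={"eri.image_type": "eri_nogpu_dev"},
)

# in active_eri_images: report the recorded type, not the tag list
for c in client.containers.list():
    print(c.name, c.labels.get("eri.image_type"))
```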
I'm not sure if this is a Joseph problem or a me problem (as in, what should I expect?).
The URL eri-gpu:8888 is not valid when connected via VPN; it needs the FQDN (eri-gpu.cho.elderresearch.com).
When running the following code:

```python
from keras.utils import plot_model
plot_model(model, to_file='model.png')
```
I receive the following error:

```
OSError: pydot failed to call GraphViz. Please install GraphViz (https://www.graphviz.org/) and ensure that its executables are in the $PATH.
```
From what I've read, GraphViz has to be installed with `apt-get`, not just with `pip`.
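The image change would presumably be:

```dockerfile
# the graphviz system package provides the `dot` executable pydot shells out to
RUN apt-get update && apt-get install -y graphviz \
 && pip3 install pydot
```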
In somewhat limited experience, I've found that one of the most important features of the DL libraries has been the caching / model checkpointing components. For example, shipping a trained model requires, at some point, persisting at least the weights, if not a pickled representation of the entire model. In practice I've always done this by creating a system-wide cache location a la `/cache` or `/var/cache/eri`, though obviously the exact path isn't that important. I would argue this is not stuff we should be putting in the `/data` directory, simply on the basis of separation of concerns -- what we are putting there are artifacts and products rather than inputs. For space constraint reasons, I propose we basically repeat the `/data` setup with a new directory for caching results.
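To make the intent concrete, a checkpointing sketch against the proposed path (assuming a compiled keras `model` and training arrays already exist; the project subdirectory is illustrative):

```python
from keras.callbacks import ModelCheckpoint

# persist weights per epoch under the proposed system-wide cache location
ckpt = ModelCheckpoint("/cache/my_project/weights.{epoch:02d}.h5",
                       save_weights_only=True)
model.fit(x_train, y_train, epochs=10, callbacks=[ckpt])
```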
Upgrading to CUDA 10 requires changing our driver version. See this answer on the nvidia-docker repo NVIDIA/nvidia-docker#829 (comment)
being used by lkent for her bench project
Users may need command line access to git. Install git for the deep-learning images.
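Presumably just:

```dockerfile
RUN apt-get update && apt-get install -y git
```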
Apache MXNet is highly scalable, has better CPU and memory utilization, and is written directly in C++. In most benchmarking papers I've read, MXNet beats TensorFlow, PyTorch, and Caffe2 in terms of performance. The API looks very much like PyTorch.
```
/usr/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
make: *** [libdarknet.so] Error 1
```

Apparently, the libcuda library can't be found.
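One common workaround (an assumption; the stub path depends on where CUDA is installed) is to point the linker at the driver stub that ships with the toolkit:

```sh
# libcuda.so normally comes from the driver; the toolkit's stub can
# satisfy the linker at build time
export LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LIBRARY_PATH
make
```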
I've forgotten the password for my no_gpu instance, please kill it
`tensorflow` has separated out the examples and tutorials files used throughout its documentation into a single separately-versioned github repository. It is sizable enough that it doesn't make sense for individual users to keep their own copies, but also important enough that multiple users will likely want it at some point. This would be a prime target for something that belongs in a `/shared` directory. This may be part of a bigger project in which we rename the `/data` directory to `/shared` and then start creating conceptually isolated subdirectories (e.g. `/shared/data`, `/shared/documentation`).
On the ERI-GPU front-end, any user can click on any launched "rstudio url" and make changes to the running rstudio environment
Containers have Python 3.5.
Title says it all.
Supposedly, `numba`'s GPU support was sufficient to significantly accelerate network graph operations in the latest Belair HPC deliverable. It would be nice to have it available to experiment with.
This could easily be done right now using `cron`, but probably belongs in some more formal CI solution.
Do you want to request a feature or report a bug?
Bug
What is the current behavior?
Any user can click on a running "rstudio url", view the contents of the rstudio session, and make changes.
If the current behavior is a bug, please provide the steps to reproduce and if possible a minimal demo of the problem.
What is the expected behavior?
RStudio URLs should be password protected the same way the jupyter notebooks are.
What is the motivation / use case for changing the behavior?
security
Other information (e.g. detailed explanation, stacktraces, related issues, suggestions how to fix, links for us to have context, eg. stackoverflow, gitter, etc)
future
digest
globals
listenv
parallel
I'm interested in being able to launch an RStudio non-GPU instance to leverage the machine's cores/RAM. At this time, it looks like only jupyter notebook is available. For the R-heads out there.
Currently we are unable to isolate GPUs in our containers. This is done relatively easily with the `nvidia` docker container images or things which have inherited from those (such as `tensorflow`'s gpu image), e.g.

```sh
NV_GPU=1 nvidia-docker run --rm -it -p 8888:8888 tensorflow/tensorflow:latest-gpu nvidia-smi
```

We need to be able to do the same.
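If the images have the nvidia runtime hooks available (an assumption), the plain-docker equivalent would be to mask devices with the environment variable the runtime honors:

```sh
# only GPU 1 is visible inside the container
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 \
    --rm -it -p 8888:8888 tensorflow/tensorflow:latest-gpu nvidia-smi
```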
This has not been resolved; we are getting basically this same error. It appears that the MS SQL ODBC drivers are not installed on the container. FYI, I have installed them on the root machine because we needed it ASAP, but that has not worked.
Originally posted by @oersoy1 in #52 (comment)
Success is:
So we will have 3 containers that have GPU access, available for training models.
This dev/prod split was a good idea, but in the end it did not matter. Let me know what you think.
My thinking is that if you are learning or do not have a lot of data to train on, use the single-GPU instances if they are available. If you are working with a lot of data or client data, use the multiple-GPU instance and run the algorithm on 2 GPUs.
We could also have a 4-GPU instance for really serious work, i.e. mountains of text data. But that would mean no one else can work on it. We can say you need to get permission from the admins to make sure that will be OK.
Do you want to request a feature or report a bug?
A Bug
What is the current behavior?
Even though the user creates a container with one or more GPUs attached, the DL framework (whether it is TF or PyTorch) does not see the GPU. We have duplicated this problem in both frameworks. The user tries to train a model on the GPU and all CPU cores are utilized instead.
If the current behavior is a bug, please provide the steps to reproduce and if possible a minimal demo of the problem.
```python
import torch
torch.cuda.device_count()
```

returns 0
What is the expected behavior?
It should return the number of attached GPUs; in this case my container has 2, so it should return 2.
What is the motivation / use case for changing the behavior?
Other information (e.g. detailed explanation, stacktraces, related issues, suggestions how to fix, links for us to have context, eg. stackoverflow, gitter, etc)
This may be related to the new CUDA 10 upgrade. When I installed the new GPU cards on the machine 2 months ago, I tested the frameworks and everything worked fine.
See here for more information on the Pytorch multi gpu processing.
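A quick way to check the CUDA 10 theory (a sketch) is to compare what pytorch was built against with what the container can see:

```python
import torch

print(torch.__version__)          # installed pytorch build
print(torch.version.cuda)         # CUDA version pytorch was compiled against
print(torch.cuda.is_available())  # False on a driver/runtime mismatch
print(torch.cuda.device_count())  # should match the GPUs attached (here, 2)
```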
our goal is to have one area where all users can store data in a shared space, and to administer that volume
Relevant error here:
```
Epoch number 0
Current loss 2.129094362258911
Epoch number 1
Current loss 1.1615623235702515
Epoch number 2
Current loss 1.1757559776306152
Epoch number 3
Current loss 1.1106539964675903
Epoch number 4
Current loss 1.1064958572387695
Epoch number 5
Current loss 1.1496989727020264
Epoch number 6
Current loss 1.1117830276489258
Epoch number 7
Current loss 1.1182820796966553
Epoch number 8
Current loss 1.109954833984375
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-48-b071e9085377> in <module>()
1 for epoch in range(0, Config.train_number_epochs):
----> 2 for i, data in enumerate(train_dataloader,0):
3 img0, img1, label = data
4 img0, img1, label = img0.cuda(), img1.cuda(), label.cuda()
5 optimizer.zero_grad()
/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py in __next__(self)
278 while True:
279 assert (not self.shutdown and self.batches_outstanding > 0)
--> 280 idx, batch = self._get_batch()
281 self.batches_outstanding -= 1
282 if idx != self.rcvd_idx:
/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py in _get_batch(self)
257 raise RuntimeError('DataLoader timed out after {} seconds'.format(self.timeout))
258 else:
--> 259 return self.data_queue.get()
260
261 def __next__(self):
/usr/lib/python3.5/multiprocessing/queues.py in get(self)
341 def get(self):
342 with self._rlock:
--> 343 res = self._reader.recv_bytes()
344 # unserialize the data after having released the lock
345 return ForkingPickler.loads(res)
/usr/lib/python3.5/multiprocessing/connection.py in recv_bytes(self, maxlength)
214 if maxlength is not None and maxlength < 0:
215 raise ValueError("negative maxlength")
--> 216 buf = self._recv_bytes(maxlength)
217 if buf is None:
218 self._bad_message_length()
/usr/lib/python3.5/multiprocessing/connection.py in _recv_bytes(self, maxsize)
405
406 def _recv_bytes(self, maxsize=None):
--> 407 buf = self._recv(4)
408 size, = struct.unpack("!i", buf.getvalue())
409 if maxsize is not None and size > maxsize:
/usr/lib/python3.5/multiprocessing/connection.py in _recv(self, size, read)
377 remaining = size
378 while remaining > 0:
--> 379 chunk = read(handle, remaining)
380 n = len(chunk)
381 if n == 0:
/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py in handler(signum, frame)
176 # This following call uses `waitid` with WNOHANG from C side. Therefore,
177 # Python can still get and update the process status successfully.
--> 178 _error_if_any_worker_fails()
179 if previous_handler is not None:
180 previous_handler(signum, frame)
RuntimeError: DataLoader worker (pid 451) is killed by signal: Bus error.
```
Related issue on pytorch: pytorch/pytorch#2244. It appears that we need to increase the shared memory size when launching the containers by passing `--shm-size=8G` to `docker run`.
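For example (sketching against one of our image names):

```sh
nvidia-docker run --shm-size=8G -p 8888:8888 eri_dev:latest
```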
Can we get the OpenCV library (`cv2`) for Python added?
Can’t seem to create any files in Jupyter — “Permission denied: Untitled.ipynb” (`eri_nogpu_dev:latest`). For the same user (`tshafer`), other docker images work fine.
`pip install -U feather-format`
When a user launches a container, it would be good if we could have a user-level directory in which their notebooks live, perhaps in their home directory, and for their `jupyter notebook` server to start with that directory as its working directory. This is something we probably need to have at the container level, so it might need to be parameterized in the image itself, I think.
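One way to parameterize this (a sketch for `jupyter_notebook_config.py`; the directory name is illustrative):

```python
import os

# start each user's server in a per-user notebooks directory, assuming
# their home directory is mounted into the container
c.NotebookApp.notebook_dir = os.path.expanduser("~/notebooks")
```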
The default shell is currently `sh` for all users. Change the default shell to `bash`.
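In the image this would be something like (username illustrative):

```dockerfile
# create users with bash as their login shell...
RUN useradd -m -s /bin/bash someuser
# ...or switch an existing user
RUN chsh -s /bin/bash someuser
```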