
gpu_docker's Issues

Create repeatable hash of password for connecting to remote kernel from Atom

Atom Hydrogen only has options to pass a session token or cookie when connecting to a remote kernel. The IPython.passwd() function adds a random salt to the sha1 hash, making it non-repeatable, so the Hydrogen config params must change with each new session.

A workaround is to create our own token by using a hash of the password and a predetermined salt.

import hashlib
import os

# a predetermined salt (the notebook port) makes the hash repeatable
salt = str(os.getenv('PORT', '8888'))
h = hashlib.new('sha1')
h.update(password.encode() + salt.encode())  # password: the user's chosen password
c.NotebookApp.token = h.hexdigest()  # `c` is the Jupyter config object

Install Darknet

The creator of YOLO has a C-based, GPU-capable deep neural network framework called Darknet, available at https://pjreddie.com/darknet/. It would be nice to have it on the GPU box to run object detection in addition to PyTorch, Keras, etc. This would likely be directly applicable to Project Iris.

Install ONNX

Open Neural Network Exchange (ONNX) is an important standardization effort that allows models to be exchanged (imported/exported) between frameworks. As far as I can tell, most framework developers are going in this direction and support the idea. I believe all of the major frameworks except TensorFlow support ONNX.

https://github.com/onnx/onnx
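
To illustrate the import/export idea, here is a minimal sketch of exporting a model to ONNX from PyTorch (assuming torch and torchvision are installed; the model and input shape are arbitrary examples):

import torch
import torchvision

# any trained model works; resnet18 is just an example
model = torchvision.models.resnet18(pretrained=True)
model.eval()

# ONNX export traces the model with a dummy input of the right shape
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet18.onnx")

The resulting resnet18.onnx file can then be imported by any other ONNX-capable framework or runtime.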

install `tensorflow-hub`

TensorFlow Hub is a library of reusable, pretrained machine learning modules that can be loaded directly into TensorFlow programs.

I'm not sure that this is strictly necessary. I will be investigating independently today to form an opinion; I'd welcome anyone else's experience using tf hub.
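
For reference, a minimal sketch of the TF1-era hub API (the module URL is a real published module, used here purely as an example):

import tensorflow as tf
import tensorflow_hub as hub

# load a pretrained sentence encoder from tfhub.dev
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
embeddings = embed(["docker containers are great"])

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(embeddings).shape)  # (1, 512)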

Add text mining libraries to the image(s)

We need to include nltk, SpaCy, and gensim in the python environments.

Justification: most DNN frameworks do not include fast text-mining data prep tools, and their developers recommend using one or a combination of the above libraries.

docker containers launched without explicit tag not being listed as active sessions

astewart observed: a running docker container for eri_dev was not appearing in the list of currently active sessions in launch.py or on the launch app.

quick review indicated this occurred because the active_eri_images function in launch.py was looking for eri_dev:latest, but the image itself was launched (from the cli) as just eri_dev -- the :latest was implicit.

easy fix
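
One way to implement the fix (a sketch only; the internals of active_eri_images are not reproduced here): normalize image names before comparing, so an implicit :latest still matches.

def normalize_tag(image_name):
    """Append ':latest' when no explicit tag was given."""
    return image_name if ":" in image_name else image_name + ":latest"

assert normalize_tag("eri_dev") == "eri_dev:latest"
assert normalize_tag("eri_dev:latest") == "eri_dev:latest"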

figure out testing process

I've put together bare-minimum tests for each image (asserting python packages are installed, basically); we should figure out how to run these automatically, either on image creation or afterwards as part of a CI process.
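
A minimal sketch of what such a test could look like (assuming pytest; the package list is hypothetical and would differ per image):

import importlib

import pytest

PACKAGES = ["numpy", "torch", "keras"]  # hypothetical per-image list

@pytest.mark.parametrize("name", PACKAGES)
def test_package_importable(name):
    # the bare-minimum assertion: the package can be imported at all
    importlib.import_module(name)

These could run inside a freshly built container, e.g. by invoking pytest against a mounted tests directory right after each build.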

Install pyodbc on the GPU box containers

This is needed so that we can write to the Client Database faster. The user gets a GCC error when pip builds the wheel:

tbradberry@6016a8c58149:~/repos/ltv$ pip install --user pyodbc
Collecting pyodbc
Using cached https://files.pythonhosted.org/packages/0f/aa/733a4326bfdef7deff954aa109ded6acf29d802a91fd87eedf6fc46fd91c/pyodbc-4.0.25.tar.gz
Building wheels for collected packages: pyodbc
Building wheel for pyodbc (setup.py) ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-_c1h_l1r/pyodbc/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-rnpp1ziq --python-tag cp35:
running bdist_wheel
running build
running build_ext
building 'pyodbc' extension
creating build
creating build/temp.linux-x86_64-3.5
creating build/temp.linux-x86_64-3.5/src
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPYODBC_VERSION=4.0.25 -I/usr/include/python3.5m -c src/buffer.cpp -o build/temp.linux-x86_64-3.5/src/buffer.o -Wno-write-strings
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
In file included from src/buffer.cpp:12:0:
src/pyodbc.h:56:17: fatal error: sql.h: No such file or directory
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1


Failed building wheel for pyodbc
Running setup.py clean for pyodbc
Failed to build pyodbc
Installing collected packages: pyodbc
Running setup.py install for pyodbc ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-_c1h_l1r/pyodbc/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-nl__x2ob/install-record.txt --single-version-externally-managed --compile --user --prefix=:
running install
running build
running build_ext
building 'pyodbc' extension
creating build
creating build/temp.linux-x86_64-3.5
creating build/temp.linux-x86_64-3.5/src
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPYODBC_VERSION=4.0.25 -I/usr/include/python3.5m -c src/buffer.cpp -o build/temp.linux-x86_64-3.5/src/buffer.o -Wno-write-strings
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
In file included from src/buffer.cpp:12:0:
src/pyodbc.h:56:17: fatal error: sql.h: No such file or directory
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1


Command "/usr/bin/python3 -u -c "import setuptools, tokenize;file='/tmp/pip-install-_c1h_l1r/pyodbc/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-record-nl__x2ob/install-record.txt

allow users to establish tensorboard sessions

for the time being this is a placeholder for setting up tensorboard sessions for users with running jupyter notebook containers. the simplest thing to do is just set up a similar port scenario (e.g. 6006 for prod, 6007 for dev, and then 6008-6017 for the non-gpu sessions). I don't know if that's overkill or not, and we definitely have to contend with the gpu resource issues. perhaps we pre-expose the port so that power users can start up their own tensorboard session from the cli.
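
A sketch of the port idea using the docker python SDK (the image name and host port are assumptions; the CLI equivalent is -p 6007:6006):

import docker

client = docker.from_env()
# map the container's tensorboard port (6006) to a per-session host port
client.containers.run(
    "eri_dev:latest",
    detach=True,
    ports={"6006/tcp": 6007},
)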

control personal python environments

Perhaps launch.py can ask for the location of a requirements.txt of personal packages, to be installed via pip install --user -r ~/path/to/requirements.txt? However, this places some of the onus on users to maintain their own requirements.txt file for their project.
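
A minimal sketch of what that could look like in launch.py (the prompt wording is an assumption):

import subprocess

req = input("Path to a personal requirements.txt (blank to skip): ").strip()
if req:
    # install into the user's site-packages, not the shared environment
    subprocess.run(["pip", "install", "--user", "-r", req], check=True)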

R package install requests

rsample
recipes
caret
keras
h2o
broom
devtools
usethis
psych
ranger
lubridate
reticulate
ModelMetrics
tensorflow
jsonlite
rvest
rmarkdown
knitr

Install Python cudatoolkit

numba works fine for CPU. When trying to run an example GPU command, I receive the following:

NvvmSupportError: libNVVM cannot be found. Do conda install cudatoolkit: library nvvm not found
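
For reference, a minimal CUDA kernel of the kind that triggers this error when libNVVM is missing (the exact command the user ran is not recorded; this is a stand-in example):

import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1

arr = np.zeros(16)
add_one[1, 32](arr)  # raises NvvmSupportError without cudatoolkit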

SpaCy package broken

Getting an error while trying to load the basic English model:

nlp = spacy.load('en')

Output:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-21-76abff010c5f> in <module>()
----> 1 nlp = spacy.load('en')

/usr/local/lib/python3.5/dist-packages/spacy/__init__.py in load(name, **overrides)
     13     if depr_path not in (True, False, None):
     14         deprecation_warning(Warnings.W001.format(path=depr_path))
---> 15     return util.load_model(name, **overrides)
     16 
     17 

/usr/local/lib/python3.5/dist-packages/spacy/util.py in load_model(name, **overrides)
    117     elif hasattr(name, 'exists'):  # Path or Path-like to model data
    118         return load_model_from_path(name, **overrides)
--> 119     raise IOError(Errors.E050.format(name=name))
    120 
    121 

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Seems like

python -m spacy download en

will solve the problem.

See:
explosion/spaCy#1721
or similar github issues

Serve the launchapp with Apache

In order to get rid of all this screen business, I propose configuring the launchapp to be served by apache. Because of docker permissions, apache will need to be configured such that the process runs as an appropriately privileged user.

current active image list is not correctly resolving the image type

because the images for dev, nogpu_dev, and prod are all identical, the script which is collecting the particular type of image via simple docker commands cannot tell which type was actually selected to launch that image.

for example, if you launched an image of type eri_nogpu_dev, the tags associated with that image are ['eri_dev:latest', 'eri_nogpu_dev:latest', 'eri_prod:latest'].

to make the list usable, we need a way to disambiguate them. currently we will always see only eri_dev:latest as the tag (and hence an eri_gpu type).
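
One possible disambiguation (a sketch, not the current implementation): record the selected type as a container label at launch time and read the label back instead of guessing from image tags. The label key here is hypothetical.

import docker

client = docker.from_env()

# at launch time, stamp the container with the type the user selected
container = client.containers.run(
    "eri_nogpu_dev:latest",
    detach=True,
    labels={"eri_image_type": "eri_nogpu_dev"},  # hypothetical label key
)

# later, list sessions by label rather than by image tag
for c in client.containers.list():
    print(c.name, c.labels.get("eri_image_type", "unknown"))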

Keras visualization not working

When running the following code:

from keras.utils import plot_model
plot_model(model, to_file='model.png')

I receive the following error:
OSError: pydot failed to call GraphViz.Please install GraphViz (https://www.graphviz.org/) and ensure that its executables are in the $PATH.

From what I've read, GraphViz has to be installed system-wide with apt-get (apt-get install graphviz), not just the pydot bindings with pip.

cache directory persistence in docker containers

in somewhat limited experience, I've found that one of the most important features of the dl libraries has been the caching / model checkpointing components. for example, shipping a trained model requires, at some point, persisting at least the weights, if not a pickled repr. of the entire model.

in practice I've always done this by creating a system-wide cache location a la /cache, or /var/cache/eri, though obviously the exact path isn't that important.

I would argue this is not stuff we should be putting in the /data directory, simply on the basis of separation of concerns -- what we are putting here are artifacts and products rather than inputs -- and for space constraint reasons.

I propose we basically repeat the /data setup with a new directory for caching results

Install git

Users may need command line access to git. Install git for the deep-learning images.

Install a new DL framework: MXNET

Apache MXNET is highly scalable, has better CPU and memory utilization, and is written directly in C++.

In most benchmarking papers I have read, MXNET beats tensorflow, pytorch, and caffe2 in terms of performance. The API looks very much like PyTorch.

https://mxnet.apache.org/
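
A small taste of the ndarray API, for comparison with PyTorch -- a sketch assuming mxnet is pip-installed:

import mxnet as mx
from mxnet import nd

# the ndarray API mirrors torch tensors fairly closely
x = nd.random.uniform(shape=(2, 3), ctx=mx.cpu())
y = x * 2 + 1
print(y.asnumpy())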

Kill Instance

I've forgotten the password for my no_gpu instance, please kill it

pre-install tensorflow models

tensorflow has separated out the examples and tutorials used throughout its documentation into a single separately-versioned github repository (https://github.com/tensorflow/models). it is sizable enough that it doesn't make sense for individual users to each keep a copy, but also important enough that multiple users will likely want it at some point.

this would be a prime target for something that belongs in a /shared directory. this may be part of a bigger project in which we rename the /data directory to /shared and then start creating conceptually isolated subdirectories (e.g. /shared/data, /shared/documentation)

Install Python numba for GPU

Supposedly, numba's GPU support was sufficient to significantly accelerate network graph operations in the latest Belair HPC deliverable. It would be nice to have it available to experiment with.

Password Protect Rstudio Environments

  • Do you want to request a feature or report a bug?
    Bug

  • What is the current behavior?
    Any user can click on a running "rstudio url", view the contents of the rstudio session, and make changes

  • If the current behavior is a bug, please provide the steps to reproduce and if possible a minimal demo of the problem.

  1. Go to the running ERI-GPU webpage
  2. Click on any running rstudio url
  3. Delete everything from environment and laugh
  • What is the expected behavior?
    Rstudio urls should be password protected the same way the jupyter notebooks are

  • What is the motivation / use case for changing the behavior?
    security

  • Other information (e.g. detailed explanation, stacktraces, related issues, suggestions how to fix, links for us to have context, eg. stackoverflow, gitter, etc)

Add R-Studio Server to non-gpu container

I'm interested in being able to launch an R-Studio non-gpu instance to leverage the machine's cores/RAM. At this time, it looks like only jupyter notebook is available.

isolate gpus in images

currently we are unable to isolate GPUs in our containers. this is done relatively easily with the nvidia docker container images or images derived from them (such as tensorflow's gpu image), e.g.

NV_GPU=1 nvidia-docker run --rm -it -p 8888:8888 tensorflow/tensorflow:latest-gpu nvidia-smi

we need to be able to do the same

MS SQL Server Driver not found

This has not been resolved. We are getting basically this same error:

https://stackoverflow.com/questions/44527452/cant-open-lib-odbc-driver-13-for-sql-server-sym-linking-issue

It appears like MS SQL ODBC drivers are not installed on the container.

https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-2017

FYI.. I have installed them on the root machine because we needed this ASAP, but that has not worked.

Originally posted by @oersoy1 in #52 (comment)

Reconfigure launch app for new GPU cards

Success is:

  1. one container pointing to 2 cards (what's installed already), call it Mult_GPU or something like that
  2. one container pointing to the 3rd GPU, call it GPU_3 or something like that
  3. one container pointing to the 4th GPU, call it GPU_4 or whatever...

So we will have 3 containers that have GPU access, available for training models.
This dev/prod split was a good idea but in the end did not matter. lmk what you think
My thinking is that if you are learning or do not have a lot of data to train on, use the single-GPU instances if they are available. If you are working with a lot of data or client data, use the multiple-GPU instance and run the algorithm on 2 GPUs.

We could also have a 4 GPU instance for really serious work, i.e. mountains of text data. But that would mean no one else can work on it. But we can say you need to get permission from the admins to make sure that will be OK.

Containers do not see GPUs

  • Do you want to request a feature or report a bug?
    A Bug

  • What is the current behavior?
    Even though the user creates a container by attaching one or more GPUs, the DL framework (whether it is TF or Pytorch) does not see the GPU. We have reproduced this problem in both frameworks. The user tries to train a model on the GPU and all CPU cores are utilized instead.

  • If the current behavior is a bug, please provide the steps to reproduce and if possible a minimal demo of the problem.

import torch
torch.cuda.device_count()

returns 0

  • What is the expected behavior?
    In this case my container has 2 GPUs attached, so it should return 2.

  • What is the motivation / use case for changing the behavior?

  • Other information (e.g. detailed explanation, stacktraces, related issues, suggestions how to fix, links for us to have context, eg. stackoverflow, gitter, etc)
    This may be related to the new CUDA 10 upgrade. When I installed the new GPU cards on the machine 2 months ago, I tested the frameworks and everything worked fine.

See here for more information on PyTorch multi-GPU processing.

Bus error (DataLoader worker killed) when using PyTorch in docker

Relevant error here:

Epoch number 0
 Current loss 2.129094362258911

Epoch number 1
 Current loss 1.1615623235702515

Epoch number 2
 Current loss 1.1757559776306152

Epoch number 3
 Current loss 1.1106539964675903

Epoch number 4
 Current loss 1.1064958572387695

Epoch number 5
 Current loss 1.1496989727020264

Epoch number 6
 Current loss 1.1117830276489258

Epoch number 7
 Current loss 1.1182820796966553

Epoch number 8
 Current loss 1.109954833984375

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-48-b071e9085377> in <module>()
      1 for epoch in range(0, Config.train_number_epochs):
----> 2     for i, data in enumerate(train_dataloader,0):
      3         img0, img1, label = data
      4         img0, img1, label = img0.cuda(), img1.cuda(), label.cuda()
      5         optimizer.zero_grad()

/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    278         while True:
    279             assert (not self.shutdown and self.batches_outstanding > 0)
--> 280             idx, batch = self._get_batch()
    281             self.batches_outstanding -= 1
    282             if idx != self.rcvd_idx:

/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py in _get_batch(self)
    257                 raise RuntimeError('DataLoader timed out after {} seconds'.format(self.timeout))
    258         else:
--> 259             return self.data_queue.get()
    260 
    261     def __next__(self):

/usr/lib/python3.5/multiprocessing/queues.py in get(self)
    341     def get(self):
    342         with self._rlock:
--> 343             res = self._reader.recv_bytes()
    344         # unserialize the data after having released the lock
    345         return ForkingPickler.loads(res)

/usr/lib/python3.5/multiprocessing/connection.py in recv_bytes(self, maxlength)
    214         if maxlength is not None and maxlength < 0:
    215             raise ValueError("negative maxlength")
--> 216         buf = self._recv_bytes(maxlength)
    217         if buf is None:
    218             self._bad_message_length()

/usr/lib/python3.5/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    405 
    406     def _recv_bytes(self, maxsize=None):
--> 407         buf = self._recv(4)
    408         size, = struct.unpack("!i", buf.getvalue())
    409         if maxsize is not None and size > maxsize:

/usr/lib/python3.5/multiprocessing/connection.py in _recv(self, size, read)
    377         remaining = size
    378         while remaining > 0:
--> 379             chunk = read(handle, remaining)
    380             n = len(chunk)
    381             if n == 0:

/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py in handler(signum, frame)
    176         # This following call uses `waitid` with WNOHANG from C side. Therefore,
    177         # Python can still get and update the process status successfully.
--> 178         _error_if_any_worker_fails()
    179         if previous_handler is not None:
    180             previous_handler(signum, frame)

RuntimeError: DataLoader worker (pid 451) is killed by signal: Bus error.

Related issue on pytorch pytorch/pytorch#2244.

It appears that we need to increase the shared memory size when launching the containers by passing --shm-size=8G to docker run.
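
If the containers are launched through the docker python SDK, the same fix would look roughly like this (the image name is an assumption):

import docker

client = docker.from_env()
# equivalent to `docker run --shm-size=8G ...`
client.containers.run("eri_dev:latest", detach=True, shm_size="8G")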

Permission denied err on eri_nogpu_dev

Can’t seem to create any files in Jupyter — “Permission denied: Untitled.ipynb” (eri_nogpu_dev:latest)

Other docker images work fine for the same user (tshafer).

create isolated landing directory for jupyter notebooks on a per-user basis

when a user launches a container, it would be good if we could have a user-level directory in which their notebooks live, perhaps in their home directory, and for their jupyter notebook server to start with that directory as its working directory.

this is something we probably need to handle at the container level, so I think it might need to be parameterized in the image itself.
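
A hypothetical sketch of the container-level piece, in jupyter_notebook_config.py (the env var name and path layout are assumptions):

import os

# start each user's notebook server in an isolated per-user directory
user = os.getenv("NB_USER", "jovyan")
c.NotebookApp.notebook_dir = os.path.join("/home", user, "notebooks")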
