elderresearch / gpu_docker
Dockerfiles for our DC office GPU development machine.
Atom Hydrogen only has options to pass a session token or cookie when connecting to a remote kernel. The `IPython.passwd()` function adds a random salt to the SHA-1 hash, making it non-repeatable, so the Hydrogen config params would have to change with each new session. A workaround is to create our own token by hashing the password with a predetermined salt:
```python
import hashlib
import os

# In jupyter_notebook_config.py: `c` is the Jupyter config object and
# `password` is assumed to hold the user's plaintext password.
salt = str(os.getenv('PORT', 8888))
h = hashlib.new('sha1')
h.update(password.encode() + salt.encode())
c.NotebookApp.token = h.hexdigest()
```
To duplicate the issue:

This is more a system issue than a GPU issue, but this feels like the best place to keep track of it. Proof: `docker exec` into any running session and check `groups`.
The creator of YOLO has a C-based, GPU-capable deep neural network framework called Darknet, available at https://pjreddie.com/darknet/. It would be nice to have it on the GPU box to run object detection in addition to PyTorch, Keras, etc. This would likely be directly applicable to Project Iris.
Open Neural Network Exchange (ONNX) is an important standardization effort that allows models to be exchanged (imported/exported) between frameworks. As far as I can tell, most framework developers are going in this direction and support the idea. I believe all but TensorFlow support ONNX.
package description here
I'm not sure that this is strictly necessary. I will be investigating independently today to form an opinion. I'd welcome anyone else's experience using TF Hub.
The end goal is to have a button each user clicks to create a persistent container.
We need to include nltk, spaCy, and gensim in the Python environments. Justification: most DNN frameworks do not include fast text-mining data-prep tools, and their developers recommend using one or a combination of the above libraries.
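A minimal sketch of the image change (assuming the images install Python packages via pip3, as the other issues here suggest):

```dockerfile
# text-mining data-prep libraries recommended alongside the DNN frameworks
RUN pip3 install nltk spacy gensim
```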
astewart observed: a running docker image for `eri_dev` was not appearing in the list of currently active sessions in `launch.py` or on the launch app. A quick review indicated this occurred because the `active_eri_images` function in `launch.py` was looking for `eri_dev:latest`, but the image itself was launched (from the CLI) as just `eri_dev` -- the `:latest` was implicit. Easy fix.
I've put together bare-minimum tests for each image (basically asserting that Python packages are installed); we should figure out a way to run these automatically, either on image creation or afterwards as part of a CI process.
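As a sketch of what these bare-minimum tests could look like (the package list and paths are illustrative, not what's actually in the repo):

```python
# test_packages.py -- run inside a container, e.g.:
#   docker run --rm eri_dev:latest python3 -m pytest /tests/test_packages.py
import importlib

import pytest

# illustrative package list; each image would pin its own
PACKAGES = ["numpy", "pandas", "keras", "torch", "spacy", "gensim"]

@pytest.mark.parametrize("name", PACKAGES)
def test_package_importable(name):
    importlib.import_module(name)
```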
This is needed so that we can write to the Client Database faster. The user gets a GCC error:
```
tbradberry@6016a8c58149:~/repos/ltv$ pip install --user pyodbc
Collecting pyodbc
Using cached https://files.pythonhosted.org/packages/0f/aa/733a4326bfdef7deff954aa109ded6acf29d802a91fd87eedf6fc46fd91c/pyodbc-4.0.25.tar.gz
Building wheels for collected packages: pyodbc
Building wheel for pyodbc (setup.py) ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-_c1h_l1r/pyodbc/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-rnpp1ziq --python-tag cp35:
running bdist_wheel
running build
running build_ext
building 'pyodbc' extension
creating build
creating build/temp.linux-x86_64-3.5
creating build/temp.linux-x86_64-3.5/src
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPYODBC_VERSION=4.0.25 -I/usr/include/python3.5m -c src/buffer.cpp -o build/temp.linux-x86_64-3.5/src/buffer.o -Wno-write-strings
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
In file included from src/buffer.cpp:12:0:
src/pyodbc.h:56:17: fatal error: sql.h: No such file or directory
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Failed building wheel for pyodbc
Running setup.py clean for pyodbc
Failed to build pyodbc
Installing collected packages: pyodbc
Running setup.py install for pyodbc ... error
Complete output from command /usr/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-_c1h_l1r/pyodbc/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-record-nl__x2ob/install-record.txt --single-version-externally-managed --compile --user --prefix=:
running install
running build
running build_ext
building 'pyodbc' extension
creating build
creating build/temp.linux-x86_64-3.5
creating build/temp.linux-x86_64-3.5/src
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DPYODBC_VERSION=4.0.25 -I/usr/include/python3.5m -c src/buffer.cpp -o build/temp.linux-x86_64-3.5/src/buffer.o -Wno-write-strings
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
In file included from src/buffer.cpp:12:0:
src/pyodbc.h:56:17: fatal error: sql.h: No such file or directory
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Command "/usr/bin/python3 -u -c "import setuptools, tokenize;file='/tmp/pip-install-_c1h_l1r/pyodbc/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-record-nl__x2ob/install-record.txt
For the time being this is a placeholder for setting up tensorboard sessions for users with running jupyter notebook containers. The simplest thing to do is just set up a similar port scenario (e.g. 6006 for prod, 6007 for dev, and then 6008-6017 for the non-gpu sessions). I don't know if it's overkill or not, and we definitely have to contend with the GPU resource issues. Perhaps we pre-expose the port so that power users can start up their own tensorboard session from the CLI.
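For example (a sketch; the port pairing and log directory are hypothetical):

```sh
# launch with the notebook port and a tensorboard port pre-exposed
nvidia-docker run -p 8888:8888 -p 6006:6006 eri_dev:latest
# then, from the cli inside the container:
tensorboard --logdir /cache/tensorboard --port 6006
```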
In order to install SMAC3 (a state-of-the-art hyperparameter optimizer), I need build-essential and swig to be added to the docker image. They have to be installed with apt-get, not pip. See https://automl.github.io/SMAC3/stable/installation.html.
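The image change would be something like this (the pip package name `smac` is per SMAC3's install docs; treat it as an assumption):

```dockerfile
# SMAC3 needs a compiler toolchain and swig, which pip cannot provide
RUN apt-get update && apt-get install -y build-essential swig
RUN pip3 install smac
```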
Perhaps `launch.py` can ask for the location of a `requirements.txt` for personal packages that are installed via `pip install --user -r ~/path/to/requirements.txt`? However, this will place some of the onus on users to maintain their own `requirements.txt` file for their project.
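One possible shape for this in `launch.py` (a sketch; the function and its wiring are hypothetical):

```python
import subprocess

def install_user_requirements(container_name, req_path):
    """Install a user's personal packages inside their running container."""
    subprocess.run(
        ["docker", "exec", container_name,
         "pip3", "install", "--user", "-r", req_path],
        check=True,
    )
```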
rsample
recipes
caret
keras
h2o
broom
devtools
usethis
psych
ranger
lubridate
reticulate
ModelMetrics
tensorflow
jsonlite
rvest
rmarkdown
knitr
`numba` works fine for CPU. When trying to run an example GPU command, I receive the following:

```
NvvmSupportError: libNVVM cannot be found. Do `conda install cudatoolkit`: library nvvm not found
```
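Once `cudatoolkit` is installed, a quick smoke test (a sketch using numba's CUDA JIT) should confirm whether the GPU path works:

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1.0

a = np.zeros(32, dtype=np.float32)
d_a = cuda.to_device(a)
add_one[1, 32](d_a)            # 1 block of 32 threads
print(d_a.copy_to_host()[:5])  # expect [1. 1. 1. 1. 1.]
```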
Getting an error while trying to load the basic English model (module):

```python
nlp = spacy.load('en')
```
Output:
```
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-21-76abff010c5f> in <module>()
----> 1 nlp = spacy.load('en')
/usr/local/lib/python3.5/dist-packages/spacy/__init__.py in load(name, **overrides)
13 if depr_path not in (True, False, None):
14 deprecation_warning(Warnings.W001.format(path=depr_path))
---> 15 return util.load_model(name, **overrides)
16
17
/usr/local/lib/python3.5/dist-packages/spacy/util.py in load_model(name, **overrides)
117 elif hasattr(name, 'exists'): # Path or Path-like to model data
118 return load_model_from_path(name, **overrides)
--> 119 raise IOError(Errors.E050.format(name=name))
120
121
OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
```
Seems like `python -m spacy download en` will solve the problem. See explosion/spaCy#1721 or similar github issues.
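If that fix holds, it could be baked into the image so users don't hit this (a one-line assumption on top of the existing spaCy install):

```dockerfile
# download the small English model at build time so spacy.load('en') works
RUN python3 -m spacy download en
```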
In order to get rid of all this `screen` business, I propose configuring the launch app to be served by apache. Because of docker permissions, apache will need to be configured such that the process runs as an appropriately privileged user.
Because the images for dev, nogpu_dev, and prod are all identical, the script that collects the particular type of image via simple docker commands cannot tell which type was actually selected to launch that image. For example, if you launched an image of type `eri_nogpu_dev`, the tags associated with that image are `['eri_dev:latest', 'eri_nogpu_dev:latest', 'eri_prod:latest']`. To make the list usable, we need a way to disambiguate them; currently we will always see only `eri_dev:latest` as the tag (and hence an `eri_gpu` type).
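One way to disambiguate (a sketch using docker-py; the label name is made up) is to record the chosen type as a container label at launch time and read that back instead of inspecting image tags:

```python
import docker

client = docker.from_env()

# at launch: record which image type the user actually picked
container = client.containers.run(
    "eri_nogpu_dev:latest",
    detach=True,
    labels={"eri.image_type": "eri_nogpu_dev"},
)

# in active_eri_images: report the recorded type, not the tag list
for c in client.containers.list():
    print(c.name, c.labels.get("eri.image_type"))
```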
I'm not sure if this is a Joseph problem or a me problem (as in, what should I expect?).
The URL eri-gpu:8888 is not valid when connected via VPN; it needs the FQDN (eri-gpu.cho.elderresearch.com).
When running the following code:

```python
from keras.utils import plot_model
plot_model(model, to_file='model.png')
```
I receive the following error:

```
OSError: pydot failed to call GraphViz. Please install GraphViz (https://www.graphviz.org/) and ensure that its executables are in the $PATH.
```
From what I've read, GraphViz has to be installed with `apt-get`, not just with `pip`.
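The image change would presumably be:

```dockerfile
# the graphviz system package provides the `dot` executable pydot shells out to
RUN apt-get update && apt-get install -y graphviz \
 && pip3 install pydot
```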
In somewhat limited experience, I've found that one of the most important features of the DL libraries has been the caching / model checkpointing components. For example, shipping a trained model requires, at some point, persisting at least the weights, if not a pickled representation of the entire model. In practice I've always done this by creating a system-wide cache location a la `/cache` or `/var/cache/eri`, though obviously the exact path isn't that important. I would argue this is not stuff we should be putting in the `/data` directory, simply on the basis of separation of concerns -- what we are putting there are artifacts and products rather than inputs. For space constraint reasons, I propose we basically repeat the `/data` setup with a new directory for caching results.
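To make the intent concrete, a checkpointing sketch against the proposed path (assuming a compiled keras `model` and training arrays already exist; the project subdirectory is illustrative):

```python
from keras.callbacks import ModelCheckpoint

# persist weights per epoch under the proposed system-wide cache location
ckpt = ModelCheckpoint("/cache/my_project/weights.{epoch:02d}.h5",
                       save_weights_only=True)
model.fit(x_train, y_train, epochs=10, callbacks=[ckpt])
```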
Upgrading to CUDA 10 requires changing our driver version. See this answer on the nvidia-docker repo NVIDIA/nvidia-docker#829 (comment)
being used by lkent for her bench project
Users may need command line access to git. Install git for the deep-learning images.
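Presumably just:

```dockerfile
RUN apt-get update && apt-get install -y git
```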
Apache MXNet is highly scalable, has better CPU and memory utilization, and is written directly in C++. In most benchmarking papers I've read, MXNet beats TensorFlow, PyTorch, and Caffe2 in terms of performance. The API looks very much like PyTorch.
```
/usr/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
make: *** [libdarknet.so] Error 1
```

Apparently, the libcuda library can't be found.
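One common workaround (an assumption; the stub path depends on where CUDA is installed) is to point the linker at the driver stub that ships with the toolkit:

```sh
# libcuda.so normally comes from the driver; the toolkit's stub can
# satisfy the linker at build time
export LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LIBRARY_PATH
make
```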
I've forgotten the password for my no_gpu instance, please kill it
`tensorflow` has separated out the examples and tutorials files used throughout its documentation into a single separately-versioned github repository. It is sizable enough that it doesn't make sense for individual users to keep their own copies, but also important enough that multiple users will likely want it at some point. This would be a prime target for something that belongs in a `/shared` directory. This may be part of a bigger project in which we rename the `/data` directory to `/shared` and then start creating conceptually isolated subdirectories (e.g. `/shared/data`, `/shared/documentation`).
On the ERI-GPU front-end, any user can click on any launched "rstudio url" and make changes to the running rstudio environment
Containers have Python 3.5.
Title says it all.
Supposedly, `numba`'s GPU support was sufficient to significantly accelerate network graph operations in the latest Belair HPC deliverable. It would be nice to have it available to experiment with.
This could easily be done right now using `cron`, but probably belongs in some more formal CI solution.
Do you want to request a feature or report a bug?
Bug
What is the current behavior?
Any user can click on a running "rstudio url", view the contents of the rstudio session, and make changes.
If the current behavior is a bug, please provide the steps to reproduce and if possible a minimal demo of the problem.
What is the expected behavior?
RStudio URLs should be password protected the same way the jupyter notebooks are.
What is the motivation / use case for changing the behavior?
security
Other information (e.g. detailed explanation, stacktraces, related issues, suggestions how to fix, links for us to have context, eg. stackoverflow, gitter, etc)
future
digest
globals
listenv
parallel
I'm interested in being able to launch an RStudio non-GPU instance to leverage the machine's cores/RAM. At this time, it looks like only jupyter notebook is available. For the R-heads out there.
Currently we are unable to isolate GPUs in our containers. This is done relatively easily with the `nvidia` docker container images or things which have inherited from those (such as `tensorflow`'s gpu image), e.g.

```sh
NV_GPU=1 nvidia-docker run --rm -it -p 8888:8888 tensorflow/tensorflow:latest-gpu nvidia-smi
```

We need to be able to do the same.
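If the images have the nvidia runtime hooks available (an assumption), the plain-docker equivalent would be to mask devices with the environment variable the runtime honors:

```sh
# only GPU 1 is visible inside the container
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 \
    --rm -it -p 8888:8888 tensorflow/tensorflow:latest-gpu nvidia-smi
```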
This has not been resolved; we are getting basically this same error. It appears that the MS SQL ODBC drivers are not installed on the container. FYI, I have installed them on the root machine because we needed it ASAP, but that has not worked.
Originally posted by @oersoy1 in #52 (comment)
Success is:
So we will have 3 containers that have GPU access, available for training models.
This dev/prod split was a good idea, but in the end it did not matter. Let me know what you think.
My thinking is that if you are learning or do not have a lot of data to train on, use the single-GPU instances if they are available. If you are working with a lot of data or client data, use the multiple-GPU instance and run the algorithm on 2 GPUs.
We could also have a 4-GPU instance for really serious work, i.e. mountains of text data. But that would mean no one else can work on it. We can say you need to get permission from the admins to make sure that will be OK.
Do you want to request a feature or report a bug?
A Bug
What is the current behavior?
Even though the user creates a container with one or more GPUs attached, the DL framework (whether it is TF or PyTorch) does not see the GPU. We have duplicated this problem in both frameworks. The user tries to train a model on the GPU and all CPU cores are utilized instead.
If the current behavior is a bug, please provide the steps to reproduce and if possible a minimal demo of the problem.
```python
import torch
torch.cuda.device_count()
```

returns 0
What is the expected behavior?
It should return the number of attached GPUs; in this case my container has 2, so it should return 2.
What is the motivation / use case for changing the behavior?
Other information (e.g. detailed explanation, stacktraces, related issues, suggestions how to fix, links for us to have context, eg. stackoverflow, gitter, etc)
This may be related to the new CUDA 10 upgrade. When I installed the new GPU cards on the machine 2 months ago, I tested the frameworks and everything worked fine.
See here for more information on the Pytorch multi gpu processing.
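A quick way to check the CUDA 10 theory (a sketch) is to compare what pytorch was built against with what the container can see:

```python
import torch

print(torch.__version__)          # installed pytorch build
print(torch.version.cuda)         # CUDA version pytorch was compiled against
print(torch.cuda.is_available())  # False on a driver/runtime mismatch
print(torch.cuda.device_count())  # should match the GPUs attached (here, 2)
```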
our goal is to have one area where all users can store data in a shared space, and to administer that volume
Relevant error here:
```
Epoch number 0
Current loss 2.129094362258911
Epoch number 1
Current loss 1.1615623235702515
Epoch number 2
Current loss 1.1757559776306152
Epoch number 3
Current loss 1.1106539964675903
Epoch number 4
Current loss 1.1064958572387695
Epoch number 5
Current loss 1.1496989727020264
Epoch number 6
Current loss 1.1117830276489258
Epoch number 7
Current loss 1.1182820796966553
Epoch number 8
Current loss 1.109954833984375
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-48-b071e9085377> in <module>()
1 for epoch in range(0, Config.train_number_epochs):
----> 2 for i, data in enumerate(train_dataloader,0):
3 img0, img1, label = data
4 img0, img1, label = img0.cuda(), img1.cuda(), label.cuda()
5 optimizer.zero_grad()
/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py in __next__(self)
278 while True:
279 assert (not self.shutdown and self.batches_outstanding > 0)
--> 280 idx, batch = self._get_batch()
281 self.batches_outstanding -= 1
282 if idx != self.rcvd_idx:
/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py in _get_batch(self)
257 raise RuntimeError('DataLoader timed out after {} seconds'.format(self.timeout))
258 else:
--> 259 return self.data_queue.get()
260
261 def __next__(self):
/usr/lib/python3.5/multiprocessing/queues.py in get(self)
341 def get(self):
342 with self._rlock:
--> 343 res = self._reader.recv_bytes()
344 # unserialize the data after having released the lock
345 return ForkingPickler.loads(res)
/usr/lib/python3.5/multiprocessing/connection.py in recv_bytes(self, maxlength)
214 if maxlength is not None and maxlength < 0:
215 raise ValueError("negative maxlength")
--> 216 buf = self._recv_bytes(maxlength)
217 if buf is None:
218 self._bad_message_length()
/usr/lib/python3.5/multiprocessing/connection.py in _recv_bytes(self, maxsize)
405
406 def _recv_bytes(self, maxsize=None):
--> 407 buf = self._recv(4)
408 size, = struct.unpack("!i", buf.getvalue())
409 if maxsize is not None and size > maxsize:
/usr/lib/python3.5/multiprocessing/connection.py in _recv(self, size, read)
377 remaining = size
378 while remaining > 0:
--> 379 chunk = read(handle, remaining)
380 n = len(chunk)
381 if n == 0:
/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py in handler(signum, frame)
176 # This following call uses `waitid` with WNOHANG from C side. Therefore,
177 # Python can still get and update the process status successfully.
--> 178 _error_if_any_worker_fails()
179 if previous_handler is not None:
180 previous_handler(signum, frame)
RuntimeError: DataLoader worker (pid 451) is killed by signal: Bus error.
```
Related issue on pytorch: pytorch/pytorch#2244. It appears that we need to increase the shared memory size when launching the containers by passing `--shm-size=8G` to `docker run`.
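For example (sketching against one of our image names):

```sh
nvidia-docker run --shm-size=8G -p 8888:8888 eri_dev:latest
```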
Can we get the OpenCV library (`cv2`) for Python added?
Can’t seem to create any files in Jupyter — “Permission denied: Untitled.ipynb” (`eri_nogpu_dev:latest`). For the same user (`tshafer`), other docker images work fine.
`pip install -U feather-format`
When a user launches a container, it would be good if we could have a user-level directory in which their notebooks live, perhaps in their home directory, and for their `jupyter notebook` server to start with that directory as its working directory. This is something we probably need to have at the container level, so it might need to be parameterized in the image itself, I think.
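One way to parameterize this (a sketch for `jupyter_notebook_config.py`; the directory name is illustrative):

```python
import os

# start each user's server in a per-user notebooks directory, assuming
# their home directory is mounted into the container
c.NotebookApp.notebook_dir = os.path.expanduser("~/notebooks")
```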
The default shell is currently `sh` for all users. Change the default shell to `bash`.
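In the image this would be something like (username illustrative):

```dockerfile
# create users with bash as their login shell...
RUN useradd -m -s /bin/bash someuser
# ...or switch an existing user
RUN chsh -s /bin/bash someuser
```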