Translations: 한국어
A new MLOps paradigm for deep learning development is proposed using Docker Compose with the aim of providing reproducible and easy-to-use interactive development environments for deep learning practitioners. Hopefully, the methods presented here will become best practice in both academia and industry.
If this is your first time using this project, follow these steps:
- Install the NVIDIA CUDA driver appropriate for the target hardware. The CUDA toolkit is not necessary. If the driver has already been installed, check that the installed version is compatible with the target CUDA version. CUDA driver version mismatch is the single most common issue for new users. See the compatibility matrix for compatible versions of the CUDA driver and CUDA Toolkit.
- Install Docker if it is not installed, and update to a recent version compatible with Docker Compose V2. Docker incompatibility with Docker Compose V2 is a common issue as well. Note that Windows users may use WSL (Windows Subsystem for Linux). Cresset has been tested on Windows 11 WSL with the Windows CUDA driver and Docker Desktop. There is no need to install a separate WSL CUDA driver or Docker for Linux inside WSL. N.B. Windows Security real-time protection causes significant slowdown if enabled. Disable any active antivirus programs on Windows for best performance.
- Linux host users should install Docker Compose V2 for Linux as described in https://docs.docker.com/compose/cli-command/#install-on-linux. Visit the website for the latest installation information. Installation does not require `root` permissions. Please check the version and architecture tags in the URL before installing. The following commands will install Docker Compose V2 (v2.3.4, Linux x86_64) for a single user on Linux hosts. Visit https://github.com/docker/compose/releases to find the latest versions.

```shell
# WSL users should instead enable "Use Docker Compose V2" on Docker Desktop for Windows.
mkdir -p ~/.docker/cli-plugins/
curl -SL https://github.com/docker/compose/releases/download/v2.3.4/docker-compose-linux-x86_64 -o ~/.docker/cli-plugins/docker-compose
chmod +x ~/.docker/cli-plugins/docker-compose
```
- Run `make env` in the terminal at the project root to create a basic `.env` file. The `.env` file provides environment variables for `docker-compose.yaml`, allowing different users and machines to set their own variables as required. The `.env` file is excluded from version control via `.gitignore` by design.
To build from source, set
BUILD_MODE=include
and set the CUDA Compute Capability (CCA) of the target hardware. Visit https://developer.nvidia.com/cuda-gpus#compute to find compute capabilities of NVIDIA GPUs. Visit https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities for an explanation of compute capability and its relevance. Note that the Docker cache will save previously built binaries if the given configurations are identical. -
Read the
docker-compose.yaml
file to fill in extra variables in.env
. Also, feel free to editdocker-compose.yaml
as necessary by changing session names, hostnames, etc. for different projects and configurations. Thedocker-compose.yaml
file provides reasonable default values but these can be overridden by values specified in the.env
file.
Example `.env` file for a user with username `me`, user id `1000`, and group id `1000`:
```shell
# Generated automatically by `make env`.
GID=1000
UID=1000
IMAGE_NAME_FULL=full-me

# Fill in the below manually.
# `*_TAG` variables are used only if `BUILD_MODE=include`.
BUILD_MODE=include   # Whether to build PyTorch from source.
CCA=8.6              # Compute capability. 8.6 corresponds to the RTX 3090 (the A100 is 8.0).
# CCA="7.5 8.6"      # Example for multiple compute capabilities. Makes the build slower.
PYTORCH_VERSION_TAG=v1.11.0     # Any `git` branch or tag name can be used.
TORCHVISION_VERSION_TAG=v0.12.0
TORCHTEXT_VERSION_TAG=v0.12.0
TORCHAUDIO_VERSION_TAG=v0.11.0

# Environment configurations.
LINUX_DISTRO=ubuntu
DISTRO_VERSION=20.04
CUDA_VERSION=11.5.1  # Must be compatible with the hardware and the CUDA driver.
CUDNN_VERSION=8
PYTHON_VERSION=3.9
MAGMA_VERSION=115    # Must match the CUDA version.
MKL_MODE=include     # For Intel CPUs.
```
- Edit the requirements in `reqs/apt-train.requirements.txt` and `reqs/pip-train.requirements.txt`. These contain the project package dependencies. The `apt` requirements file is designed to resemble an ordinary Python `requirements.txt` file.
Run
make up
to start the service. IfBUILD_MODE=include
, this may take a while. Themake
commands are defined in theMakefile
and target thefull
service by default. Please read theMakefile
for implementation details and usage. -
Run
make exec
to enter the interactive container environment. Then start coding.
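Putting the steps above together, a typical first-time setup looks like the following terminal session (`nano` stands in for any text editor; the `make` targets are those defined in the project's `Makefile`):

```shell
make env      # Create a basic `.env` file at the project root.
nano .env     # Fill in BUILD_MODE, CCA, version tags, etc. as described above.
make up       # Build the image and start the `full` service. Slow if BUILD_MODE=include.
make exec     # Enter the interactive container environment and start coding.
```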
The purpose of this section is to introduce a new paradigm for deep learning development. I hope that Cresset, or at least the ideas behind it, will eventually become best practice for small to medium-scale deep learning research and development.
Developing in local environments with `conda` or `pip` is commonplace in the deep learning community. However, this risks rendering the development environment, and the code meant to run on it, unreproducible. This state of affairs is a serious detriment to scientific progress that many readers of this article will have experienced firsthand.
Docker containers are the standard method for providing reproducible programs across different computing environments. They create isolated environments where programs can run without interference from the host or from one another. See https://www.docker.com/resources/what-container for details.
But in practice, Docker containers are often misused. Containers are meant to be transient, and best practice dictates that a new container be created for each run. However, this is very inconvenient for development, especially for deep learning applications, where new libraries must constantly be installed and bugs are often only evident at runtime. This leads many researchers to develop inside interactive containers.
Docker users often have `run.sh` files with commands such as `docker run -v my_data:/mnt/data -p 8080:22 --name my_container -t my_image:latest /bin/bash` (look familiar, anyone?) and use SSH to connect to running containers.
VSCode even provides a remote development mode to code inside containers.
The problem with this approach is that these interactive containers become just as unreproducible as local development environments. A running container cannot connect to a new port or attach a new volume. But if the computing environment within the container was created over several months of installs and builds, the only way to keep it is to save the container as an image and create a new container from the saved image. After a few iterations of this process, the resulting images become bloated and no less scrambled than the local environments that they were meant to replace.
Problems become even more evident when preparing for deployment. MLOps, defined as a set of practices that aims to deploy and maintain machine learning models reliably and efficiently, has gained enormous popularity of late as many practitioners have come to realize the importance of continuously maintaining ML systems long after the initial development phase ends.
However, bad practices such as those mentioned above mean that much coffee has been spilled turning research code into anything resembling a production-ready product. Often, even the original developers cannot recreate the same model after a few months. Many firms thus have entire teams dedicated to model translation, a huge expenditure.
To alleviate these problems, I propose the use of Docker Compose as a simple MLOps solution. Using Docker and Docker Compose, the entire training environment can be reproduced. Compose has not yet caught on in the deep learning community, possibly because it is usually advertised as a multi-container solution. This is a misunderstanding as it can be used for single-container development just as well.
A `docker-compose.yaml` file is provided for easy management of containers. Using the provided `docker-compose.yaml` file will create an interactive environment, providing a programming experience very similar to using a terminal on a remote server. Integrations with popular IDEs (PyCharm, VSCode) are also available. Moreover, it allows the user to specify settings for both build and run, removing the need to manage the environment with custom shell scripts. Connecting a new volume or port is as simple as removing the current container, adding a line in the `docker-compose.yaml` file, then running `make up` to create a new container from the same image. Build caches allow new images to be built very quickly, removing another barrier to Docker adoption: the long initial build time.
For more information on Compose, visit the documentation.
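As a sketch of the idea, adding a dataset volume and a port is a two-line change to the service definition in `docker-compose.yaml`. The paths and port below are hypothetical examples, not part of the project's actual configuration:

```yaml
services:
  full:
    ports:
      - "6006:6006"              # Hypothetical TensorBoard port.
    volumes:
      - /data/imagenet:/mnt/data # Hypothetical dataset mount on the host.
```

After editing, running `make up` again recreates the container from the same cached image, so the change takes effect in seconds.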
Docker Compose can also be used directly for deployment, including on the cloud, which is useful for small-scale deployments. See https://www.compose-spec.io. If and when large-scale deployment using Kubernetes becomes necessary, using reproducible Docker environments from the very beginning will accelerate the development process and smooth the path to MLOps adoption. Accelerating time-to-market by streamlining the development process is a competitive edge for any firm, whether lean startup or tech titan.
With luck, the techniques I propose here will enable the deep learning community to "write once, train anywhere". But even if I fail to persuade the majority of users of the merits of my method, I may still spare many a hapless grad student the Sisyphean labor of setting up their `conda` environment, only to have it crash and burn right before their paper submission is due.
Docker Compose is superior to using custom shell scripts for each environment. Not only does it gather all variables and commands for both build and run into a single file, but its native integration with Docker means that it makes complicated Docker build/run setups simple to implement and use.
I wish to emphasize that using Docker Compose this way is a general-purpose technique that does not depend on anything about this project. As an example, an image from the NVIDIA NGC PyTorch repository has been used as the base image in `ngc.Dockerfile`. The NVIDIA NGC PyTorch images contain many optimizations for the latest GPU architectures and provide a multitude of pre-installed machine learning libraries. For those starting new projects, using the latest NGC image is recommended.
To use the NGC images, use the following commands:

```shell
docker compose up -d ngc
docker compose exec ngc zsh
```

The only difference from the previous example is the session name.
The Docker Compose container environment can be used with popular Python IDEs, not just in the terminal. PyCharm and Visual Studio Code, both very popular in the deep learning community, are compatible with Docker Compose.
- If you are using a remote server, first create a Docker context to connect your local Docker with the remote Docker.
- PyCharm (Professional only): Both Docker and Docker Compose are natively available as Python interpreters. See the tutorials for Docker and Compose for details. JetBrains Gateway can also be used to connect to running containers. The JetBrains Fleet IDE, with much more advanced features, will become available in early 2022. N.B. PyCharm Professional and other JetBrains IDEs are available free of charge to anyone with a valid university e-mail address.
- VSCode: Install the Remote Development extension pack. See the tutorial for details.
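For the remote-server case above, creating and using a Docker context can be sketched as follows. The context name `remote` and the SSH address are placeholders; substitute your own:

```shell
# Point the local Docker CLI at the remote daemon over SSH.
docker context create remote --docker "host=ssh://me@server.example.com"
docker context use remote   # Subsequent `docker`/`docker compose` commands run remotely.
docker context use default  # Switch back to the local daemon when done.
```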
- Connecting to a running container via `ssh` will remove all variables set by `ENV`. This is because `sshd` starts a new environment, wiping out all previous variables. Using `docker`/`docker compose` to enter containers is strongly recommended.
- WSL users using Compose should disable `ipc: host`. WSL cannot use this option.
- `torch.cuda.is_available()` will return a `... UserWarning: CUDA initialization:...` error, or the image will simply not start, if the CUDA driver on the host is incompatible with the CUDA version of the Docker image. Either upgrade the host CUDA driver or downgrade the CUDA version of the image. Check the compatibility matrix to see whether the host CUDA driver is compatible with the desired version of CUDA. Also check that the CUDA driver has been configured correctly on the host. The CUDA driver version can be found using the `nvidia-smi` command.
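When debugging the CUDA compatibility issue above, the following checks may help. The service name `full` is the project's default target; the Python one-liner assumes PyTorch is installed in the container:

```shell
# Show the host driver version and the maximum CUDA version it supports.
nvidia-smi
# Check whether PyTorch inside the container can initialize CUDA.
docker compose exec full python -c "import torch; print(torch.cuda.is_available())"
```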
- MORE STARS. No Contribution Without Appreciation!
- A method of building Magma from source would be appreciated. Currently, Cresset depends on the `magma-cudaXXX` package provided in the PyTorch channel of Anaconda.
- Bug reports are welcome. Only the latest versions have been tested rigorously. Please raise an issue if there are any versions that do not build properly. However, please check that your host Docker, Docker Compose, and especially the NVIDIA driver are up-to-date before doing so. Also note that some combinations of PyTorch version and CUDA environment may simply be impossible to build because of issues in the underlying source code.
- Translations into other languages and updates to existing translations are welcome. Please make a separate `LANG.README.md` file and create a PR.