ml-containers

Repository for building ML images at CoreWeave

Index

See the list of all published images.

Special PyTorch Images:

PyTorch Base Images

CoreWeave provides custom builds of PyTorch, torchvision, and torchaudio, tuned for our platform, in a single container image: ml-containers/torch.

Versions compiled against CUDA 11.8.0, 12.0.1, 12.1.1, and 12.2.2 are available in this repository, with two variants:

  1. base: Tagged as ml-containers/torch:a1b2c3d-base-....
    1. Built from nvidia/cuda:...-base-ubuntu22.04 as a base.
    2. Only includes essentials (CUDA, torch, torchvision, torchaudio), so it has a small image size, making it fast to launch.
  2. nccl: Tagged as ml-containers/torch:a1b2c3d-nccl-....
    1. Built from ghcr.io/coreweave/nccl-tests as a base.
    2. Ultimately inherits from nvidia/cuda:...-cudnn8-devel-ubuntu22.04.
    3. Larger, but includes development libraries and build tools such as nvcc, which are necessary for compiling PyTorch extensions.
    4. These PyTorch builds use component libraries optimized for the CoreWeave cloud; see coreweave/nccl-tests.
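As an illustration of the tag layout described above (the commit hash and suffix here are hypothetical, only the <commit>-<variant>-... pattern is taken from the docs), the variant can be read out of a tag string with ordinary shell tools:

```shell
# Hypothetical tag following the ml-containers/torch pattern above:
# <commit>-<variant>-<rest>. The specific values are made up.
tag="a1b2c3d-nccl-cuda12.2.2"

# The second dash-separated field is the variant ("base" or "nccl").
variant=$(echo "$tag" | cut -d- -f2)
echo "$variant"   # prints "nccl"
```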

Note

Most torch images have both a variant built on Ubuntu 22.04 and a variant built on Ubuntu 20.04.

  • CUDA 11.8.0 is an exception, and is only available on Ubuntu 20.04.
  • Ubuntu 22.04 images use Python 3.10.
  • Ubuntu 20.04 images use Python 3.8.
  • The base distribution is indicated in the container image tag.
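The distribution-to-Python mapping in the note above can be expressed as a small helper (hypothetical, for illustration only):

```shell
# Map an Ubuntu base version to the Python version listed in the note above.
python_for_ubuntu() {
  case "$1" in
    22.04) echo "3.10" ;;
    20.04) echo "3.8" ;;
    *)     echo "unknown" ;;
  esac
}

python_for_ubuntu 22.04   # prints "3.10"
```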

PyTorch Extras

ml-containers/torch-extras extends the ml-containers/torch images with a set of common PyTorch extensions:

  1. DeepSpeed
  2. FlashAttention
  3. NVIDIA Apex

Each is compiled specifically against the custom PyTorch builds in ml-containers/torch.

Both base and nccl editions of ml-containers/torch-extras are available, matching those of ml-containers/torch. The base edition retains a small image size: a multi-stage build keeps the CUDA development libraries required to compile the extensions out of the final image.
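A rough sketch of that multi-stage pattern follows; the image tags, package, and paths here are hypothetical placeholders, not the repository's actual build files:

```dockerfile
# Hypothetical sketch: compile extensions in an nccl-edition stage (which has
# nvcc and the CUDA development libraries), then copy only the installed
# Python packages into the slim base-edition image.
FROM ghcr.io/coreweave/ml-containers/torch:<nccl-tag> AS builder
RUN pip3 install --no-build-isolation deepspeed

FROM ghcr.io/coreweave/ml-containers/torch:<base-tag>
COPY --from=builder /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages
```

The final image never contains the build toolchain, only the compiled artifacts copied out of the builder stage.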

PyTorch Nightly

ml-containers/nightly-torch is an experimental, nightly release channel of the PyTorch Base Images in the style of PyTorch's own nightly preview builds, featuring the latest development versions of torch, torchvision, and torchaudio pulled daily from GitHub and compiled from source.

ml-containers/nightly-torch-extras is a version of PyTorch Extras built on top of the ml-containers/nightly-torch container images. These are not nightly versions of the extensions themselves, but rather match the extension versions in the regular PyTorch Extras containers.

⚠ The PyTorch Nightly containers are based on unstable, experimental preview builds of PyTorch, and should be expected to contain bugs and other issues. For more stable containers use the PyTorch Base Images and PyTorch Extras containers.

Organization

This repository contains multiple container image Dockerfiles; each is expected to be in its own folder, along with any other files needed for its build.

CI Builds (Actions)

The current CI builds are set up to run when changes to files in the respective folders are detected, so that only the changed container images are built. The actions are set up with one action per image, each utilizing the reusable base action build.yml. The reusable action accepts several inputs:

  • folder - the folder containing the Dockerfile for the image
  • image-name - the name to use for the image
  • build-args - arguments to pass to the Docker build
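A hypothetical per-image workflow calling the reusable action might look like the following; the trigger, paths, and values are illustrative, not taken from an actual workflow file in this repository:

```yaml
# Hypothetical caller workflow for one image's folder.
on:
  push:
    paths:
      - 'torch/**'   # build only when this image's folder changes

jobs:
  build-torch:
    uses: ./.github/workflows/build.yml
    with:
      folder: torch
      image-name: torch
      build-args: CUDA_VERSION=12.2.2
```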

Images built from the same source can share a single action, since the main reason for having multiple actions is to rebuild only the images that changed. A build matrix can be helpful in these cases: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs
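For example, a matrix over a reusable workflow could fan one action out across variants; the CUDA versions below come from the list earlier in this README, while everything else is illustrative:

```yaml
# Hypothetical build matrix over the reusable build.yml workflow.
jobs:
  build:
    strategy:
      matrix:
        cuda: ["11.8.0", "12.0.1", "12.1.1", "12.2.2"]
    uses: ./.github/workflows/build.yml
    with:
      folder: torch
      image-name: torch
      build-args: CUDA_VERSION=${{ matrix.cuda }}
```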

ml-containers's Issues

feature request: TransformerEngine in torch-extras

I'd like to try out TransformerEngine. The reason is that I profiled our Tensor Parallel layers and found that communication runs sequentially with computation rather than overlapping it, which makes them very slow as the number of TP shards increases. Megatron supports overlapping communication with computation, but it calls out to TransformerEngine to make it happen. I wasn't able to find any other way to overlap unless I used NVIDIA primitives.

I wasn't able to install TE in my own image, because the GitHub Actions run times out.
Installing the dev build worked with this command:
RUN export MAX_JOBS=4 && export NVTE_WITH_USERBUFFERS=1 && export NVTE_FRAMEWORK=pytorch && pip3 install -v git+https://github.com/NVIDIA/TransformerEngine.git@main
4 jobs was a total guess, and I'm not sure that NVTE_FRAMEWORK=pytorch does anything.

There's a catch here: TE 1.4, the latest stable release (published 5 days ago), pins FlashAttention<=2.4.2. The version cap was recently bumped to 2.5.6 (NVIDIA/TransformerEngine@965803c), but that probably won't reach a stable TE release for a while.

So if you add stable TE, FlashAttention gets a version restriction. On the other hand, the nightly torch-extras image had a broken PyTorch both (2/2) times I tried it. It's up to you which image should get TE, and which version of TE.

The image I normally use is torch-extras, NCCL, CUDA 12.2.2 (latest), Ubuntu 22.04.

Scaling up self-hosted GitHub Actions runners

ml-containers uses a self-hosted GitHub Actions runner to build container images through CI. It is currently only capable of handling one job at a time, sequentially. As a consequence, complex builds with many variations such as ml-containers/torch are taking up to 7 hours per commit to finish their CI.

Very heavy commits slow down development, as this makes iteratively fixing bugs in a CI deployment impractical.

Either dedicating more resources to keep runners available, or implementing some form of autoscaling such as with actions-runner-controller, may improve the situation.

Warning from `requests` package when using any PyTorch application

Testcase:

  1. Clone https://github.com/pytorch-labs/gpt-fast
  2. Run this in the gpt-fast folder:
export MODEL_REPO=openlm-research/open_llama_7b
./scripts/prepare.sh $MODEL_REPO
python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"

Result:

/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "

I solved this by running pip3 install --upgrade requests, which is the fix Google suggests.

Zero urgency; our image already fixes this, so it's not affecting anything.

[feature-request] Support for JAX container

Concise Description:
I'd like to use JAX for distributed training of LLMs. Also, the new release of Keras supports JAX as a backend alongside TF.

Describe the solution you'd like
I'd like either a separate JAX container or jaxlib included in a TF container since the TF ecosystem (data loading, serving, etc) supports JAX.

Describe alternatives you've considered
I could install JAX on top of the PyT container.

Triton may have been broken for a while in the nightly builds

I receive this error:

[rank7]:     from triton.compiler.compiler import triton_key
[rank7]: torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[rank7]: ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py)

Same problem as this one: pytorch/pytorch#123042

It's been present since before Mar 14. Given the solution in that thread, I suspect PyTorch's interaction with Triton may have changed. The non-nightly torch-extras build is fine.

Unfortunately, I have no testcase; running a simple torch.compile works fine. Maybe it'll solve itself when PyTorch 2.3 comes out; I'll bump this issue if it's still present when 2.3 lands in torch-extras.

bad :buildcache tag for extras images

I haven't checked any of the other images, but here, if you actually try to pull the "buildcache" tag, it errors. That tag should either be available or not listed in the repo. Normally I pin my versions, but I'm just trying to pull something with a good PyTorch install to run a terminal session; I understand the reasons not to use :latest or anything equivalent, but in that case it really shouldn't be listed.
